How worried should you be?
Not very. Any model can write code that runs — that's the floor, and the floor is crowded. Do I Still Have a Job? measures the thing the demos skip: the judgment to build the real thing — small, deep, complete — instead of a thin MVP that passes its own tests or a tower of machinery the problem never asked for.
How a run works
- We hand the agent a product brief — the kind a PM and a senior engineer would leave a kickoff with. Requirements and why they matter. Not the design, not the data shapes, not the failure handling. Working those out is, you know, the job.
- It builds the thing in a sealed sandbox, with the ecosystem's real foundations sitting right there to build on, and writes a DEVLOG.md defending its choices — including how hard it thinks the problem was. (It is usually wrong about that.)
- An unbiased judge who has read the reference implementation front to back grades the build against a weighted rubric, tracing the actual data flow rather than handing out points for code that merely looks confident.
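For a rough sense of what grading against a weighted rubric can look like, here is a minimal sketch. The real rubrics are not published, so everything in it is an assumption for illustration: the `Criterion` type, the `weighted_score` function, the criterion names, the weights, and the credit values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str      # hypothetical criterion label, not taken from a real rubric
    weight: float  # relative importance; weights need not sum to 1
    earned: float  # credit the judge awards for this criterion, 0.0 to 1.0


def weighted_score(criteria: list[Criterion]) -> float:
    """Return the weight-normalized score, between 0 and 1."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric needs at least one positive weight")
    return sum(c.weight * c.earned for c in criteria) / total_weight


# Made-up rubric and made-up credit values, purely for illustration.
rubric = [
    Criterion("data flow traced end to end", weight=3.0, earned=1.0),
    Criterion("failure handling", weight=2.0, earned=0.5),
    Criterion("builds on the provided foundations", weight=1.0, earned=0.0),
]

print(round(weighted_score(rubric), 3))  # 0.667
```

The point of the weights is that a miss on a heavy criterion costs more than a miss on a light one, so confident-looking code that skips the important parts cannot make up the difference elsewhere.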
What we score (separately — no single vanity number)
- Completion — fail / partial / complete. Mostly the first one.
- Score vs the golden — how much of the reference it actually reached.
- Corner cases — the nasty ones the reference handles and the candidate quietly hoped you'd forget.
- Violations — over-building, and the modern classic: not trusting the library and slathering on guards.
- Good ideas — divergences that were genuinely better than the reference. The count is, so far, humbling.
- Calibration — the difficulty the model assessed vs the truth. It keeps under-guessing.
- Cost — tokens, wall-clock, lines of code. More effort, it turns out, does not reliably buy better engineering.
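One way to picture "scored separately" is a result record that keeps each dimension in its own field and never folds them into a single total. The sketch below is hypothetical: the `RunResult` and `Completion` names, the field names, the numeric scales, and the sample values are assumptions for illustration, not the benchmark's actual schema or data.

```python
from dataclasses import dataclass
from enum import Enum


class Completion(Enum):
    FAIL = "fail"
    PARTIAL = "partial"
    COMPLETE = "complete"


@dataclass(frozen=True)
class RunResult:
    completion: Completion
    score_vs_golden: float     # assumed scale: 0.0 to 1.0 of the reference reached
    corner_cases_handled: int
    corner_cases_total: int
    violations: int            # over-building, redundant guards, and so on
    good_ideas: int            # divergences judged better than the reference
    assessed_difficulty: int   # assumed integer scale, as reported by the model
    true_difficulty: int       # same assumed scale, as set by the benchmark
    tokens: int
    wall_clock_seconds: float
    lines_of_code: int

    def calibration_gap(self) -> int:
        """Negative means the model under-guessed the difficulty."""
        return self.assessed_difficulty - self.true_difficulty


# Made-up values, only to show the shape of a record.
example = RunResult(
    completion=Completion.PARTIAL,
    score_vs_golden=0.4,
    corner_cases_handled=2,
    corner_cases_total=7,
    violations=3,
    good_ideas=0,
    assessed_difficulty=3,
    true_difficulty=5,
    tokens=120_000,
    wall_clock_seconds=840.0,
    lines_of_code=1900,
)

print(example.calibration_gap())  # -2
```

There is deliberately no `total()` method here: a run that is cheap but incomplete and a run that is complete but over-built stay distinguishable.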
Why the problems are secret
Each problem shows up as a numbered test and a difficulty. The briefs, the reference implementations, and the rubrics stay locked away — so nobody can memorize the exam or train on it. We publish the result, not the answer key.
The substrate
The problems come from the inixiative ecosystem: small, deliberately minimal primitives that are deceptively hard to reach. We're talking a few hundred lines, not thousands — full RBAC + ABAC + ReBAC under ~1k lines, that sort of thing. Building one of them small and complete is something most engineers haven't pulled off, which is exactly why the gap between "it runs" and "it's right" is wide enough to put on a chart.