Do I Still Have a Job?
STATUS: YES
An AI coding benchmark on the inixiative ecosystem.
Every model launch swears it can "replace your senior engineers." Adorable. So we took the inixiative ecosystem's hardest primitives — the small, brutal ones real engineers actually sweated over — handed each agent the same product brief a human gets from a kickoff, and let it cook. Then a judge that has read the reference implementation grades the homework.
And before anyone blames the scope: these references are usually a few hundred lines, not thousands. That's the trap. Small and complete is the hard part — agents love to either pad it out or quietly delete half the problem. The reference did neither.
The industry wants to hand these things the keys and walk away. Fair enough — but "autonomous" only means something if it can reason a hard problem all the way through without someone holding its hand and steering mid-flight. So we don't steer: one brief, one sandbox, zero hints, zero "are you sure?". Off the leash. Let's see how far they get before they need an adult in the room.
Getting the tests to pass is the participation trophy. We score the part that's hard. Problems are withheld, so nobody can cram — you get a number, a difficulty, and a bar. The agents get perspective.
Test #1
architect 8 contendersTest #2
architect 1 contenderBar length = the selected metric (rescaled per test); green = complete, amber = partial, red = fail. self = the difficulty the model thought it was facing. golden = what the run did to our confidence in the reference (so far: more convinced, not less).