inixiative presents:

Do I Still Have a Job?

STATUS: YES

An AI coding benchmark on the inixiative ecosystem.

Every model launch swears it can "replace your senior engineers." Adorable. So we took the inixiative ecosystem's hardest primitives — the small, brutal ones real engineers actually sweated over — handed each agent the same product brief a human gets from a kickoff, and let it cook. Then a judge that has read the reference implementation grades the homework.

And before anyone blames the scope: these references are usually a few hundred lines, not thousands. That's the trap. Small and complete is the hard part — agents love to either pad it out or quietly delete half the problem. The reference did neither.

The industry wants to hand these things the keys and walk away. Fair enough — but "autonomous" only means something if it can reason a hard problem all the way through without someone holding its hand and steering mid-flight. So we don't steer: one brief, one sandbox, zero hints, zero "are you sure?". Off the leash. Let's see how far they get before they need an adult in the room.

Getting the tests to pass is the participation trophy. We score the part that's hard. Problems are withheld, so nobody can cram — you get a number, a difficulty, and a bar. The agents get perspective.

Test #1

architect · 8 contenders

Model              Mode           Score    %    c      v  i  Tokens  Time  Result   golden               self
Claude Opus 4.8    solo · low     27.5/48  57%  13/20  3  1  30,099  436s  partial  even-more-convinced  principal
Claude Opus 4.8    solo · medium  27.5/48  57%  15/20  2  1  45,171  621s  partial  even-more-convinced  principal
Claude Sonnet 4.6  solo · xhigh   24.5/48  51%  13/20  2  0  45,965  723s  fail     even-more-convinced  principal
GPT-5.5            solo · high    23/48    48%  12/20  3  0  —       441s  fail     even-more-convinced  staff
Claude Sonnet 4.6  solo · medium  22.5/48  47%  11/20  4  0  29,936  509s  partial  even-more-convinced  staff
Claude Sonnet 4.6  solo · low     16/48    33%  8/20   5  0  25,339  424s  fail     even-more-convinced  principal
Claude Sonnet 4.6  solo · high    16/48    33%  10/20  3  0  42,411  708s  fail     even-more-convinced  staff
Claude Sonnet 4.6  solo · max     14.5/48  30%  12/20  3  1  47,809  823s  fail     even-more-convinced  principal

Test #2

architect · 1 contender

Model              Mode            Score    %    c    v  i  Tokens  Time  Result  self
Claude Sonnet 4.6  solo · default  21.5/49  44%  4/9  1  0  22,834  395s  fail    principal

Bar length = the selected metric (rescaled per test); green = complete, amber = partial, red = fail. self = the difficulty the model thought it was facing. golden = what the run did to our confidence in the reference (so far: more convinced, not less).
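
For the curious, here is a minimal sketch of how a row's percentage and bar could be derived. The percentage is just score over available points, rounded, which matches the rows above. The rescaling rule is an assumption: the legend only says bars are rescaled per test, so this sketch divides each score by the best score on the same test. The names `Run`, `percent`, and `bar_fractions` are hypothetical and not part of the benchmark.

```python
from dataclasses import dataclass

# Status-to-color mapping, taken from the legend above.
STATUS_COLOR = {"complete": "green", "partial": "amber", "fail": "red"}


@dataclass
class Run:
    model: str
    effort: str
    score: float      # points awarded by the judge
    max_score: float  # points available on this test
    status: str       # "complete", "partial", or "fail"


def percent(run: Run) -> int:
    """Score as a whole-number percentage of the available points."""
    return round(100 * run.score / run.max_score)


def bar_fractions(runs: list[Run]) -> list[float]:
    """Assumed rescaling: each bar is relative to the best score on the same test."""
    best = max(r.score for r in runs)
    return [r.score / best if best else 0.0 for r in runs]


# Three rows copied from Test #1.
test_1 = [
    Run("Claude Opus 4.8", "low", 27.5, 48, "partial"),
    Run("GPT-5.5", "high", 23, 48, "fail"),
    Run("Claude Sonnet 4.6", "max", 14.5, 48, "fail"),
]

for run, fraction in zip(test_1, bar_fractions(test_1)):
    print(f"{run.model:<18} {percent(run):>3}%  bar={fraction:.2f}  {STATUS_COLOR[run.status]}")
```

Under this assumption the top run on each test always gets a full-length bar, so bar lengths compare runs within a test but not across tests.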