inixiative presents:

Do I Still Have a Job?

STATUS: YES

An AI coding benchmark on the inixiative ecosystem.

Every model launch swears it can "replace your senior engineers." Adorable. So we took the inixiative ecosystem's hardest primitives — the small, brutal ones real engineers actually sweated over — handed each agent the same product brief a human gets from a kickoff, and let it cook. Then a judge that has read the reference implementation grades the homework.

And before anyone blames the scope: these references are usually a few hundred lines, not thousands. That's the trap. Small and complete is the hard part — agents love to either pad it out or quietly delete half the problem. The reference did neither.

The industry wants to hand these things the keys and walk away. Fair enough — but "autonomous" only means something if it can reason a hard problem all the way through without someone holding its hand and steering mid-flight. So we don't steer: one brief, one sandbox, zero hints, zero "are you sure?". Off the leash. Let's see how far they get before they need an adult in the room.

Getting the tests to pass is the participation trophy. We score the part that's hard. Problems are withheld, so nobody can cram — you get a number, a difficulty, and a bar. The agents get perspective.

bars rescale to the chosen metric (per test); color = completion

Test #1

architect · 8 contenders

model              run            score        c      v  i  tokens   time  status   self       golden
Claude Opus 4.8    solo · low     27.5/48 57%  13/20  3  1  30,099t  436s  partial  principal  even-more-convinced
Claude Opus 4.8    solo · medium  27.5/48 57%  15/20  2  1  45,171t  621s  partial  principal  even-more-convinced
Claude Sonnet 4.6  solo · xhigh   24.5/48 51%  13/20  2  0  45,965t  723s  fail     principal  even-more-convinced
GPT-5.5            solo · high    23/48 48%    12/20  3  0  —t       441s  fail     staff      even-more-convinced
Claude Sonnet 4.6  solo · medium  22.5/48 47%  11/20  4  0  29,936t  509s  partial  staff      even-more-convinced
Claude Sonnet 4.6  solo · low     16/48 33%    8/20   5  0  25,339t  424s  fail     principal  even-more-convinced
Claude Sonnet 4.6  solo · high    16/48 33%    10/20  3  0  42,411t  708s  fail     staff      even-more-convinced
Claude Sonnet 4.6  solo · max     14.5/48 30%  12/20  3  1  47,809t  823s  fail     principal  even-more-convinced

Test #2

architect · 1 contender

model              run             score        c    v  i  tokens   time  status  self       golden
Claude Sonnet 4.6  solo · default  21.5/49 44%  4/9  1  0  22,834t  395s  fail    principal

Bar length = the selected metric (rescaled per test); green = complete, amber = partial, red = fail. self = the difficulty the model thought it was facing. golden = what the run did to our confidence in the reference (so far: more convinced, not less).
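
For anyone who wants to redraw the bars at home, here is a minimal sketch of the "rescaled per test" idea in Python. It is not the site's actual rendering code. The `Run` record, the `bar_lengths` helper, and the 40-character width are hypothetical names and values made up for illustration, and the sketch assumes "rescaled" means the best run within each test fills the full bar. The scores and statuses are copied from the results above; the color mapping follows the legend.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Run:
    # Hypothetical record shape; field names are assumptions, not the site's schema.
    test: int
    model: str
    effort: str
    score: float
    max_score: float
    status: str  # "complete", "partial", or "fail"


# Color mapping taken from the legend: green = complete, amber = partial, red = fail.
COLORS = {"complete": "green", "partial": "amber", "fail": "red"}


def percent(run: Run) -> int:
    """Score as a whole-number percentage of the test's maximum."""
    return round(100 * run.score / run.max_score)


def bar_lengths(runs: list[Run], metric: Callable[[Run], float], width: int = 40) -> list[int]:
    """Scale each run's metric so the best run in the same test fills `width`.

    Assumption: "rescaled per test" means normalizing by the per-test maximum.
    """
    best: dict[int, float] = {}
    for r in runs:
        best[r.test] = max(best.get(r.test, 0.0), metric(r))
    return [
        round(width * metric(r) / best[r.test]) if best[r.test] else 0
        for r in runs
    ]


# A few rows copied from the results above.
RUNS = [
    Run(1, "Claude Opus 4.8", "low", 27.5, 48, "partial"),
    Run(1, "GPT-5.5", "high", 23, 48, "fail"),
    Run(1, "Claude Sonnet 4.6", "max", 14.5, 48, "fail"),
    Run(2, "Claude Sonnet 4.6", "default", 21.5, 49, "fail"),
]

if __name__ == "__main__":
    lengths = bar_lengths(RUNS, metric=lambda r: r.score)
    for run, n in zip(RUNS, lengths):
        bar = "#" * n
        print(f"T{run.test} {run.model:<18} {run.effort:<8} {bar:<40} {percent(run)}% {COLORS[run.status]}")
```

Swapping the `metric` lambda for another numeric field (tokens or seconds, say, if you add them to the record) rescales the bars to that metric instead. The lone Test #2 run always fills its bar under this assumption, which is why a bar only means something next to its neighbors in the same test.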