# ComputeBench
Instruction-following benchmarks for long, step-by-step arithmetic. We track how well models keep state without drifting across many small steps.
## Data
Browse run data for raw prompts, model outputs, and expected answers.
## Ranking
Ranked by exact match rate (full output matches expected), then answer accuracy.
| Model | Exact match rate | Answer accuracy | Avg prefix match | Samples |
|---|---|---|---|---|
| google/gemini-3-pro-preview | 100.0% | 100.0% | 100.0% | 10 |
| anthropic/claude-sonnet-4.5 | 40.0% | 50.0% | 78.5% | 10 |
| x-ai/grok-4.1-fast | 40.0% | 40.0% | 62.9% | 10 |
| google/gemini-3-flash-preview | 30.0% | 80.0% | 56.5% | 10 |
| openai/gpt-5.2 | 20.0% | 80.0% | 61.6% | 10 |
| x-ai/grok-code-fast-1 | 10.0% | 10.0% | 28.5% | 10 |
| google/gemini-2.5-flash | 0.0% | 30.0% | 21.9% | 10 |
| openai/gpt-4.1 | 0.0% | 0.0% | 9.0% | 10 |
## Scoring
We compare outputs line-by-line starting at the first "Step 1:" line and ignore any preamble. Rates are averages across samples.
- Exact match rate. Output matches expected exactly from Step 1 through the final Answer line, including step counts and results.
- Answer accuracy. The final `Answer:` value matches, even if the step text differs.
- Format OK rate. Every line follows the required syntax (Step lines or Answer line) with no extra text.
- Avg prefix match. Fraction of expected steps that match in order from Step 1 until the first mismatch, averaged across samples.
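The line-level metrics above can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness; the function name `score` and the returned dictionary keys are illustrative:

```python
def score(output: str, expected: str) -> dict:
    """Compare a model transcript to the expected one, line by line.

    Both transcripts are trimmed to start at the first "Step 1:" line,
    ignoring any preamble, as described above.
    """
    def steps(text):
        lines = [l.strip() for l in text.splitlines() if l.strip()]
        for i, line in enumerate(lines):
            if line.startswith("Step 1:"):
                return lines[i:]
        return lines

    out, exp = steps(output), steps(expected)
    exact = out == exp
    # Answer accuracy: compare only the trailing "Answer: N" lines.
    answer_ok = (
        bool(out) and bool(exp)
        and out[-1].startswith("Answer:") and out[-1] == exp[-1]
    )
    # Prefix match: fraction of expected lines matched in order
    # before the first mismatch.
    matched = 0
    for o, e in zip(out, exp):
        if o != e:
            break
        matched += 1
    prefix = matched / len(exp) if exp else 0.0
    return {"exact": exact, "answer": answer_ok, "prefix": prefix}
```

Per-model rates are then averages of these per-sample values.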
## Example prompt
Models are asked to show every step and end with a single Answer line.
```
Steps: 80
Problem: 3400139 * 43270486
Rounding: none
Step 1: (9 * 6) + 0 = 54; write 4 at 10^0, carry 5; row_partial=4; result=4
Step 2: (3 * 6) + 5 = 23; write 3 at 10^1, carry 2; row_partial=34; result=34
Step 3: (1 * 6) + 2 = 8; write 8 at 10^2, carry 0; row_partial=834; result=834
Step 4: (0 * 6) + 0 = 0; write 0 at 10^3, carry 0; row_partial=834; result=834
...
Answer: 147125666997554
```
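The step lines follow ordinary long multiplication: each step multiplies one digit of the multiplicand by the current digit of the multiplier, adds the incoming carry, writes one output digit, and carries the rest. A hedged sketch of how one such row of steps could be generated (the function name is illustrative, and the running cross-row `result=` field from the example is omitted for brevity):

```python
def multiply_row_steps(multiplicand: int, digit: int, start: int = 1):
    """One row of long multiplication: multiplicand * digit, one step
    per multiplicand digit, least significant digit first."""
    steps, carry, row = [], 0, ""
    for place, ch in enumerate(reversed(str(multiplicand))):
        carry_in = carry
        total = int(ch) * digit + carry_in
        write, carry = total % 10, total // 10
        row = str(write) + row  # prepend: digits arrive low-order first
        steps.append(
            f"Step {start + place}: ({ch} * {digit}) + {carry_in} = {total}; "
            f"write {write} at 10^{place}, carry {carry}; row_partial={int(row)}"
        )
    return steps
```

For `multiply_row_steps(3400139, 6)` this reproduces the first four step lines shown above (minus the `result=` field), which is why small carry or state errors early in a transcript cascade through every later step.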