ComputeBench

Instruction-following benchmarks for long, step-by-step arithmetic. We track how well models keep state without drifting across many small steps.

Planned: multiplication, long division, square roots
Runs: --a-digits 7 --b-digits 8

Data

Browse run data for raw prompts, model outputs, and expected answers.

Ranking

Models are ranked by exact match rate (the full output matches the expected output), with answer accuracy as the tiebreaker.

Model                          Exact match rate  Answer accuracy  Avg prefix match  Samples
google/gemini-3-pro-preview    100.0%            100.0%           100.0%            10
anthropic/claude-sonnet-4.5    40.0%             50.0%            78.5%             10
x-ai/grok-4.1-fast             40.0%             40.0%            62.9%             10
google/gemini-3-flash-preview  30.0%             80.0%            56.5%             10
openai/gpt-5.2                 20.0%             80.0%            61.6%             10
x-ai/grok-code-fast-1          10.0%             10.0%            28.5%             10
google/gemini-2.5-flash        0.0%              30.0%            21.9%             10
openai/gpt-4.1                 0.0%              0.0%             9.0%              10

Scoring

We compare outputs line by line, starting at the first "Step 1:" line and ignoring any preamble before it. Rates are averages across samples.
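
The comparison described here could be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual scorer: the function names trace_lines, exact_match, and exact_match_rate are hypothetical, and trimming whitespace and skipping blank lines are assumptions the text does not confirm.

```python
def trace_lines(output: str) -> list[str]:
    """Return non-blank lines from the first 'Step 1:' line onward.

    Assumption: surrounding whitespace is trimmed and blank lines are skipped.
    """
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    for i, ln in enumerate(lines):
        if ln.startswith("Step 1:"):
            return lines[i:]  # everything before this line is preamble
    return []  # no step trace found


def exact_match(model_output: str, expected: str) -> bool:
    """True when every line of the trace matches the expected trace."""
    got = trace_lines(model_output)
    return bool(got) and got == trace_lines(expected)


def exact_match_rate(pairs: list[tuple[str, str]]) -> float:
    """Average exact-match result over (model_output, expected) samples."""
    if not pairs:
        return 0.0
    return sum(exact_match(m, e) for m, e in pairs) / len(pairs)
```

With this sketch, an output that opens with a sentence of preamble but reproduces every step line still counts as an exact match, while a single differing step line makes the whole sample a miss.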

Example prompt

Models are asked to show every step and end with a single Answer line.

Steps: 80
Problem: 3400139 * 43270486
Rounding: none
Step 1: (9 * 6) + 0 = 54; write 4 at 10^0, carry 5; row_partial=4; result=4
Step 2: (3 * 6) + 5 = 23; write 3 at 10^1, carry 2; row_partial=34; result=34
Step 3: (1 * 6) + 2 = 8; write 8 at 10^2, carry 0; row_partial=834; result=834
Step 4: (0 * 6) + 0 = 0; write 0 at 10^3, carry 0; row_partial=834; result=834
...
Answer: 147125666997554
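
To make the step format concrete, here is a small sketch that regenerates the first row of the trace (the multiplicand's digits times the last digit of the multiplier). The function name first_row_steps is hypothetical. The excerpt does not show how the row's final carry is written or how result is updated in later rows, so the sketch covers only the first row's digit steps and assumes result equals row_partial there, as in the four lines shown.

```python
def first_row_steps(a: int, b: int) -> list[str]:
    """Format the digit-by-digit steps for a times the last digit of b.

    Assumption: within the first row, result tracks row_partial, and both
    are printed as plain integers (so a leading zero digit is not shown).
    """
    multiplier = b % 10
    carry = 0
    row_partial = 0
    lines = []
    # Walk the digits of a from least significant to most significant.
    for power, ch in enumerate(reversed(str(a))):
        digit = int(ch)
        total = digit * multiplier + carry
        write, new_carry = total % 10, total // 10
        row_partial += write * 10 ** power
        lines.append(
            f"Step {power + 1}: ({digit} * {multiplier}) + {carry} = {total}; "
            f"write {write} at 10^{power}, carry {new_carry}; "
            f"row_partial={row_partial}; result={row_partial}"
        )
        carry = new_carry
    return lines


if __name__ == "__main__":
    for line in first_row_steps(3400139, 43270486)[:4]:
        print(line)
```

The first four lines produced this way agree with Steps 1 to 4 in the example above. Anything past Step 4 follows the same pattern by assumption only.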

Instruction following metrics

Exact match rate, answer accuracy, and format OK rate per model.

Average prefix match ratio

How far the model stays correct before the first drift, averaged over samples.
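
One plausible reading of the prefix match ratio is the share of expected lines that the model reproduces before its first mismatching line. The sketch below follows that reading. The names prefix_match_ratio and average_prefix_match are hypothetical, and dividing by the expected line count is an assumption, since the page does not state the denominator.

```python
def prefix_match_ratio(got: list[str], expected: list[str]) -> float:
    """Fraction of expected lines matched before the first drift.

    Assumption: both lists already start at the 'Step 1:' line, and the
    denominator is the number of expected lines.
    """
    if not expected:
        return 0.0
    matched = 0
    for g, e in zip(got, expected):
        if g != e:
            break  # first drift: stop counting
        matched += 1
    return matched / len(expected)


def average_prefix_match(samples: list[tuple[list[str], list[str]]]) -> float:
    """Mean prefix match ratio over (got, expected) samples."""
    if not samples:
        return 0.0
    return sum(prefix_match_ratio(g, e) for g, e in samples) / len(samples)
```

Under this definition a model that gets the first two of four lines right and then drifts scores 0.5 on that sample, even if later lines happen to match again.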