# ComputeBench
Instruction-following benchmarks for long, step-by-step arithmetic. We track how well models keep state without drifting across many small steps.
## Data
Browse run data for raw prompts, model outputs, and expected answers.
## Ranking
Ranked by exact match rate (full output matches expected), then answer accuracy.
| Model | Exact match rate | Answer accuracy | Avg prefix match | Samples |
|---|---|---|---|---|
| google/gemini-3-pro-preview | 100.0% | 100.0% | 100.0% | 10 |
| anthropic/claude-sonnet-4.5 | 40.0% | 50.0% | 78.5% | 10 |
| x-ai/grok-4.1-fast | 40.0% | 40.0% | 62.9% | 10 |
| google/gemini-3-flash-preview | 30.0% | 80.0% | 56.5% | 10 |
| openai/gpt-5.2 | 20.0% | 80.0% | 61.6% | 10 |
| x-ai/grok-code-fast-1 | 10.0% | 10.0% | 28.5% | 10 |
| google/gemini-2.5-flash | 0.0% | 30.0% | 21.9% | 10 |
| openai/gpt-4.1 | 0.0% | 0.0% | 9.0% | 10 |
## Scoring
We compare outputs line-by-line starting at the first "Step 1:" line and ignore any preamble. Rates are averages across samples.
- Exact match rate. Output matches expected exactly from Step 1 through the final Answer line, including step counts and results.
- Answer accuracy. The final `Answer:` value matches, even if the step text differs.
- Format OK rate. Every line follows the required syntax (Step lines or Answer line) with no extra text.
- Avg prefix match. Fraction of expected steps that match in order from Step 1 until the first mismatch, averaged across samples.
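The line-level metrics above can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness; the function name `score` and the returned dictionary keys are illustrative:

```python
def score(output: str, expected: str) -> dict:
    """Compare a model transcript to the expected one, line by line.

    Both transcripts are trimmed to start at the first "Step 1:" line,
    ignoring any preamble, as described above.
    """
    def steps(text):
        lines = [l.strip() for l in text.splitlines() if l.strip()]
        for i, line in enumerate(lines):
            if line.startswith("Step 1:"):
                return lines[i:]
        return lines

    out, exp = steps(output), steps(expected)
    exact = out == exp
    # Answer accuracy: compare only the trailing "Answer: N" lines.
    answer_ok = (
        bool(out) and bool(exp)
        and out[-1].startswith("Answer:") and out[-1] == exp[-1]
    )
    # Prefix match: fraction of expected lines matched in order
    # before the first mismatch.
    matched = 0
    for o, e in zip(out, exp):
        if o != e:
            break
        matched += 1
    prefix = matched / len(exp) if exp else 0.0
    return {"exact": exact, "answer": answer_ok, "prefix": prefix}
```

Per-model rates are then averages of these per-sample values.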
## Example prompt
Models are asked to show every step and end with a single Answer line.
```
Steps: 80
Problem: 3400139 * 43270486
Rounding: none
Step 1: (9 * 6) + 0 = 54; write 4 at 10^0, carry 5; row_partial=4; result=4
Step 2: (3 * 6) + 5 = 23; write 3 at 10^1, carry 2; row_partial=34; result=34
Step 3: (1 * 6) + 2 = 8; write 8 at 10^2, carry 0; row_partial=834; result=834
Step 4: (0 * 6) + 0 = 0; write 0 at 10^3, carry 0; row_partial=834; result=834
...
Answer: 147125666997554
```
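The step lines follow ordinary long multiplication: each step multiplies one digit of the multiplicand by the current digit of the multiplier, adds the incoming carry, writes one output digit, and carries the rest. A hedged sketch of how one such row of steps could be generated (the function name is illustrative, and the running cross-row `result=` field from the example is omitted for brevity):

```python
def multiply_row_steps(multiplicand: int, digit: int, start: int = 1):
    """One row of long multiplication: multiplicand * digit, one step
    per multiplicand digit, least significant digit first."""
    steps, carry, row = [], 0, ""
    for place, ch in enumerate(reversed(str(multiplicand))):
        carry_in = carry
        total = int(ch) * digit + carry_in
        write, carry = total % 10, total // 10
        row = str(write) + row  # prepend: digits arrive low-order first
        steps.append(
            f"Step {start + place}: ({ch} * {digit}) + {carry_in} = {total}; "
            f"write {write} at 10^{place}, carry {carry}; row_partial={int(row)}"
        )
    return steps
```

For `multiply_row_steps(3400139, 6)` this reproduces the first four step lines shown above (minus the `result=` field), which is why small carry or state errors early in a transcript cascade through every later step.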