| Rank | Model | ||||
|---|---|---|---|---|---|
| 1 | Arrow 1.1 Official API | 40.93 | 39.00 | 52.20 | 35.20 |
| 2 | Gemini 3.1 Pro Official API. reasoning_effort: medium | 32.63 | 55.20 | 42.20 | 20.20 |
| 3 | Gemini 3 Flash Official API. reasoning_effort: minimal | 22.70 | 41.60 | 29.80 | 12.80 |
| 4 | GPT-5.5 Cloudflare Proxy API. reasoning_effort: medium | 22.66 | 51.20 | 25.40 | 12.20 |
| 5 | Qwen3.6-Max-Preview Official API. Thinking mode enabled. | 18.60 | 29.00 | 17.80 | 15.80 |
| 6 | DeepSeek v4 Pro Official API. Thinking mode enabled. reasoning_effort: high | 17.34 | 28.20 | 20.80 | 12.00 |
| 7 | GLM-5.1 Official API | 16.30 | 33.40 | 18.00 | 10.00 |
| 8 | MiMo-V2.5-Pro Official API | 15.63 | 27.80 | 17.80 | 10.60 |
| 9 | Claude Sonnet 4.6 Cloudflare Proxy API. Effort: medium | 14.73 | 29.60 | 13.80 | 10.60 |
| 10 | Doubao-Seed-2.0-pro Official API | 13.71 | 26.20 | 13.00 | 10.20 |
| 11 | MiMo-V2.5 Official API | 13.33 | 23.40 | 9.40 | 12.40 |
| 12 | Qwen3.6-Plus Official API. Thinking mode enabled. | 12.43 | 18.80 | 15.00 | 9.00 |
| 13 | DeepSeek v4 Flash Official API. Thinking mode disabled. | 12.39 | 16.60 | 15.00 | 9.60 |
| 14 | Claude Opus 4.7 Cloudflare Proxy API. Thinking: adaptive. Effort: medium | 10.59 | 19.20 | 10.40 | 8.00 |
| 15 | Composer 2 Generated by Cursor Subagents | 8.61 | 13.60 | 11.20 | 5.60 |
| 16 | Gemini 3.1 Flash-Lite Official API. reasoning_effort: minimal | 8.01 | 22.40 | 6.80 | 4.20 |
| 17 | Kimi K2.6 Official API. Thinking mode enabled. | 5.56 | 15.60 | 2.40 | 4.20 |
| 18 | Step 3.5 Flash OpenRouter API | 3.17 | 9.00 | 3.80 | 1.00 |