Latest frontier flagship per lab · Excludes OpenAI, Anthropic, and Google Gemini models. Numbers are vendor-reported from official READMEs and blogs unless marked ● (independent eval).
| Model | Access | Input (cache miss) | Output | Cached input | Notes |
|---|---|---|---|---|---|
| MiniMax M3 | API | $0.30 | $1.20 | $0.06 | ≤512k ctx · 50% launch promo (list $0.60 / $2.40) |
| DeepSeek V4-Pro Max | Open + API | $0.435 | $0.87 | $0.003625 | MIT weights · 1M ctx |
| Kimi K2.6 | Open + API | $0.95 | $4.00 | $0.16 | Mod. MIT · 256k ctx · thinking default |
| GLM-5.1 Thinking | Open + API | $1.40 | $4.40 | $0.26 | MIT weights · 202k ctx |
| Qwen3.7-Max | API only | $2.50 | $7.50 | $0.25 | Text agent · 1M ctx · Alibaba Model Studio |
| Qwen3.7-Plus | API only | $0.40 | $1.60 | - | Multimodal · ≤256k tier · $1.20 / $4.80 above 256k |
| MiMo-V2.5-Pro | Open + API | $1.00 | $3.00 | $0.20 | MIT weights · ≤256k tier · overseas API |
| Step 3.7 Flash | Open + API | $0.20 | $1.15 | $0.04 | Apache 2.0 · 256k ctx |

Pricing cross-checked 2026-06-01 · Qwen3.7-Plus added · Long-context tiers cost more · Qwen3.7-Plus not on OpenRouter yet

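
To make the three price columns concrete, here is a minimal sketch of how a single request's cost could be estimated from them. It assumes the prices are USD per 1M tokens (the table does not state the unit) and that a fixed share of input tokens is served from cache. The names `Price`, `PRICES`, and `request_cost` are hypothetical helpers, not part of any vendor SDK. Long-context tiers and the MiniMax list price are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Prices for one model. Unit ASSUMED to be USD per 1M tokens."""
    input_miss: float    # "Input (cache miss)" column
    output: float        # "Output" column
    cached_input: float  # "Cached input" column


# Values copied from three rows of the pricing table above.
PRICES = {
    "MiniMax M3": Price(0.30, 1.20, 0.06),  # promo price, not list
    "DeepSeek V4-Pro Max": Price(0.435, 0.87, 0.003625),
    "Kimi K2.6": Price(0.95, 4.00, 0.16),
}


def request_cost(price: Price, input_tokens: int, output_tokens: int,
                 cache_hit_ratio: float = 0.0) -> float:
    """Estimate the USD cost of one request.

    cache_hit_ratio is the share of input tokens billed at the cached rate.
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    cached = input_tokens * cache_hit_ratio
    missed = input_tokens - cached
    total = (missed * price.input_miss
             + cached * price.cached_input
             + output_tokens * price.output)
    return total / 1_000_000  # prices are per 1M tokens (assumption)


if __name__ == "__main__":
    # Example workload: 100k input tokens, half cached, 10k output tokens.
    for name, price in PRICES.items():
        cost = request_cost(price, 100_000, 10_000, cache_hit_ratio=0.5)
        print(f"{name}: ${cost:.4f}")
```

Because cached input is much cheaper than a cache miss for every model listed, the assumed cache hit ratio can change the ordering of models for input-heavy workloads, so it is worth measuring rather than guessing.
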
| Benchmark | M3 | DS V4-Pro | K2.6 | GLM-5.1 | Q3.7-Max | Q3.7-Plus | MiMo-V2.5 | Step-3.7 |
|---|---|---|---|---|---|---|---|---|
| AA Intelligence Index v4.0 ● | - | 52 | 54 | 51 | 57 | - | 54 | - |
| Vals Index ● | - | 56.23% | 55.55% | 52.14% | 57.29% | - | - | - |

AA / Vals cross-checked 2026-06-01 · Q3.7-Plus not on AA or Vals yet · Q3.7-Plus numbers from qwen.ai/blog

| Benchmark | M3 | DS V4-Pro Max | K2.6 | GLM-5.1 | Q3.7-Max | Q3.7-Plus | MiMo-V2.5-Pro | Step-3.7 |
|---|---|---|---|---|---|---|---|---|
| Knowledge & Reasoning | | | | | | | | |
| MMLU-Pro | - | 87.5 | - | - | 89.6 | 88.5 | 68.5 | - |
| GPQA Diamond | - | 90.1 | 90.5 | 86.2 | 92.4 | 90.3 | 66.7 | 78.41 |
| HLE (no tools) | - | 37.7 | 34.7 | 31.0 | 41.4 | 34.7 | - | 49.7 |
| HLE w/ tools | - | 48.2 | 54.0 | 52.3 | - | - | - | 48.1 |
| AIME 2026 | - | - | 96.4 | 95.3 | - | - | - | - |
| LiveCodeBench v6 | - | 93.5 | 89.6 | - | 91.6 | 89.6 | 39.6 | - |
| IMOAnswerBench | - | 89.8 | 86.0 | 83.8 | - | 86.0 | - | - |
| Apex Math Reasoning | - | 38.3 | - | - | 44.5 | 22.7 | - | - |
| Coding & Agentic | | | | | | | | |
| SWE-bench Verified | - | 80.6 | 80.2 | - | 80.4 | 77.7 | 78.9 | 76.5 |
| SWE-bench Pro | 59.0 | 55.4 | 58.6 | 58.4 | 60.6 | 57.6 | 57.2 | 56.3 |
| SWE-bench Multilingual | - | 76.2 | 76.7 | - | 78.3 | 75.8 | - | 72.4 |
| Terminal-Bench 2.0 / 2.1 | 66.0 (2.1) | 67.9 (2.0) | 66.7 (2.0) | 63.5 (2.0) | 69.7 (2.0) | 70.3 (2.0) | 68.4 (2.0) | 59.6 (2.1) |
| NL2Repo | - | - | - | 42.7 | - | 41.1 | - | - |
| BrowseComp | - | 83.4 | 83.2 | 68.0 | - | - | - | 75.82 |
| Toolathlon | - | 51.8 | 50.0 | 40.7 | - | - | - | 49.5 |
| MCPAtlas Public | 74.2 | 73.6 | - | 71.8 | 76.4 | 73.2 | - | - |
| ClawEval Pass³ | - | - | 62.3 | 62.7 | 65.2 | 62.7 | 64.0 | 67.1 |
| GDPval-AA (Elo) | - | 1554 | - | - | - | - | - | 1415.8 |
| τ²-Bench | - | - | - | - | - | - | - | - |
| Skillsbench | - | - | - | - | 59.2 | 54.9 | - | - |
| SciCode | - | - | 52.2 | - | 53.5 | 51.3 | - | - |
| CyberGym | - | - | - | 68.7 | - | - | - | - |
| KernelBench Hard | 28.8 | - | - | - | - | - | - | - |
| SWE-fficiency | 34.8 | - | - | - | - | - | - | - |
| PostTrainBench | 0.37 | - | - | - | - | - | - | - |
| SimpleVQA Search | - | - | - | - | - | 81.7 | - | 79.16 |
| Long Context | | | | | | | | |
| MRCR 1M | - | 83.5 | - | - | - | - | - | - |
| CorpusQA 1M | - | 62.0 | - | - | - | - | - | - |
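
Many cells in the benchmark table are `-` (not reported), so comparing models row by row means skipping the gaps rather than treating them as zero. The sketch below shows one way to do that for a single row. `parse_row` and `rank` are hypothetical helper names, and the rows are copied verbatim from the table above.

```python
def parse_row(header: str, row: str) -> tuple[str, dict[str, float]]:
    """Split one markdown table row into (benchmark, {model: score}).

    Cells that are empty or "-" are treated as not reported and skipped.
    """
    names = [c.strip() for c in header.strip().strip("|").split("|")]
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    scores: dict[str, float] = {}
    for model, cell in zip(names[1:], cells[1:]):
        if cell in ("", "-"):
            continue  # not reported for this model
        # Keep the leading number; drop annotations like "(2.1)" or "%".
        scores[model] = float(cell.split()[0].rstrip("%"))
    return cells[0], scores


def rank(scores: dict[str, float]) -> list[tuple[str, float]]:
    """Sort models by score, highest first."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


HEADER = ("| Benchmark | M3 | DS V4-Pro Max | K2.6 | GLM-5.1 | Q3.7-Max "
          "| Q3.7-Plus | MiMo-V2.5-Pro | Step-3.7 |")
SWE_PRO = "| SWE-bench Pro | 59.0 | 55.4 | 58.6 | 58.4 | 60.6 | 57.6 | 57.2 | 56.3 |"
MRCR_1M = "| MRCR 1M | - | 83.5 | - | - | - | - | - | - |"

if __name__ == "__main__":
    for row in (SWE_PRO, MRCR_1M):
        benchmark, scores = parse_row(HEADER, row)
        print(benchmark, f"({len(scores)} of 8 models reported)")
        for model, score in rank(scores):
            print(f"  {model}: {score}")
```

A ranking like this is only meaningful when every score in the row comes from the same benchmark version and harness. The Terminal-Bench row, for example, mixes 2.0 and 2.1 results, and most numbers are vendor-reported, so small gaps should not be read as real differences.
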