Latest frontier flagship per lab · Excludes OpenAI, Anthropic, and Google Gemini models. Numbers are vendor-reported from official READMEs/blogs unless marked ● (independent eval).
| Model | Access | Input (cache miss) | Output | Cached input | Notes |
|---|---|---|---|---|---|
| MiniMax M3 | API | $0.30 | $1.20 | $0.06 | ≤512k ctx · 50% launch promo (list $0.60 / $2.40) |
| DeepSeek V4-Pro Max | Open + API | $0.435 | $0.87 | $0.003625 | MIT weights · 1M ctx |
| Kimi K2.6 | Open + API | $0.95 | $4.00 | $0.16 | Mod. MIT · 256k ctx · thinking default |
| GLM-5.1 Thinking | Open + API | $1.40 | $4.40 | $0.26 | MIT weights · 202k ctx |
| Qwen3.7-Max | API only | $2.50 | $7.50 | $0.25 | Text agent · 1M ctx · Alibaba Model Studio |
| Qwen3.7-Plus | API only | $0.40 | $1.60 | - | Multimodal · ≤256k tier · $1.20 / $4.80 above 256k |
| MiMo-V2.5-Pro | Open + API | $1.00 | $3.00 | $0.20 | MIT weights · ≤256k tier · overseas API |
| Step 3.7 Flash | Open + API | $0.20 | $1.15 | $0.04 | Apache 2.0 · 256k ctx |
Pricing cross-checked 2026-06-01 · Qwen3.7-Plus added · Long-context tiers cost more · Plus not on OpenRouter yet
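The three price columns combine into a per-request cost once the cached share of the prompt is known. The sketch below is illustrative only. It assumes the prices are USD per 1M tokens (the table does not state the unit) and that cached tokens are billed at the cached-input rate instead of the cache-miss rate. `Price` and `request_cost` are hypothetical names, not part of any vendor SDK.

```python
from dataclasses import dataclass

# Assumption: table prices are USD per 1,000,000 tokens.
TOKENS_PER_PRICE_UNIT = 1_000_000


@dataclass(frozen=True)
class Price:
    input_miss: float    # "Input (cache miss)" column
    output: float        # "Output" column
    cached_input: float  # "Cached input" column


def request_cost(price: Price, input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Estimated USD cost of one request.

    cached_tokens is the part of input_tokens served from the prompt cache.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    miss_tokens = input_tokens - cached_tokens
    total = (miss_tokens * price.input_miss
             + cached_tokens * price.cached_input
             + output_tokens * price.output)
    return total / TOKENS_PER_PRICE_UNIT


# Rows copied from the pricing table above.
KIMI_K26 = Price(input_miss=0.95, output=4.00, cached_input=0.16)
DEEPSEEK_V4_PRO = Price(input_miss=0.435, output=0.87, cached_input=0.003625)

if __name__ == "__main__":
    # 100k-token prompt with 80k tokens cached, 2k-token reply.
    for name, price in [("Kimi K2.6", KIMI_K26),
                        ("DeepSeek V4-Pro Max", DEEPSEEK_V4_PRO)]:
        cost = request_cost(price, 100_000, 2_000, cached_tokens=80_000)
        print(f"{name}: ${cost:.4f}")
```

Tiered rows such as Qwen3.7-Plus would need a second `Price` for the above-256k tier. How a vendor decides which tier applies to a request is not stated here, so the sketch leaves that out.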
| Benchmark | M3 | DS V4-Pro | K2.6 | GLM-5.1 | Q3.7-Max | Q3.7-Plus | MiMo-V2.5 | Step-3.7 |
|---|---|---|---|---|---|---|---|---|
| AA Intelligence Index v4.0 ● | - | 52 | 54 | 51 | 57 | - | 54 | - |
| Vals Index ● | - | 56.23% | 55.55% | 52.14% | 57.29% | - | - | - |
AA / Vals cross-checked 2026-06-01 · Q3.7-Plus not on AA or Vals yet · Plus numbers from qwen.ai/blog
| Benchmark | M3 | DS V4-Pro Max | K2.6 | GLM-5.1 | Q3.7-Max | Q3.7-Plus | MiMo-V2.5-Pro | Step-3.7 |
|---|---|---|---|---|---|---|---|---|
| Knowledge & Reasoning | | | | | | | | |
| MMLU-Pro | - | 87.5 | - | - | 89.6 | 88.5 | 68.5 | - |
| GPQA Diamond | - | 90.1 | 90.5 | 86.2 | 92.4 | 90.3 | 66.7 | 78.41 |
| HLE (no tools) | - | 37.7 | 34.7 | 31.0 | 41.4 | 34.7 | - | 49.7 |
| HLE w/ tools | - | 48.2 | 54.0 | 52.3 | - | - | - | 48.1 |
| AIME 2026 | - | - | 96.4 | 95.3 | - | - | - | - |
| LiveCodeBench v6 | - | 93.5 | 89.6 | - | 91.6 | 89.6 | 39.6 | - |
| IMOAnswerBench | - | 89.8 | 86.0 | 83.8 | - | 86.0 | - | - |
| Apex Math Reasoning | - | 38.3 | - | - | 44.5 | 22.7 | - | - |
| Coding & Agentic | | | | | | | | |
| SWE-bench Verified | - | 80.6 | 80.2 | - | 80.4 | 77.7 | 78.9 | 76.5 |
| SWE-bench Pro | 59.0 | 55.4 | 58.6 | 58.4 | 60.6 | 57.6 | 57.2 | 56.3 |
| SWE-bench Multilingual | - | 76.2 | 76.7 | - | 78.3 | 75.8 | - | 72.4 |
| Terminal-Bench 2.0 / 2.1 | 66.0 (2.1) | 67.9 (2.0) | 66.7 (2.0) | 63.5 (2.0) | 69.7 (2.0) | 70.3 (2.0) | 68.4 (2.0) | 59.6 (2.1) |
| NL2Repo | - | - | - | 42.7 | - | 41.1 | - | - |
| BrowseComp | - | 83.4 | 83.2 | 68.0 | - | - | - | 75.82 |
| Toolathlon | - | 51.8 | 50.0 | 40.7 | - | - | - | 49.5 |
| MCPAtlas Public | 74.2 | 73.6 | - | 71.8 | 76.4 | 73.2 | - | - |
| ClawEval Pass³ | - | - | 62.3 | 62.7 | 65.2 | 62.7 | 64.0 | 67.1 |
| GDPval-AA (Elo) | - | 1554 | - | - | - | - | - | 1415.8 |
| τ²-Bench | - | - | - | - | - | - | - | - |
| Skillsbench | - | - | - | - | 59.2 | 54.9 | - | - |
| SciCode | - | - | 52.2 | - | 53.5 | 51.3 | - | - |
| CyberGym | - | - | - | 68.7 | - | - | - | - |
| KernelBench Hard | 28.8 | - | - | - | - | - | - | - |
| SWE-fficiency | 34.8 | - | - | - | - | - | - | - |
| PostTrainBench | 0.37 | - | - | - | - | - | - | - |
| SimpleVQA Search | - | - | - | - | - | 81.7 | - | 79.16 |
| Long Context | | | | | | | | |
| MRCR 1M | - | 83.5 | - | - | - | - | - | - |
| CorpusQA 1M | - | 62.0 | - | - | - | - | - | - |
Official harness measuring pass rate on Next.js generation and migration workloads. Not comparable to the SWE-bench-style vendor tables above. The "Success w/ AGENTS.md" row is the pass rate when agents had bundled Next.js docs.
| Metric | M3 | DS V4-Pro | K2.6 | GLM-5.1 | Q3.7-Max | Q3.7-Plus | MiMo-V2.5 | Step-3.7 |
|---|---|---|---|---|---|---|---|---|
| Success rate ● | 75% | - | - | 75% | - | - | - | - |
| Success w/ AGENTS.md | 96% | - | - | 100% | - | - | - | - |
| Avg execution time | 181.30s | - | - | 254.36s | - | - | - | - |
| Harness (on eval) | OpenCode | - | - | OpenCode | - | - | - | - |
Only MiniMax M3 and GLM-5.1 on that leaderboard match our matrix; Kimi K2.6, Qwen3.7, DeepSeek, MiMo, and Step are not listed yet. The eval still shows Kimi K2.5 (21% / 135s) and MiniMax M2.7 (50% / 294s) on the same leaderboard.
LMArena via arena-ai-leaderboards snapshot 2026-06-02 (top 20 per board). API-only columns (Q3.7-Max, Q3.7-Plus) get LMArena / LiveBench where listed; OLLB row is open-weight only.
| Metric | M3 | DS V4-Pro | K2.6 | GLM-5.1 | Q3.7-Max | Q3.7-Plus | MiMo-V2.5 | Step-3.7 |
|---|---|---|---|---|---|---|---|---|
| LMArena Text Elo ● | - | - | - | 1474 | 1475 | - | - | - |
| LMArena Code Elo ● | - | 1464 | 1518 | 1533 | 1541 | - | 1471 | - |
| LiveBench global avg ● | - | 73.58 | 72.17 | 70.18 | 74.29 | - | - | - |
| LiveBench agentic coding ● | - | 56.67 | 58.33 | 55.00 | 51.67 | - | - | - |
| Open LLM LB Average ● | TBD | - | - | - | - | - | - | - |
| EQ-Bench Creative Writing Elo ● | - | 1569.9 | 1781.6 | 1644.8 | - | - | - | - |
LMArena arena IDs. Text: qwen3.7-max-preview (1475), glm-5.1 (1474). Code: qwen3.7-max-20260517 (1541), glm-5.1 (1533), kimi-k2.6 (1518), mimo-v2.5-pro (1471), deepseek-v4-pro-thinking (1464). Kimi / DeepSeek / MiMo not in Text top 20.
LiveBench from livebench.ai (Jun 2026 crawl).
OLLB: none of the five submitted HF weights appear in open-llm-leaderboard/contents yet (checked parquet, 4576 entries).
EQ-Bench Elo from eqbench.com/creative_writing (Kimi-K2.6, DeepSeek-V4-Pro, GLM-5.1 HF entries only).
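To put the LMArena Elo gaps in perspective, a rating difference can be converted into an expected head-to-head win rate. The sketch below uses the conventional Elo logistic formula (base 10, 400-point scale). It assumes the leaderboard scores sit on that scale, so treat the results as a rough reading, not as the leaderboard's own statistic. `expected_win_rate` is a hypothetical helper name.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Expected probability that A is preferred over B.

    Conventional Elo formula; assumes both ratings are on the same
    400-point, base-10 scale.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    # LMArena Code Elo values copied from the table above.
    code_elo = {
        "qwen3.7-max-20260517": 1541,
        "glm-5.1": 1533,
        "kimi-k2.6": 1518,
        "mimo-v2.5-pro": 1471,
        "deepseek-v4-pro-thinking": 1464,
    }
    leader = "qwen3.7-max-20260517"
    for model, rating in code_elo.items():
        if model == leader:
            continue
        p = expected_win_rate(code_elo[leader], rating)
        print(f"{leader} vs {model}: {p:.1%}")
```

Under these assumptions the 8-point gap between the top two Code entries corresponds to roughly a 51% expected win rate, which is close to a coin flip. Small Elo differences near the top of the board are therefore weak evidence of a real ranking.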