Frontier Model Benchmarks (Jun 2026)

Latest frontier flagship per lab · Chinese labs + NVIDIA · Excludes OpenAI, Anthropic, Google Gemini. Numbers are vendor-reported from official READMEs/blogs unless marked as independent eval.

As of 1 Jun 2026 · 8 labs · 9 models
Sources: vendor · AA · Vals · Next.js · LMArena · LiveBench · OLLB · EQ-Bench
Pricing: official API · USD / M tokens

Latest model per lab

MiniMax
M3
Jun 1 · 428B/23B act · 1M ctx · MSA · MiniMax license
$0.30 / $1.20 in · out
AA Intelligence Index 55 · weights released

DeepSeek
V4-Pro Max
Apr 24 · 1.6T/49B act · 1M ctx · MIT
$0.435 / $0.87 in · out

Moonshot
Kimi K2.6
Apr 21 · 1T/32B act · 256K · Mod. MIT · general flagship
$0.95 / $4.00 in · out

Moonshot
Kimi K2.7-Code
Jun 12 · 1T/32B act · 256K · Mod. MIT · code-specialized
pricing TBD
Code sibling of K2.6. Vendor benches: Kimi Code Bench v2 62.0 · Program Bench 53.6 · MCP Mark Verified 81.1 · MCP Atlas 76.0. Only MCP Atlas maps to the matrix, where it is shown as (K2.7).

Z.ai
GLM-5.2
Jun 13 · 1M ctx · MIT (weights ~1wk) · supersedes 5.1
Coding Plan ~$18/mo · API pending
Benchmarks pending: Zhipu published none at launch. Matrix below shows verified GLM-5.1.

Alibaba
Qwen3.7-Max
May 19 · 1M ctx · text agent · API-only
$2.50 / $7.50 in · out

Alibaba
Qwen3.7-Plus
May 26 · 1M ctx · multimodal · API-only
$0.40 / $1.60 in · out

Xiaomi
MiMo-V2.5-Pro
Apr 22 · 1.02T/42B act · 1M · MIT
$1.00 / $3.00 in · out

StepFun
Step 3.7 Flash
May 29 · 198B/11B act · 256K · Apache 2.0
$0.20 / $1.15 in · out

NVIDIA (US)
Nemotron 3 Ultra
Jun 4 · 550B/55B act · 1M ctx · OpenMDW · Mamba-Transf.
open weights · self-host / NIM
US open-weight leader, below the Chinese frontier · AA Intelligence Index 47.7.

API pricing (USD per 1M tokens · official vendor)

Purple highlight = lowest in row · Open-weight models also support self-hosting (no per-token API fee)

| Model | Access | Input (cache miss) | Output | Cached input | Notes |
| --- | --- | --- | --- | --- | --- |
| MiniMax M3 | API | $0.30 | $1.20 | $0.06 | ≤512k ctx · 50% launch promo (list $0.60 / $2.40) |
| DeepSeek V4-Pro Max | Open + API | $0.435 | $0.87 | $0.003625 | MIT weights · 1M ctx |
| Kimi K2.6 | Open + API | $0.95 | $4.00 | $0.16 | Mod. MIT · 256k ctx · thinking default · K2.7-Code (code) pricing TBD |
| GLM-5.2 (5.1 pricing) | Open + API | $1.40 | $4.40 | $0.26 | 5.2 launched Jun 13 (1M ctx); standalone API pricing pending; shows GLM-5.1 rates |
| Qwen3.7-Max | API only | $2.50 | $7.50 | $0.25 | Text agent · 1M ctx · Alibaba Model Studio |
| Qwen3.7-Plus | API only | $0.40 | $1.60 | - | Multimodal · ≤256k tier · $1.20 / $4.80 above 256k |
| MiMo-V2.5-Pro | Open + API | $1.00 | $3.00 | $0.20 | MIT weights · ≤256k tier · overseas API |
| Step 3.7 Flash | Open + API | $0.20 | $1.15 | $0.04 | Apache 2.0 · 256k ctx |
| Nemotron 3 Ultra | Open weights | - | - | - | OpenMDW-1.1 · self-host or NVIDIA NIM · 550B/55B MoE |

Pricing cross-checked 2026-06-01 · Qwen3.7-Plus added · Long-context tiers cost more · Plus not on OpenRouter yet
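
The per-million-token rates in the pricing table can be turned into a per-request cost estimate. The following is a minimal sketch, not any vendor's billing logic: `Price`, `PRICES`, and `request_cost` are illustrative names, only a few rows are copied from the table, and it assumes that cached input tokens are billed at the cached rate while the rest of the input is billed at the cache-miss rate. Where the table lists no cached rate, the sketch assumes cached tokens cost the full input rate. Long-context tiers and promos are ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Price:
    """USD per 1M tokens, copied from the pricing table."""
    input: float                           # cache-miss input rate
    output: float
    cached_input: Optional[float] = None   # None where the table shows "-"


# Illustrative lookup; rates are the base tiers listed in the table.
PRICES = {
    "MiniMax M3": Price(0.30, 1.20, 0.06),
    "DeepSeek V4-Pro Max": Price(0.435, 0.87, 0.003625),
    "Kimi K2.6": Price(0.95, 4.00, 0.16),
    "Qwen3.7-Plus": Price(0.40, 1.60),     # no cached rate listed
    "Step 3.7 Flash": Price(0.20, 1.15, 0.04),
}


def request_cost(price: Price, input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Estimate the USD cost of one request.

    Assumption: cached_tokens is the part of input_tokens served from cache.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    # Assumption: fall back to the full input rate when no cached rate exists.
    cached_rate = price.cached_input if price.cached_input is not None else price.input
    fresh_tokens = input_tokens - cached_tokens
    total = (fresh_tokens * price.input
             + cached_tokens * cached_rate
             + output_tokens * price.output)
    return total / 1_000_000


# 100k-token prompt with 80k cached, 2k-token reply, on Kimi K2.6.
cost = request_cost(PRICES["Kimi K2.6"], 100_000, 2_000, cached_tokens=80_000)
print(f"${cost:.4f}")  # $0.0398
```

For agent workloads that resend a long, mostly cached prompt, the cached-input rate often dominates the bill, so it is worth comparing that column and not only the headline input price.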

Independent composite scores

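Artificial Analysis (independent) · Vals.ai (independent)

| Benchmark | M3 | DS V4-Pro | K2.6 | GLM-5.1 | Q3.7-Max | Q3.7-Plus | MiMo-V2.5 | Step-3.7 | Nemotron |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AA Intelligence Index v4.0 | 55 | 52 | 54 (K2.6) | 51 | 57 | - | 54 | - | 47.7 |
| Vals Index | - | 56.23% | 55.55% (K2.6) | 52.14% | 57.29% | - | - | - | - |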

AA / Vals cross-checked 2026-06-14 · M3 now on AA (Intelligence Index 55) · GLM-5.2 not on AA/Vals yet · Q3.7-Plus pending · Kimi column = K2.6 (K2.7-Code only on MCP Atlas) · Nemotron 3 Ultra AA II 47.7 (US open-weight ref, below the frontier)

Full benchmark matrix (vendor-reported unless noted)

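Official model card / blog · Green highlight = best in row among models

Knowledge & Reasoning

| Benchmark | M3 | DS V4-Pro Max | K2.6 | GLM-5.1 | Q3.7-Max | Q3.7-Plus | MiMo-V2.5-Pro | Step-3.7 | Nemotron |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU-Pro | - | 87.5 | - | - | 89.6 | 88.5 | 68.5 | - | 86.8 |
| GPQA Diamond | - | 90.1 | 90.5 (K2.6) | 86.2 | 92.4 | 90.3 | 66.7 | 78.41 | 87.0 |
| HLE (no tools) | - | 37.7 | 34.7 (K2.6) | 31.0 | 41.4 | 34.7 | - | 49.7 | 26.7 |
| HLE w/ tools | - | 48.2 | 54.0 (K2.6) | 52.3 | - | - | - | 48.1 | 37.4 |
| AIME 2026 | - | - | 96.4 (K2.6) | 95.3 | - | - | - | - | - |
| LiveCodeBench v6 | - | 93.5 | 89.6 (K2.6) | - | 91.6 | 89.6 | 39.6 | - | 89.0 |
| IMOAnswerBench | - | 89.8 | 86.0 (K2.6) | 83.8 | - | 86.0 | - | - | 88.6 |
| Apex Math Reasoning | - | 38.3 | - | - | 44.5 | 22.7 | - | - | - |

Coding & Agentic

| Benchmark | M3 | DS V4-Pro Max | K2.6 | GLM-5.1 | Q3.7-Max | Q3.7-Plus | MiMo-V2.5-Pro | Step-3.7 | Nemotron |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SWE-bench Verified | - | 80.6 | 80.2 (K2.6) | - | 80.4 | 77.7 | 78.9 | 76.5 | 70.7 |
| SWE-bench Pro | 59.0 | 55.4 | 58.6 (K2.6) | 58.4 | 60.6 | 57.6 | 57.2 | 56.3 | - |
| SWE-bench Multilingual | - | 76.2 | 76.7 (K2.6) | - | 78.3 | 75.8 | - | 72.4 | 67.7 |
| Terminal-Bench 2.0 / 2.1 | 66.0 (2.1) | 67.9 (2.0) | 66.7 (2.0) (K2.6) | 63.5 (2.0) | 69.7 (2.0) | 70.3 (2.0) | 68.4 (2.0) | 59.6 (2.1) | 56.4 (2.1) |
| NL2Repo | - | - | - | 42.7 | - | 41.1 | - | - | - |
| BrowseComp | - | 83.4 | 83.2 (K2.6) | 68.0 | - | - | - | 75.82 | 44.4 |
| Toolathlon | - | 51.8 | 50.0 (K2.6) | 40.7 | - | - | - | 49.5 | - |
| MCPAtlas Public | 74.2 | 73.6 | 76.0 (K2.7) | 71.8 | 76.4 | 73.2 | - | - | - |
| ClawEval Pass³ | - | - | 62.3 (K2.6) | 62.7 | 65.2 | 62.7 | 64.0 | 67.1 | - |
| GDPval-AA (Elo) | - | 1554 | - | - | - | - | - | 1415.8 | - |
| τ²-Bench | - | - | - | - | - | - | - | - | - |
| Skillsbench | - | - | - | - | 59.2 | 54.9 | - | - | - |
| SciCode | - | - | 52.2 (K2.6) | - | 53.5 | 51.3 | - | - | 44.6 |
| CyberGym | - | - | - | 68.7 | - | - | - | - | - |
| KernelBench Hard | 28.8 | - | - | - | - | - | - | - | - |
| SWE-fficiency | 34.8 | - | - | - | - | - | - | - | - |
| PostTrainBench | 0.37 | - | - | - | - | - | - | - | - |
| SimpleVQA Search | - | - | - | - | - | 81.7 | - | 79.16 | - |

Long Context

| Benchmark | M3 | DS V4-Pro Max | K2.6 | GLM-5.1 | Q3.7-Max | Q3.7-Plus | MiMo-V2.5-Pro | Step-3.7 | Nemotron |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MRCR 1M | - | 83.5 | - | - | - | - | - | - | - |
| CorpusQA 1M | - | 62.0 | - | - | - | - | - | - | - |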
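
The green "best in row" highlight does not survive as plain text, but it can be recomputed from the cell values. The sketch below is one way to do that, under stated assumptions: higher is better for every row, `-` (or any non-numeric cell) means "not reported", and parenthesized tags such as `(K2.6)` or `(2.0)` are annotations, not scores. `COLUMNS`, `parse_cell`, and `best_in_row` are illustrative names.

```python
import re
from typing import List, Optional

# Column order of the full benchmark matrix.
COLUMNS = ["M3", "DS V4-Pro Max", "K2.6", "GLM-5.1", "Q3.7-Max",
           "Q3.7-Plus", "MiMo-V2.5-Pro", "Step-3.7", "Nemotron"]


def parse_cell(cell: str) -> Optional[float]:
    """Return the numeric score in a cell, or None if nothing was reported."""
    # Drop annotations like "(K2.6)" or "(2.0)", then a trailing percent sign.
    text = re.sub(r"\([^)]*\)", "", cell).strip().rstrip("%").strip()
    try:
        return float(text)
    except ValueError:
        return None  # "-", "TBD", harness names, and so on


def best_in_row(cells: List[str]) -> List[str]:
    """Return the column(s) holding the highest reported score in one row."""
    if len(cells) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} cells, got {len(cells)}")
    reported = {}
    for column, cell in zip(COLUMNS, cells):
        score = parse_cell(cell)
        if score is not None:
            reported[column] = score
    if not reported:
        return []
    top = max(reported.values())
    return [column for column, score in reported.items() if score == top]


# SWE-bench Pro row from the matrix.
swe_bench_pro = ["59.0", "55.4", "58.6 (K2.6)", "58.4", "60.6",
                 "57.6", "57.2", "56.3", "-"]
print(best_in_row(swe_bench_pro))  # ['Q3.7-Max']
```

Rows that mix harness versions, such as Terminal-Bench 2.0 / 2.1, are not strictly comparable, so a recomputed "best" there should be read with that caveat.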

Next.js agent evals (nextjs.org/evals · independent)

Vercel Next.js code gen & migration tasks (OpenCode / Claude Code / etc.) · Green = best success % in row · Orange = fastest avg time

Official harness measuring pass rate on Next.js generation and migration workloads. Results are not comparable to the SWE-bench-style vendor tables above. The "Success w/ AGENTS.md" row shows the pass rate when agents had bundled Next.js docs.

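| Metric | M3 | DS V4-Pro | K2.6 | GLM-5.1 | Q3.7-Max | Q3.7-Plus | MiMo-V2.5 | Step-3.7 | Nemotron |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Success rate | 75% | - | - | 75% | - | - | - | - | - |
| Success w/ AGENTS.md | 96% | - | - | 100% | - | - | - | - | - |
| Avg execution time | 181.30s | - | - | 254.36s | - | - | - | - | - |
| Harness (on eval) | OpenCode | - | - | OpenCode | - | - | - | - | - |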

Only MiniMax M3 and GLM-5.1 from the Next.js evals page match our matrix. Kimi K2.6, Qwen3.7, DeepSeek, MiMo, and Step are not listed yet. The same leaderboard still shows Kimi K2.5 (21% / 135s) and MiniMax M2.7 (50% / 294s).
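
The effect of bundling docs can be read as a percentage-point uplift between the two success rows. A small sketch, assuming the "Success w/ AGENTS.md" figures are absolute pass rates and not deltas; the `RESULTS` dictionary and `agents_md_uplift` helper are illustrative, with values copied from the Next.js eval table.

```python
# Values copied from the Next.js eval table (percent, seconds).
RESULTS = {
    "M3": {"success": 75, "with_agents_md": 96, "avg_seconds": 181.30},
    "GLM-5.1": {"success": 75, "with_agents_md": 100, "avg_seconds": 254.36},
}


def agents_md_uplift(row: dict) -> int:
    """Percentage-point gain when the agent had bundled Next.js docs."""
    return row["with_agents_md"] - row["success"]


uplifts = {model: agents_md_uplift(row) for model, row in RESULTS.items()}
fastest = min(RESULTS, key=lambda model: RESULTS[model]["avg_seconds"])

print(uplifts)  # {'M3': 21, 'GLM-5.1': 25}
print(fastest)  # M3
```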

Independent evals (third-party leaderboards)

LMArena crowd Elo (public chat APIs) · LiveBench (monthly refresh, contamination-resistant) · Open LLM Leaderboard Average (open-weight HF submissions only) · EQ-Bench Creative Writing v3 Elo · Green = best in row

LMArena via arena-ai-leaderboards snapshot 2026-06-02 (top 20 per board). API-only columns (Q3.7-Max, Q3.7-Plus) get LMArena / LiveBench where listed; OLLB row is open-weight only.

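| Metric | M3 | DS V4-Pro | K2.6 | GLM-5.1 | Q3.7-Max | Q3.7-Plus | MiMo-V2.5 | Step-3.7 | Nemotron |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LMArena Text Elo | - | - | - | 1474 | 1475 | - | - | - | - |
| LMArena Code Elo | - | 1464 | 1518 (K2.6) | 1533 | 1541 | - | 1471 | - | - |
| LiveBench global avg | - | 73.58 | 72.17 (K2.6) | 70.18 | 74.29 | - | - | - | - |
| LiveBench agentic coding | - | 56.67 | 58.33 (K2.6) | 55.00 | 51.67 | - | - | - | - |
| Open LLM LB Average | TBD | - | - | - | - | - | - | - | - |
| EQ-Bench Creative Writing Elo | - | 1569.9 | 1781.6 (K2.6) | 1644.8 | - | - | - | - | - |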

LMArena arena IDs: Text qwen3.7-max-preview (1475), glm-5.1 (1474); Code qwen3.7-max-20260517 (1541), glm-5.1 (1533), kimi-k2.6 (1518), deepseek-v4-pro-thinking (1464), mimo-v2.5-pro (1471). Kimi / DeepSeek / MiMo not in Text top 20. LiveBench from livebench.ai (Jun 2026 crawl). OLLB: none of the five submitted HF weights appear in open-llm-leaderboard/contents yet (checked parquet, 4576 entries). EQ-Bench Elo from eqbench.com/creative_writing (Kimi-K2.6, DeepSeek-V4-Pro, GLM-5.1 HF entries only).
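
Small Elo gaps are easy to over-read. As a rough guide, a rating difference can be converted into an expected head-to-head win rate with the classic Elo logistic formula. This is a sketch under an assumption: LMArena and EQ-Bench each fit their own rating models, so the standard 400-point scale used here may not match their exact methodology. `expected_win_rate` is an illustrative helper.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected probability that A beats B under the classic Elo formula.

    Assumption: the leaderboard's ratings behave like standard Elo with a
    400-point scale; real leaderboards may use a different fit.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# LMArena Code Elo: Q3.7-Max (1541) vs GLM-5.1 (1533), an 8-point gap.
p = expected_win_rate(1541, 1533)
print(f"{p:.1%}")  # 51.2%
```

Under this reading, the 8-point gap at the top of the Code board is close to a coin flip, while gaps of a couple of hundred points, as in the EQ-Bench row, imply a clear preference.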

Sources