Chinese Frontier Model Benchmarks (Jun 2026)

Latest frontier flagship per lab · Excludes OpenAI, Anthropic, and Google Gemini models. Numbers are vendor-reported from official READMEs/blogs unless marked as an independent eval.

As of 1 Jun 2026 · 7 labs · 8 models
Sources: vendor · AA · Vals · Next.js · LMArena · LiveBench · OLLB · EQ-Bench
Pricing: official API · USD / M tokens

Latest model per lab

MiniMax
M3
Jun 1 · 1M ctx · MSA · MIT (weights ~10d)
$0.30 / $1.20 in · out
DeepSeek
V4-Pro Max
Apr 24 · 1.6T/49B act · 1M ctx · MIT
$0.435 / $0.87 in · out
Moonshot
Kimi K2.6
Apr 21 · 1T/32B act · 256K · Mod. MIT
$0.95 / $4.00 in · out
Z.ai
GLM-5.1
Apr 8 · 744B/40B act · 202K · MIT
$1.40 / $4.40 in · out
Alibaba
Qwen3.7-Max
May 19 · 1M ctx · text agent · API-only
$2.50 / $7.50 in · out
Alibaba
Qwen3.7-Plus
May 26 · 1M ctx · multimodal · API-only
$0.40 / $1.60 in · out
Xiaomi
MiMo-V2.5-Pro
Apr 22 · 1.02T/42B act · 1M · MIT
$1.00 / $3.00 in · out
StepFun
Step 3.7 Flash
May 29 · 198B/11B act · 256K · Apache 2.0
$0.20 / $1.15 in · out

API pricing (USD per 1M tokens · official vendor)

Purple highlight = lowest in column · Open-weight models also support self-hosting (no per-token API fee)
| Model | Access | Input (cache miss) | Output | Cached input | Notes |
| --- | --- | --- | --- | --- | --- |
| MiniMax M3 | API | $0.30 | $1.20 | $0.06 | ≤512k ctx · 50% launch promo (list $0.60 / $2.40) |
| DeepSeek V4-Pro Max | Open + API | $0.435 | $0.87 | $0.003625 | MIT weights · 1M ctx |
| Kimi K2.6 | Open + API | $0.95 | $4.00 | $0.16 | Mod. MIT · 256k ctx · thinking default |
| GLM-5.1 Thinking | Open + API | $1.40 | $4.40 | $0.26 | MIT weights · 202k ctx |
| Qwen3.7-Max | API only | $2.50 | $7.50 | $0.25 | Text agent · 1M ctx · Alibaba Model Studio |
| Qwen3.7-Plus | API only | $0.40 | $1.60 | - | Multimodal · ≤256k tier · $1.20 / $4.80 above 256k |
| MiMo-V2.5-Pro | Open + API | $1.00 | $3.00 | $0.20 | MIT weights · ≤256k tier · overseas API |
| Step 3.7 Flash | Open + API | $0.20 | $1.15 | $0.04 | Apache 2.0 · 256k ctx |

Pricing cross-checked 2026-06-01 · Qwen3.7-Plus added · Long-context tiers cost more · Plus not on OpenRouter yet
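
To compare these rates on a concrete workload, the per-million-token prices can be turned into a per-request estimate. The Python sketch below is an illustration only, not any vendor's billing logic: the `Price` type, the `request_cost` helper, and the example token counts are assumptions introduced here, while the rates are copied from the pricing table. It bills cached input tokens at the cached rate and the rest at the cache-miss rate, ignores long-context tiers and launch-promo expiry, and (as a further assumption) falls back to the cache-miss rate where no cached price is listed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Price:
    """USD per 1M tokens, as listed in the pricing table."""
    input_miss: float
    output: float
    cached_input: Optional[float]  # None where the table shows "-"


PRICES = {
    "MiniMax M3": Price(0.30, 1.20, 0.06),
    "DeepSeek V4-Pro Max": Price(0.435, 0.87, 0.003625),
    "Kimi K2.6": Price(0.95, 4.00, 0.16),
    "GLM-5.1 Thinking": Price(1.40, 4.40, 0.26),
    "Qwen3.7-Max": Price(2.50, 7.50, 0.25),
    "Qwen3.7-Plus": Price(0.40, 1.60, None),
    "MiMo-V2.5-Pro": Price(1.00, 3.00, 0.20),
    "Step 3.7 Flash": Price(0.20, 1.15, 0.04),
}


def request_cost(price: Price, input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Estimated USD cost of one request at flat (non-tiered) rates."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    # Assumption: with no listed cached rate, cached tokens bill as cache misses.
    cached_rate = price.input_miss if price.cached_input is None else price.cached_input
    miss_tokens = input_tokens - cached_tokens
    return (miss_tokens * price.input_miss
            + cached_tokens * cached_rate
            + output_tokens * price.output) / 1_000_000


# Hypothetical agent turn: 100k input tokens (80k of them cached), 4k output tokens.
workload = dict(input_tokens=100_000, output_tokens=4_000, cached_tokens=80_000)
for name, price in sorted(PRICES.items(), key=lambda kv: request_cost(kv[1], **workload)):
    print(f"{name:20s} ${request_cost(price, **workload):.4f}")
```

Because output and cached-input rates differ much more between vendors than cache-miss input rates do, the ranking can change with the input/output mix, so it is worth rerunning with token counts that resemble the real workload.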

Independent composite scores

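Artificial Analysis (independent) · Vals.ai (independent)

| Benchmark | M3 | DS V4-Pro | K2.6 | GLM-5.1 | Q3.7-Max | Q3.7-Plus | MiMo-V2.5 | Step-3.7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AA Intelligence Index v4.0 | - | 52 | 54 | 51 | 57 | - | 54 | - |
| Vals Index | - | 56.23% | 55.55% | 52.14% | 57.29% | - | - | - |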

AA / Vals cross-checked 2026-06-01 · Q3.7-Plus not on AA or Vals yet · Plus numbers from qwen.ai/blog

Full benchmark matrix (vendor-reported unless noted)

Official model card / blog · Green highlight = best in row among models
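| Benchmark | M3 | DS V4-Pro Max | K2.6 | GLM-5.1 | Q3.7-Max | Q3.7-Plus | MiMo-V2.5-Pro | Step-3.7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Knowledge & Reasoning | | | | | | | | |
| MMLU-Pro | - | 87.5 | - | - | 89.6 | 88.5 | 68.5 | - |
| GPQA Diamond | - | 90.1 | 90.5 | 86.2 | 92.4 | 90.3 | 66.7 | 78.41 |
| HLE (no tools) | - | 37.7 | 34.7 | 31.0 | 41.4 | 34.7 | - | 49.7 |
| HLE w/ tools | - | 48.2 | 54.0 | 52.3 | - | - | - | 48.1 |
| AIME 2026 | - | - | 96.4 | 95.3 | - | - | - | - |
| LiveCodeBench v6 | - | 93.5 | 89.6 | - | 91.6 | 89.6 | 39.6 | - |
| IMOAnswerBench | - | 89.8 | 86.0 | 83.8 | - | 86.0 | - | - |
| Apex Math Reasoning | - | 38.3 | - | - | 44.5 | 22.7 | - | - |
| Coding & Agentic | | | | | | | | |
| SWE-bench Verified | - | 80.6 | 80.2 | - | 80.4 | 77.7 | 78.9 | 76.5 |
| SWE-bench Pro | 59.0 | 55.4 | 58.6 | 58.4 | 60.6 | 57.6 | 57.2 | 56.3 |
| SWE-bench Multilingual | - | 76.2 | 76.7 | - | 78.3 | 75.8 | - | 72.4 |
| Terminal-Bench 2.0 / 2.1 | 66.0 (2.1) | 67.9 (2.0) | 66.7 (2.0) | 63.5 (2.0) | 69.7 (2.0) | 70.3 (2.0) | 68.4 (2.0) | 59.6 (2.1) |
| NL2Repo | - | - | - | 42.7 | - | 41.1 | - | - |
| BrowseComp | - | 83.4 | 83.2 | 68.0 | - | - | - | 75.82 |
| Toolathlon | - | 51.8 | 50.0 | 40.7 | - | - | - | 49.5 |
| MCPAtlas Public | 74.2 | 73.6 | - | 71.8 | 76.4 | 73.2 | - | - |
| ClawEval Pass³ | - | - | 62.3 | 62.7 | 65.2 | 62.7 | 64.0 | 67.1 |
| GDPval-AA (Elo) | - | 1554 | - | - | - | - | - | 1415.8 |
| τ²-Bench | - | - | - | - | - | - | - | - |
| Skillsbench | - | - | - | - | 59.2 | 54.9 | - | - |
| SciCode | - | - | 52.2 | - | 53.5 | 51.3 | - | - |
| CyberGym | - | - | - | 68.7 | - | - | - | - |
| KernelBench Hard | 28.8 | - | - | - | - | - | - | - |
| SWE-fficiency | 34.8 | - | - | - | - | - | - | - |
| PostTrainBench | 0.37 | - | - | - | - | - | - | - |
| SimpleVQA Search | - | - | - | - | - | 81.7 | - | 79.16 |
| Long Context | | | | | | | | |
| MRCR 1M | - | 83.5 | - | - | - | - | - | - |
| CorpusQA 1M | - | 62.0 | - | - | - | - | - | - |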
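
A "best in row" pick is only meaningful over cells that hold a number, and in the Terminal-Bench row only among cells that report the same benchmark version. The Python sketch below shows one way to make that selection from the raw cell strings. `parse_cell` and `best_in_row` are hypothetical helpers introduced here, not part of any leaderboard tooling, and the code assumes higher is better for every row it is given.

```python
import re
from typing import Dict, List, Optional, Tuple

MODELS = ["M3", "DS V4-Pro Max", "K2.6", "GLM-5.1",
          "Q3.7-Max", "Q3.7-Plus", "MiMo-V2.5-Pro", "Step-3.7"]

# A number, an optional "%", and an optional "(version)" suffix such as "(2.1)".
_CELL = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*%?\s*(?:\(([^)]+)\))?\s*$")


def parse_cell(cell: str) -> Optional[Tuple[float, Optional[str]]]:
    """Return (score, version) for a numeric cell, or None for a missing one ("-")."""
    match = _CELL.match(cell)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)


def best_in_row(cells: List[str]) -> Dict[Optional[str], List[str]]:
    """Map each benchmark version in the row to the model(s) with the top score.

    Rows without version labels come back under the key None. Scores are
    never compared across different version labels.
    """
    groups: Dict[Optional[str], List[Tuple[float, str]]] = {}
    for model, cell in zip(MODELS, cells):
        parsed = parse_cell(cell)
        if parsed is None:
            continue  # missing value: not a candidate
        score, version = parsed
        groups.setdefault(version, []).append((score, model))

    winners: Dict[Optional[str], List[str]] = {}
    for version, entries in groups.items():
        top = max(score for score, _ in entries)
        winners[version] = [model for score, model in entries if score == top]
    return winners


swe_bench_pro = ["59.0", "55.4", "58.6", "58.4", "60.6", "57.6", "57.2", "56.3"]
terminal_bench = ["66.0 (2.1)", "67.9 (2.0)", "66.7 (2.0)", "63.5 (2.0)",
                  "69.7 (2.0)", "70.3 (2.0)", "68.4 (2.0)", "59.6 (2.1)"]
print(best_in_row(swe_bench_pro))
print(best_in_row(terminal_bench))
```

Grouping by version keeps the 2.1 results (M3, Step-3.7) out of the 2.0 comparison, so a single row-wide winner is not reported where the cells are not directly comparable.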

Next.js agent evals (nextjs.org/evals · independent)

Vercel Next.js code gen & migration tasks (OpenCode / Claude Code / etc.) · Green = best success % in row · Orange = fastest avg time

Official harness measuring pass rate on Next.js generation and migration workloads. Not comparable to the SWE-bench-style vendor tables above. The "Success w/ AGENTS.md" row is the success rate, including the extra passes gained, when agents had bundled Next.js docs.

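| Metric | M3 | DS V4-Pro | K2.6 | GLM-5.1 | Q3.7-Max | Q3.7-Plus | MiMo-V2.5 | Step-3.7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Success rate | 75% | - | - | 75% | - | - | - | - |
| Success w/ AGENTS.md | 96% | - | - | 100% | - | - | - | - |
| Avg execution time | 181.30s | - | - | 254.36s | - | - | - | - |
| Harness (on eval) | OpenCode | - | - | OpenCode | - | - | - | - |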

Only MiniMax M3 and GLM-5.1 on that leaderboard match models in our matrix. Kimi K2.6, Qwen3.7, DeepSeek, MiMo, and Step are not listed yet. The same leaderboard still shows Kimi K2.5 (21% / 135s) and MiniMax M2.7 (50% / 294s).
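
With only two of the eight models listed, the Next.js numbers reduce to two comparisons: how much the bundled docs raise the success rate, and how the average run times compare. A short Python sketch of that arithmetic follows. The `RESULTS` layout and both helper names are assumptions made for illustration, and the figures are copied from the Next.js table.

```python
# Figures copied from the Next.js eval table; only M3 and GLM-5.1 are listed there.
RESULTS = {
    "M3": {"success": 75, "success_with_docs": 96, "avg_seconds": 181.30},
    "GLM-5.1": {"success": 75, "success_with_docs": 100, "avg_seconds": 254.36},
}


def docs_lift(row: dict) -> int:
    """Percentage-point gain in success rate when AGENTS.md docs are bundled."""
    return row["success_with_docs"] - row["success"]


def time_ratio(faster: dict, slower: dict) -> float:
    """How many times longer the second model's average run takes than the first's."""
    return slower["avg_seconds"] / faster["avg_seconds"]


for name, row in RESULTS.items():
    print(f"{name}: +{docs_lift(row)} points with AGENTS.md")
print(f"GLM-5.1 / M3 avg time: {time_ratio(RESULTS['M3'], RESULTS['GLM-5.1']):.2f}x")
```

Both models start from the same 75% baseline, so the difference between them on this eval comes from the docs-assisted rate and the run time rather than from the unaided success rate.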

Independent evals (third-party leaderboards)

LMArena crowd Elo (public chat APIs) · LiveBench (monthly refresh, contamination-resistant) · Open LLM Leaderboard Average (open-weight HF submissions only) · EQ-Bench Creative Writing v3 Elo · Green = best in row

LMArena via arena-ai-leaderboards snapshot 2026-06-02 (top 20 per board). API-only columns (Q3.7-Max, Q3.7-Plus) get LMArena / LiveBench where listed; OLLB row is open-weight only.

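| Metric | M3 | DS V4-Pro | K2.6 | GLM-5.1 | Q3.7-Max | Q3.7-Plus | MiMo-V2.5 | Step-3.7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LMArena Text Elo | - | - | - | 1474 | 1475 | - | - | - |
| LMArena Code Elo | - | 1464 | 1518 | 1533 | 1541 | - | 1471 | - |
| LiveBench global avg | - | 73.58 | 72.17 | 70.18 | 74.29 | - | - | - |
| LiveBench agentic coding | - | 56.67 | 58.33 | 55.00 | 51.67 | - | - | - |
| Open LLM LB Average (TBD) | - | - | - | - | - | - | - | - |
| EQ-Bench Creative Writing Elo | - | 1569.9 | 1781.6 | 1644.8 | - | - | - | - |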

LMArena arena IDs:
Text: qwen3.7-max-preview (1475), glm-5.1 (1474). Kimi / DeepSeek / MiMo not in Text top 20.
Code: qwen3.7-max-20260517 (1541), glm-5.1 (1533), kimi-k2.6 (1518), deepseek-v4-pro-thinking (1464), mimo-v2.5-pro (1471).
LiveBench: from livebench.ai (Jun 2026 crawl).
OLLB: none of the five submitted HF weights appear in open-llm-leaderboard/contents yet (checked parquet, 4576 entries).
EQ-Bench: Elo from eqbench.com/creative_writing (Kimi-K2.6, DeepSeek-V4-Pro, GLM-5.1 HF entries only).
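
Elo gaps are easier to read as head-to-head win rates. The Python sketch below applies the standard Elo logistic formula to the Code Elo values listed above. It assumes the arena ratings sit on the conventional 400-point, base-10 scale, which this page does not state, and `expected_win_rate` is a hypothetical helper name.

```python
# LMArena Code Elo values copied from the table above.
CODE_ELO = {
    "qwen3.7-max-20260517": 1541,
    "glm-5.1": 1533,
    "kimi-k2.6": 1518,
    "mimo-v2.5-pro": 1471,
    "deepseek-v4-pro-thinking": 1464,
}


def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of pairwise votes for A over B under the standard Elo model.

    Assumption: ratings follow the conventional logistic Elo scale (400 points, base 10).
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


leader = "qwen3.7-max-20260517"
for model, rating in CODE_ELO.items():
    if model != leader:
        share = expected_win_rate(CODE_ELO[leader], rating)
        print(f"{leader} vs {model}: {share:.1%} expected win rate")
```

On this scale an 8-point gap, such as the one between the top two Code entries, corresponds to a win rate only slightly above 50%, so small Elo differences should be read as near ties rather than clear rankings.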

Sources