Frontier Model Benchmarks · Non-US Labs (Jun 2026)

Latest frontier flagship per lab · Chinese labs + IplanRIO (Brazil) · Excludes OpenAI, Anthropic, Google Gemini. Numbers are vendor-reported from official READMEs/blogs unless marked as independent eval.

As of 1 Jun 2026 · 8 labs · 9 models
Sources: vendor · AA · Vals · Next.js · LMArena · LiveBench · OLLB · EQ-Bench
Pricing: official API · USD / M tokens

Latest model per lab

MiniMax
M3
Jun 1 · 428B/23B act · 1M ctx · MSA · MiniMax license
$0.30 / $1.20 in · out
AA Intelligence Index 55 · weights released

DeepSeek
V4-Pro Max
Apr 24 · 1.6T/49B act · 1M ctx · MIT
$0.435 / $0.87 in · out

Moonshot
Kimi K2.7-Code
Jun 12 · 1T/32B act · 256K · Mod. MIT · code-specialized
pricing TBD
Vendor benches: Kimi Code Bench v2 62.0 · Program Bench 53.6 · MCP Mark Verified 81.1 · MCP Atlas 76.0. Replaces K2.6 (general flagship).

Z.ai
GLM-5.2
Jun 13 · 1M ctx · MIT (weights ~1wk) · supersedes 5.1
Coding Plan ~$18/mo · API pending
Benchmarks pending: Zhipu published none at launch. Matrix below shows verified GLM-5.1.

Alibaba
Qwen3.7-Max
May 19 · 1M ctx · text agent · API-only
$2.50 / $7.50 in · out

Alibaba
Qwen3.7-Plus
May 26 · 1M ctx · multimodal · API-only
$0.40 / $1.60 in · out

Xiaomi
MiMo-V2.5-Pro
Apr 22 · 1.02T/42B act · 1M · MIT
$1.00 / $3.00 in · out

StepFun
Step 3.7 Flash
May 29 · 198B/11B act · 256K · Apache 2.0
$0.20 / $1.15 in · out

IplanRIO (Brazil)
Rio 3.5 Open 397B
Jun 2026 · 397B/17B act · 1M ctx · MIT · Qwen3.5-derived
open weights · self-host
Rio de Janeiro city-gov model post-trained from Qwen3.5-397B.

API pricing (USD per 1M tokens · official vendor)

Purple highlight = lowest in row · Open-weight models also support self-hosting (no per-token API fee)
Model | Access | Input (cache miss) | Output | Cached input | Notes
MiniMax M3 | API | $0.30 | $1.20 | $0.06 | ≤512k ctx · 50% launch promo (list $0.60 / $2.40)
DeepSeek V4-Pro Max | Open + API | $0.435 | $0.87 | $0.003625 | MIT weights · 1M ctx
Kimi K2.7-Code | Open + API | TBD | TBD | TBD | Mod. MIT · 256k ctx · code-specialized · pricing not yet published
GLM-5.2 (5.1 pricing) | Open + API | $1.40 | $4.40 | $0.26 | 5.2 launched Jun 13 (1M ctx); standalone API pricing pending, so GLM-5.1 rates are shown
Qwen3.7-Max | API only | $2.50 | $7.50 | $0.25 | Text agent · 1M ctx · Alibaba Model Studio
Qwen3.7-Plus | API only | $0.40 | $1.60 | - | Multimodal · ≤256k tier · $1.20 / $4.80 above 256k
MiMo-V2.5-Pro | Open + API | $1.00 | $3.00 | $0.20 | MIT weights · ≤256k tier · overseas API
Step 3.7 Flash | Open + API | $0.20 | $1.15 | $0.04 | Apache 2.0 · 256k ctx
Rio 3.5 Open 397B | Open weights | - | - | - | MIT · self-host · no official API endpoint

Pricing cross-checked 2026-06-01 · Qwen3.7-Plus added · Long-context tiers cost more · Plus not on OpenRouter yet
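
To make the per-million-token rates concrete, here is a minimal sketch that estimates the cost of one request from the listed prices. The rates are copied from the pricing table; the `Price` class, the `request_cost` helper, and the example workload are illustrative assumptions, not part of any vendor SDK. It also assumes cached tokens are a subset of input tokens, and it ignores long-context tiers and promos, so it only applies below each model's listed tier boundary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Price:
    """USD per 1M tokens, copied from the pricing table."""
    input_miss: float
    output: float
    cached_input: Optional[float] = None  # None where the table shows "-"


# Models with TBD pricing, borrowed pricing, or no API endpoint are omitted.
PRICES = {
    "MiniMax M3": Price(0.30, 1.20, 0.06),
    "DeepSeek V4-Pro Max": Price(0.435, 0.87, 0.003625),
    "Qwen3.7-Max": Price(2.50, 7.50, 0.25),
    "Qwen3.7-Plus": Price(0.40, 1.60),
    "MiMo-V2.5-Pro": Price(1.00, 3.00, 0.20),
    "Step 3.7 Flash": Price(0.20, 1.15, 0.04),
}


def request_cost(price, input_tokens, output_tokens, cached_tokens=0):
    """Return the USD cost of one request.

    Assumption: cached tokens are billed at the cached rate; when no cached
    rate is listed, they are billed at the cache-miss input rate.
    """
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    cached_rate = price.input_miss if price.cached_input is None else price.cached_input
    miss_tokens = input_tokens - cached_tokens
    total = (
        miss_tokens * price.input_miss
        + cached_tokens * cached_rate
        + output_tokens * price.output
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 200k input tokens (150k of them cached), 20k output.
    costs = {
        name: request_cost(price, 200_000, 20_000, cached_tokens=150_000)
        for name, price in PRICES.items()
    }
    for name, cost in sorted(costs.items(), key=lambda item: item[1]):
        print(f"{name:22s} ${cost:.4f}")
```

Because output tokens are priced several times higher than input tokens for every model listed, the ranking can shift noticeably between prompt-heavy and generation-heavy workloads.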

Independent composite scores

Artificial Analysis (independent) · Vals.ai (independent)
Benchmark | M3 | DS V4-Pro | K2.7-Code | GLM-5.1 | Q3.7-Max | Q3.7-Plus | MiMo-V2.5 | Step-3.7 | Rio 3.5
AA Intelligence Index v4.0 | 55 | 52 | - | 51 | 57 | - | 54 | - | -
Vals Index | - | 56.23% | - | 52.14% | 57.29% | - | - | - | -

AA / Vals cross-checked 2026-06-14 · M3 now on AA (Intelligence Index 55) · Kimi K2.7-Code, GLM-5.2 & Rio 3.5 not on AA/Vals yet · Q3.7-Plus pending

Full benchmark matrix (vendor-reported unless noted)

Official model card / blog · Green highlight = best in row among models
Benchmark | M3 | DS V4-Pro Max | K2.7-Code | GLM-5.1 | Q3.7-Max | Q3.7-Plus | MiMo-V2.5-Pro | Step-3.7 | Rio 3.5
Knowledge & Reasoning
MMLU-Pro | - | 87.5 | - | - | 89.6 | 88.5 | 68.5 | - | 88.0
GPQA Diamond | - | 90.1 | - | 86.2 | 92.4 | 90.3 | 66.7 | 78.41 | 90.9
HLE (no tools) | - | 37.7 | - | 31.0 | 41.4 | 34.7 | - | 49.7 | 36.5
HLE w/ tools | - | 48.2 | - | 52.3 | - | - | - | 48.1 | -
AIME 2026 | - | - | - | 95.3 | - | - | - | - | -
LiveCodeBench v6 | - | 93.5 | - | - | 91.6 | 89.6 | 39.6 | - | -
IMOAnswerBench | - | 89.8 | - | 83.8 | - | 86.0 | - | - | 89.5
Apex Math Reasoning | - | 38.3 | - | - | 44.5 | 22.7 | - | - | 29.2
Coding & Agentic
SWE-bench Verified | - | 80.6 | - | - | 80.4 | 77.7 | 78.9 | 76.5 | 80.2
SWE-bench Pro | 59.0 | 55.4 | - | 58.4 | 60.6 | 57.6 | 57.2 | 56.3 | 58.1
SWE-bench Multilingual | - | 76.2 | - | - | 78.3 | 75.8 | - | 72.4 | 77.0
Terminal-Bench 2.0 / 2.1 | 66.0 (2.1) | 67.9 (2.0) | - | 63.5 (2.0) | 69.7 (2.0) | 70.3 (2.0) | 68.4 (2.0) | 59.6 (2.1) | 70.8 (2.1)
NL2Repo | - | - | - | 42.7 | - | 41.1 | - | - | -
BrowseComp | - | 83.4 | - | 68.0 | - | - | - | 75.82 | -
Toolathlon | - | 51.8 | - | 40.7 | - | - | - | 49.5 | -
MCPAtlas Public | 74.2 | 73.6 | 76.0 | 71.8 | 76.4 | 73.2 | - | - | 74.2
ClawEval Pass³ | - | - | - | 62.7 | 65.2 | 62.7 | 64.0 | 67.1 | -
GDPval-AA (Elo) | - | 1554 | - | - | - | - | - | 1415.8 | 1533 (est)
τ²-Bench | - | - | - | - | - | - | - | - | -
Skillsbench | - | - | - | - | 59.2 | 54.9 | - | - | -
SciCode | - | - | - | - | 53.5 | 51.3 | - | - | -
CyberGym | - | - | - | 68.7 | - | - | - | - | -
KernelBench Hard | 28.8 | - | - | - | - | - | - | - | -
SWE-fficiency | 34.8 | - | - | - | - | - | - | - | -
PostTrainBench | 0.37 | - | - | - | - | - | - | - | -
SimpleVQA Search | - | - | - | - | - | 81.7 | - | 79.16 | -
Long Context
MRCR 1M | - | 83.5 | - | - | - | - | - | - | -
CorpusQA 1M | - | 62.0 | - | - | - | - | - | - | -
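
The "best in row" highlight does not survive in plain text, so the following is a small sketch of how it could be recomputed from the cells of one matrix row. The `parse_cell` and `best_in_row` helpers are hypothetical names introduced for illustration. The sketch assumes that higher is better for the metric, that "-" means no published score, and that a parenthesised suffix such as "(2.1)" or "(est)" is an annotation rather than part of the value.

```python
import re

# Column order of the full benchmark matrix.
COLUMNS = [
    "M3", "DS V4-Pro Max", "K2.7-Code", "GLM-5.1", "Q3.7-Max",
    "Q3.7-Plus", "MiMo-V2.5-Pro", "Step-3.7", "Rio 3.5",
]

# A scored cell is a number with an optional "(note)" suffix, e.g. "66.0 (2.1)".
CELL = re.compile(r"^(\d+(?:\.\d+)?)(?:\s*\(([^)]*)\))?$")


def parse_cell(cell):
    """Return the numeric score, or None when the cell is "-"."""
    cell = cell.strip()
    if cell == "-":
        return None
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognised cell: {cell!r}")
    return float(match.group(1))


def best_in_row(cells):
    """Return (column, score) for the highest score, or None if the row is empty."""
    if len(cells) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} cells, got {len(cells)}")
    scored = [(col, parse_cell(cell)) for col, cell in zip(COLUMNS, cells)]
    scored = [(col, score) for col, score in scored if score is not None]
    if not scored:
        return None
    return max(scored, key=lambda item: item[1])


if __name__ == "__main__":
    swe_bench_pro = ["59.0", "55.4", "-", "58.4", "60.6", "57.6", "57.2", "56.3", "58.1"]
    print(best_in_row(swe_bench_pro))
```

Rows that mix benchmark versions, such as Terminal-Bench 2.0 / 2.1, are better compared within a single version, so a plain maximum over the row should be read with that caveat.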

Next.js agent evals (nextjs.org/evals · independent)

Vercel Next.js code gen & migration tasks (OpenCode / Claude Code / etc.) · Green = best success % in row · Orange = fastest avg time

Official harness measuring pass rate on Next.js generation and migration workloads. Not comparable to the SWE-bench-style vendor tables above. The "Success w/ AGENTS.md" row is the success rate when agents had bundled Next.js docs.

Metric | M3 | DS V4-Pro | K2.7-Code | GLM-5.1 | Q3.7-Max | Q3.7-Plus | MiMo-V2.5 | Step-3.7 | Rio 3.5
Success rate | 75% | - | - | 75% | - | - | - | - | -
Success w/ AGENTS.md | 96% | - | - | 100% | - | - | - | - | -
Avg execution time | 181.30s | - | - | 254.36s | - | - | - | - | -
Harness (on eval) | OpenCode | - | - | OpenCode | - | - | - | - | -

Only MiniMax M3 and GLM-5.1 on that leaderboard match models in our matrix. Kimi K2.7-Code, Qwen3.7, DeepSeek, MiMo, Step, and Rio are not listed yet. The same leaderboard still shows the older Kimi K2.5 (21% / 135s) and MiniMax M2.7 (50% / 294s).
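
As a quick worked example of reading the two success rows together, the sketch below computes the percentage-point gain each model gets from bundled docs. The figures are copied from the table; the `RESULTS` dictionary and `agents_md_uplift` helper are illustrative names, and the sketch assumes both rows are success rates on the same task set.

```python
# Values copied from the Next.js agent eval table.
RESULTS = {
    "MiniMax M3": {"success": 75, "success_agents_md": 96, "avg_seconds": 181.30},
    "GLM-5.1": {"success": 75, "success_agents_md": 100, "avg_seconds": 254.36},
}


def agents_md_uplift(result):
    """Percentage-point gain in success rate when AGENTS.md docs are bundled."""
    return result["success_agents_md"] - result["success"]


if __name__ == "__main__":
    for model, result in RESULTS.items():
        print(
            f"{model}: +{agents_md_uplift(result)} points with AGENTS.md, "
            f"{result['avg_seconds']:.2f}s average run"
        )
```

Both models start from the same 75% baseline, so the difference between them on this leaderboard comes from the docs-assisted run and from execution time.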

Independent evals (third-party leaderboards)

LMArena crowd Elo (public chat APIs) · LiveBench (monthly refresh, contamination-resistant) · Open LLM Leaderboard Average (open-weight HF submissions only) · EQ-Bench Creative Writing v3 Elo · Green = best in row

LMArena via arena-ai-leaderboards snapshot 2026-06-02 (top 20 per board). API-only columns (Q3.7-Max, Q3.7-Plus) get LMArena / LiveBench where listed; OLLB row is open-weight only.

Metric | M3 | DS V4-Pro | K2.7-Code | GLM-5.1 | Q3.7-Max | Q3.7-Plus | MiMo-V2.5 | Step-3.7 | Rio 3.5
LMArena Text Elo | - | - | - | 1474 | 1475 | - | - | - | -
LMArena Code Elo | - | 1464 | - | 1533 | 1541 | - | 1471 | - | -
LiveBench global avg | - | 73.58 | - | 70.18 | 74.29 | - | - | - | -
LiveBench agentic coding | - | 56.67 | - | 55.00 | 51.67 | - | - | - | -
Open LLM LB Average | TBD | - | - | - | - | - | - | - | -
EQ-Bench Creative Writing Elo | - | 1569.9 | - | 1644.8 | - | - | - | - | -

LMArena arena IDs: Text qwen3.7-max-preview (1475), glm-5.1 (1474); Code qwen3.7-max-20260517 (1541), glm-5.1 (1533), kimi-k2.6 (1518), deepseek-v4-pro-thinking (1464), mimo-v2.5-pro (1471). Kimi / DeepSeek / MiMo are not in the Text top 20.
Kimi column: now K2.7-Code (code-specialized, Jun 2026), which is not yet on LMArena, LiveBench, or EQ-Bench; the prior K2.6 Elos were removed.
LiveBench: from livebench.ai (Jun 2026 crawl).
OLLB: none of the five submitted HF weights appear in open-llm-leaderboard/contents yet (checked parquet, 4576 entries).
EQ-Bench: Elo from eqbench.com/creative_writing (Kimi-K2.6, DeepSeek-V4-Pro, GLM-5.1 HF entries only).
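
To put the Elo gaps in perspective, here is a minimal sketch that converts a rating difference into an expected head-to-head win rate. It assumes the conventional Elo logistic curve with a 400-point scale; the leaderboards may fit their ratings differently, so the result is a rough reading, not an official figure. The `expected_win_rate` helper is an illustrative name.

```python
def expected_win_rate(rating_a, rating_b, scale=400.0):
    """Expected score of A against B under the standard Elo logistic curve.

    Assumption: ratings behave like classic Elo with the given scale.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # Ratings copied from the LMArena rows.
    pairs = [
        ("Q3.7-Max vs GLM-5.1 (Text)", 1475, 1474),
        ("Q3.7-Max vs GLM-5.1 (Code)", 1541, 1533),
        ("Q3.7-Max vs DS V4-Pro (Code)", 1541, 1464),
    ]
    for label, rating_a, rating_b in pairs:
        print(f"{label}: {expected_win_rate(rating_a, rating_b):.1%}")
```

Under this assumption a one-point gap, such as 1475 vs 1474 on the Text board, is effectively a coin flip, so small Elo differences should not be read as a clear ranking.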

Sources