
GLM-5 vs Gemini 3 Pro vs MiniMax M2.5: The Best Open-Source AI Models Compared


Aastha Mishra
February 25, 2026

Overview

Three powerful AI models dropped in quick succession in late 2025 and early 2026. GLM-5 was officially released on February 11, 2026, by Z.ai (Zhipu AI). Gemini 3 Pro arrived earlier, on November 18, 2025, from Google. MiniMax released M2.5 as open source on February 11, 2026.

All three are built for serious, real-world AI work — coding, reasoning, long-horizon agent tasks, and document generation. But they differ sharply in size, price, strengths, and who should use them.

This comparison gives you the full picture across benchmarks, pricing, architecture, and use cases so you can pick the right model for your needs.


At a Glance: Key Specs Compared

| Feature | GLM-5 | Gemini 3 Pro | MiniMax M2.5 |
|---|---|---|---|
| Developer | Zhipu AI / Z.ai | Google DeepMind | MiniMax |
| Release Date | Feb 11, 2026 | Nov 18, 2025 | Feb 11–12, 2026 |
| Total Parameters | 744B | Unknown (proprietary) | 230B |
| Active Parameters | 40B | Unknown | 10B |
| Architecture | MoE | Unknown (proprietary) | MoE |
| Context Window | 200K tokens | 1M tokens | 200K tokens |
| License | MIT (open-source) | Proprietary | Modified MIT |
| Multimodal | No (text + code) | Yes (text, image, video, audio) | No (text only) |
| API Input Price | ~$1.00/M tokens | $2.00/M tokens | $0.15–$0.30/M tokens |
| API Output Price | ~$3.20/M tokens | $12.00/M tokens | $1.20–$2.40/M tokens |

Benchmark Comparison

This is where the models differ the most. Here is how they score on the benchmarks that matter.

Reasoning & Knowledge

| Benchmark | GLM-5 | Gemini 3 Pro | MiniMax M2.5 |
|---|---|---|---|
| Humanity's Last Exam (no tools) | 30.5% | 37.5% | 28% |
| Humanity's Last Exam (with tools) | 50.4% | — | — |
| GPQA Diamond | 86.0% | 91.9% | 62% |
| AIME 2026 | 92.7% | 90.6% | — |
| MMLU | 87.5% | — | — |

Gemini 3 Pro leads reasoning benchmarks, achieving PhD-level performance with a top score of 91.9% on GPQA Diamond. GLM-5 scores 86.0% on GPQA Diamond and 92.7% on AIME 2026, essentially matching Gemini 3 Pro on math.

Coding & Software Engineering

| Benchmark | GLM-5 | Gemini 3 Pro | MiniMax M2.5 |
|---|---|---|---|
| SWE-Bench Verified | 77.8% | 78% | 80.2% |
| Multi-SWE-Bench | 73.3% | 42.7% | 51.3% |
| Terminal-Bench 2.0 | 56.2% | — | — |
| BrowseComp | 75.9% | 76.3% | — |
| BFCL (Tool Calling) | — | — | 76.8% |

MiniMax M2.5 scores 80.2% on SWE-Bench Verified, placing it within 0.6 percentage points of Claude Opus 4.6, and posts 51.3% on Multi-SWE-Bench. GLM-5 hits 77.8% on SWE-Bench Verified, making it the top open-source model on that benchmark at the time of its release, and leads this group on Multi-SWE-Bench at 73.3%.

Multimodal & Special Tasks

| Benchmark | GLM-5 | Gemini 3 Pro | MiniMax M2.5 |
|---|---|---|---|
| MMMU-Pro (multimodal) | — | 81% | — |
| Video-MMMU | — | 87.6% | — |
| ARC-AGI-2 | — | 45.1% (Deep Think) | — |
| LMArena Elo | — | 1501 | — |
| Vending Bench 2 | $4,432 final balance | $5,478 final balance | — |

Gemini 3 Pro tops the LMArena Leaderboard with a score of 1501 Elo and leads on multimodal benchmarks with 81% on MMMU-Pro and 87.6% on Video-MMMU. This is a category where GLM-5 and MiniMax M2.5 simply do not compete — both are text and code only.


Architecture Deep Dive

GLM-5: Scale and Efficiency

At the heart of GLM-5 is a massive leap in raw parameters. The model scales from the 355B parameters of GLM-4.5 to 744B parameters, with 40B active per token in its Mixture-of-Experts architecture.

GLM-5 integrates DeepSeek Sparse Attention (DSA), supporting a 200K-token context window, and was trained entirely on Huawei Ascend chips using the MindSpore framework, with zero dependency on NVIDIA hardware. This makes it geopolitically significant — a frontier model built entirely outside the US chip ecosystem.

One major technical innovation is "Slime," Z.ai's novel asynchronous reinforcement learning system that made training a 744B model at scale actually feasible.

Gemini 3 Pro: The Multimodal Flagship

Gemini 3 Pro is Google's answer to the question: what happens when you combine reasoning, multimodality, and a massive context window? It features a 1 million-token input context window with 64K output tokens and uses dynamic thinking by default to reason through prompts.

It was built to seamlessly synthesize information across text, images, video, audio, and code. No other model in this comparison comes close to that breadth. The tradeoff is that it is proprietary — you cannot download or self-host it.
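To make a 1M-token window concrete, a rough back-of-the-envelope conversion helps. The figures below rest on the common heuristic of roughly 0.75 English words per token and 500 words per page; both constants vary with tokenizer and content, so treat this as an order-of-magnitude sketch, not a spec:

```python
# Rough heuristics only: real ratios depend on the tokenizer and the text.
WORDS_PER_TOKEN = 0.75   # typical for English prose
WORDS_PER_PAGE = 500     # typical single-spaced page

def pages_that_fit(context_tokens: int) -> int:
    """Approximate how many pages of prose fit in a context window."""
    return int(context_tokens * WORDS_PER_TOKEN / WORDS_PER_PAGE)

print(pages_that_fit(1_000_000))  # ~1500 pages for Gemini 3 Pro's window
print(pages_that_fit(200_000))    # ~300 pages for GLM-5 / M2.5
```

In other words, the 1M-token window is roughly a five-fold jump in raw document capacity over the other two models.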

MiniMax M2.5: Efficiency as a Strategy

M2.5 is a 230B MoE model with only 10B active parameters per forward pass, trained using the Forge RL framework across 200,000+ real-world environments.

MiniMax developed a proprietary Reinforcement Learning framework called Forge, designed to help the model learn from real-world environments — essentially letting the AI practice coding and using tools in thousands of simulated workspaces. This is why M2.5 punches so far above its parameter count on coding tasks.
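The MoE mechanics behind both GLM-5 and M2.5 can be sketched in a few lines. This is a toy illustration of top-k expert routing in plain Python, not either lab's actual implementation; all shapes and the router design here are simplified for clarity:

```python
import math
import random

def moe_forward(x, gate_w, expert_w, k=2):
    """Toy Mixture-of-Experts layer: route one token to its top-k experts.

    x:        token hidden vector (length d)
    gate_w:   router weights, one row per expert (n_experts x d)
    expert_w: one d x d matrix per expert (n_experts x d x d)
    Only k experts execute per token -- the "active parameters" idea.
    """
    # Router scores: one logit per expert (dot product with the token).
    logits = [sum(g * xi for g, xi in zip(row, x)) for row in gate_w]
    # Keep the k highest-scoring experts; the rest never run.
    top_k = sorted(range(len(logits)), key=logits.__getitem__)[-k:]
    # Softmax over just the selected experts' logits.
    exps = [math.exp(logits[i]) for i in top_k]
    weights = [e / sum(exps) for e in exps]
    # Combine the chosen experts' outputs, weighted by the router.
    out = [0.0] * len(x)
    for w, i in zip(weights, top_k):
        for r, row in enumerate(expert_w[i]):
            out[r] += w * sum(m * xi for m, xi in zip(row, x))
    return out, top_k

rng = random.Random(0)
d, n_experts = 8, 16
x = [rng.gauss(0, 1) for _ in range(d)]
gate_w = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_experts)]
expert_w = [[[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]
            for _ in range(n_experts)]

out, used = moe_forward(x, gate_w, expert_w, k=2)
print(len(out), len(used))  # only 2 of 16 experts ran for this token
```

At M2.5's scale, this routing is the difference between touching 230B parameters per token and 10B, which is why such a large model can still be served fast and cheap.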


Pricing Comparison

Cost is where these three models diverge the most dramatically.

| Pricing | GLM-5 | Gemini 3 Pro | MiniMax M2.5 Standard | MiniMax M2.5 Lightning |
|---|---|---|---|---|
| Input (per 1M tokens) | $1.00 | $2.00 | $0.15 | $0.30 |
| Output (per 1M tokens) | $3.20 | $12.00 | $1.20 | $2.40 |
| Speed | ~17–19 tok/s | 50 tok/s | — | 100 tok/s |
| Approx. cost vs Claude Opus 4.6 | ~6x cheaper | — | ~20x cheaper | ~10x cheaper |

M2.5-Lightning generates 100 tokens per second, making it twice as fast as other top models, and MiniMax claims one hour of continuous operation costs just one dollar.

GLM-5 is approximately 6x cheaper on input and nearly 10x cheaper on output than Claude Opus 4.6.

Gemini 3 Pro is the most expensive of the three, but it is also the only one offering true multimodal capabilities, which justifies the premium for certain use cases.


What Each Model Does Best

GLM-5 Excels At:

  • Long-horizon agentic tasks and complex systems engineering
  • Record-low hallucination rate — best in the industry for "knowing what it doesn't know"
  • Web research and information retrieval (leads on BrowseComp among open models)
  • Document generation (native .docx, .pdf, .xlsx output in agent mode)
  • Self-hosted deployment on non-NVIDIA hardware

GLM-5 achieved a score of -1 on the AA-Omniscience Index — a 35-point improvement over its predecessor — meaning it now leads the entire AI industry in knowledge reliability by knowing when to abstain rather than fabricate information.

Gemini 3 Pro Excels At:

  • Multimodal tasks involving images, video, and audio
  • Very long documents requiring a 1M-token context window
  • Abstract reasoning (ARC-AGI-2) and scientific problem-solving
  • Tasks where you need a hosted, managed API with no self-hosting

Gemini 3 with Deep Think mode achieves 45.1% on ARC-AGI-2, demonstrating its ability to solve novel challenges. That is a benchmark score that measures genuine reasoning on problems the model has never seen — not just pattern matching.

MiniMax M2.5 Excels At:

  • Cost-sensitive agentic coding at scale
  • Multi-turn tool calling (leads all three models on BFCL)
  • Office productivity tasks: Word, Excel, PowerPoint automation
  • High-volume, 24/7 autonomous agent deployment

In benchmarks, M2.5 outperforms Claude Opus 4.6, GPT-5.2, and Gemini 3 Pro on web search and office tasks, at ten to twenty times lower cost.


Who Should Use Which Model?

| Use Case | Best Model |
|---|---|
| Multimodal tasks (images, video, audio) | Gemini 3 Pro |
| Very long documents (500K–1M tokens) | Gemini 3 Pro |
| Abstract reasoning and science | Gemini 3 Pro |
| Open-source, self-hosted coding agent | GLM-5 or MiniMax M2.5 |
| Record-low hallucination / factual reliability | GLM-5 |
| Non-NVIDIA hardware deployment | GLM-5 |
| High-volume agentic coding on a budget | MiniMax M2.5 |
| 24/7 autonomous agent operation | MiniMax M2.5 Lightning |
| Office automation (Word, Excel, PPT) | MiniMax M2.5 |
| Multi-turn tool calling | MiniMax M2.5 |
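If you route requests across providers, the decision guide above collapses into a trivial lookup. The use-case keys below are our own shorthand labels, purely illustrative:

```python
# Illustrative only: the decision guide encoded as a lookup table.
BEST_MODEL = {
    "multimodal":              "Gemini 3 Pro",
    "long_documents":          "Gemini 3 Pro",
    "abstract_reasoning":      "Gemini 3 Pro",
    "self_hosted_coding":      "GLM-5 or MiniMax M2.5",
    "factual_reliability":     "GLM-5",
    "non_nvidia_hardware":     "GLM-5",
    "budget_agentic_coding":   "MiniMax M2.5",
    "always_on_agents":        "MiniMax M2.5 Lightning",
    "office_automation":       "MiniMax M2.5",
    "multi_turn_tool_calling": "MiniMax M2.5",
}

def pick_model(use_case: str) -> str:
    """Return the recommended model, or a default for unknown use cases."""
    return BEST_MODEL.get(use_case, "no single recommendation")

print(pick_model("always_on_agents"))  # MiniMax M2.5 Lightning
```

A real router would also weigh latency, data-residency, and fallback behavior, but the point stands: the choice is driven by use case, not by a single leaderboard.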

Open-Source Status

This is a critical difference if your organization needs model transparency, customization, or data sovereignty.

| Model | License | Download Weights | Commercial Use |
|---|---|---|---|
| GLM-5 | MIT | Yes (Hugging Face) | Yes, unrestricted |
| Gemini 3 Pro | Proprietary | No | API only |
| MiniMax M2.5 | Modified MIT | Yes (Hugging Face) | Yes, with branding requirement |

MiniMax made the model available on Hugging Face under a modified MIT License that requires commercial users to prominently display "MiniMax M2.5" on the user interface of any product or service built with it.

GLM-5's standard MIT license is the most permissive — no branding requirements, no restrictions.


Limitations to Know

GLM-5: Inference speed of 17–19 tok/s is noticeably slower than NVIDIA-backed competitors, and there is a 9-point deficit on Terminal-Bench 2.0 versus leading proprietary models. Some early users have also noted it is "less situationally aware" despite high benchmark scores.

Gemini 3 Pro: Proprietary — no self-hosting, no weight access. Most expensive of the three. Knowledge cutoff is January 2025.

MiniMax M2.5: Text-only. The model does not support image input or any other modality. It is also more verbose than average, generating significantly more tokens per response, which partially offsets its low per-token price.


The Bigger Picture

All three models arrived as part of a broader wave of AI releases in early 2026. Chinese AI labs — Zhipu AI and MiniMax — are demonstrating that frontier-class performance no longer requires US-manufactured silicon or closed-source infrastructure.

GLM-5 is proof that frontier AI performance no longer requires American silicon or closed-source moats — every parameter was trained on 100,000 Huawei Ascend 910B chips using the MindSpore framework.

MiniMax says that 30% of all tasks at MiniMax HQ are completed by M2.5, and 80% of their newly committed code is generated by M2.5.

Gemini 3 Pro remains the leader for multimodal reasoning and pure benchmark dominance, but at a price and accessibility tradeoff that matters for many teams.


Conclusion

There is no single "best" model here — it depends entirely on your needs.

Choose Gemini 3 Pro if you need multimodal capabilities, a 1M-token context window, or top scores on abstract reasoning. Choose GLM-5 if you need an open-weight model with record-low hallucination rates, strong agentic performance, and hardware flexibility. Choose MiniMax M2.5 if cost efficiency is your top priority and you are running high-volume coding agents or office automation workflows.

The most exciting takeaway: as of February 2026, you can access near-frontier AI coding performance for as little as $1 per hour of continuous operation. That changes what is economically possible to build.