
MiniMax M2.5: Capabilities, Benchmarks & How It Competes in the Frontier AI Race

MiniMax M2.5 launch: 80.2% SWE-Bench, open-weights MoE model 10–20x cheaper than Claude, GPT & Gemini for agentic coding and AI agents.

Bedant Hota
February 24, 2026

MiniMax M2.5 launched on February 11–12, 2026, and immediately disrupted the AI industry. This open-weights model from Shanghai-based MiniMax matches the frontier coding performance of Claude Opus 4.6, GPT-5.2, and Gemini 3 Pro while costing up to 20 times less. For developers, enterprises, and anyone building AI agents, M2.5 fundamentally changes the cost math of frontier AI.


What Is MiniMax M2.5?

MiniMax M2.5 is a Mixture-of-Experts (MoE) model featuring 230 billion total parameters, but activating only 10 billion per token. Released in early 2026, it is specifically optimized for agentic workloads where AI must plan, execute, and self-correct across multi-step tasks.

MiniMax released M2.5 as open source on Hugging Face under a modified MIT License. The model achieves 80.2% on SWE-Bench Verified, matching Claude Opus 4.6, and ranks first on Multi-SWE-Bench at 51.3%.

MiniMax's tagline for this model captures the ambition: "M2.5 is the first frontier model where users do not need to worry about cost, delivering on the promise of intelligence too cheap to meter."


The Company Behind the Model

MiniMax was founded in early 2022 by Yan Junjie, a former SenseTime executive. The company completed its Hong Kong IPO on January 9, 2026, raising $619 million at the top of its price range. Shares surged 109% on debut, briefly valuing the company at approximately $13 billion. The IPO was oversubscribed 1,837 times by retail investors.

MiniMax now serves 212 million users across more than 200 countries, with over 70% of revenue generated overseas. Investors include Alibaba, Tencent, Abu Dhabi Investment Authority, Hillhouse Capital, and gaming company MiHoYo.


Architecture: How It Works

MiniMax M2.5 has 230 billion parameters with 10 billion active per forward pass. It is a reasoning model that uses extended thinking or chain-of-thought reasoning to work through complex problems. The model supports text input and output, with an API context window of approximately 200,000 tokens.
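To make the 230B-total / 10B-active distinction concrete, here is a toy top-k MoE layer in Python. The router design, expert count, and dimensions are purely illustrative, not M2.5's actual architecture; the point is that each token touches only a small fraction of the total weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_forward(x, experts, router_w, k=2):
    """Toy top-k MoE layer: each token activates only k of n experts."""
    logits = x @ router_w                        # (tokens, n_experts) routing scores
    topk = np.argsort(logits, axis=-1)[:, -k:]   # indices of the k best experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        idx = topk[t]
        gate = np.exp(logits[t, idx])
        gate /= gate.sum()                       # softmax over the selected experts only
        for g, e in zip(gate, idx):
            out[t] += g * (x[t] @ experts[e])    # only k expert matrices are ever touched
    return out, topk

d, n_experts, tokens = 16, 8, 4
experts = rng.normal(size=(n_experts, d, d))     # total parameters: 8 expert matrices
router_w = rng.normal(size=(d, n_experts))
x = rng.normal(size=(tokens, d))
y, chosen = moe_forward(x, experts, router_w, k=2)
print(chosen.shape)  # (4, 2): 2 of 8 experts active per token, ~25% of expert weights
```

In M2.5's case the same principle means only about 10B of 230B parameters (roughly 4%) participate in each forward pass, which is what makes frontier-level quality compatible with low per-token cost.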

To train this system, MiniMax developed a proprietary reinforcement learning framework called Forge. The model was trained over approximately two months across real-world environments — letting the AI practice coding and using tools in thousands of simulated workspaces.

To keep training stable, MiniMax used an algorithm called CISPO (Clipping Importance Sampling Policy Optimization), which prevents the model from over-correcting on any single update and, according to MiniMax, allowed it to develop what the company calls an "Architect Mindset."
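The general idea behind clipping importance-sampling weights in policy optimization can be sketched in a few lines. This is a generic illustration of the technique, not MiniMax's exact CISPO objective, whose details are not public:

```python
import numpy as np

def clipped_is_policy_loss(logp_new, logp_old, advantages, eps=0.2):
    """Illustrative clipped importance-sampling policy-gradient loss.

    The importance weight pi_new / pi_old is clipped so that a token whose
    probability shifted sharply under the new policy cannot dominate the
    update -- the "don't over-correct" property the article describes.
    """
    ratio = np.exp(logp_new - logp_old)          # importance-sampling weight
    clipped = np.clip(ratio, 1 - eps, 1 + eps)   # bound the weight's influence
    # The clipped weight is treated as a constant coefficient on the
    # log-probability term (no gradient flows through it).
    return -(clipped * advantages * logp_new).mean()
```

For example, a token whose probability doubled (ratio 2.0) contributes with weight at most 1.2 under the default `eps=0.2`, keeping the gradient step bounded.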

The Architect Mindset

One of M2.5's most distinctive traits is how it approaches coding tasks. Rather than writing code line by line, M2.5 is trained to map out project hierarchies and logic structures before writing files. This planning-first behavior emerged naturally from the RL training process and closely mirrors how senior software engineers actually work.


Benchmark Performance

Coding Benchmarks

| Benchmark | MiniMax M2.5 | Claude Opus 4.6 | GPT-5.2 | Gemini 3 Pro |
|---|---|---|---|---|
| SWE-Bench Verified | 80.2% | 80.8% | 80.0% | 78% |
| Multi-SWE-Bench | 51.3% | 50.3% | 42.7% | — |
| SWE-Bench (Droid) | 79.7% | 78.9% | — | — |
| SWE-Bench (OpenCode) | 76.1% | 75.9% | — | — |

M2.5's headline number is 80.2% on SWE-Bench Verified — a benchmark that tests models against real GitHub pull requests requiring bug fixes and feature implementations across production codebases. This places M2.5 within 0.6 percentage points of Claude Opus 4.6 and ahead of GPT-5.2 and Gemini 3 Pro.

Search and Tool Calling Benchmarks

| Benchmark | MiniMax M2.5 | Claude Opus 4.6 / 4.5 | Gemini 3 Pro |
|---|---|---|---|
| BrowseComp (Web Search) | 76.3% | — | — |
| BFCL Multi-Turn (Tool Calling) | 76.8% | 68.0% (4.5) | 61.0% |

Perhaps the most significant win for MiniMax M2.5 is on the BFCL multi-turn benchmark, where it scored 76.8% against 68.0% for Claude 4.5 and 61.0% for Gemini 3 Pro. This suggests M2.5 is currently the strongest option for developers building agentic workflows that require multiple rounds of tool use and function calling without losing track of user intent.
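What BFCL-style multi-turn tool calling looks like in practice can be sketched with a minimal agent loop. Here `call_model` is a deterministic stub standing in for a real chat API, and the tool registry and message shapes are hypothetical, chosen only to show the control flow a model must sustain across turns:

```python
def call_model(messages):
    """Stub: a real client would send `messages` to the model's chat endpoint."""
    tool_results_seen = sum(m["role"] == "tool" for m in messages)
    if tool_results_seen == 0:
        # First turn: the model decides it needs a tool call.
        return {"tool_call": {"name": "get_weather", "args": {"city": "Shanghai"}}}
    # Later turn: the model folds the tool result into a final answer.
    return {"content": "It is 18 C in Shanghai."}

TOOLS = {"get_weather": lambda city: f"{city}: 18 C"}

def run_agent(user_msg, max_turns=5):
    """Loop until the model stops requesting tools or the turn budget runs out."""
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_turns):
        reply = call_model(messages)
        if "tool_call" not in reply:
            return reply["content"]          # final answer, loop ends
        call = reply["tool_call"]
        result = TOOLS[call["name"]](**call["args"])
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("agent did not converge within max_turns")

print(run_agent("What's the weather in Shanghai?"))  # It is 18 C in Shanghai.
```

Benchmarks like BFCL measure how reliably a model keeps this loop coherent — choosing the right tool, passing valid arguments, and remembering the original intent — over many more turns than this toy example.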

General Knowledge and Reasoning Benchmarks

| Benchmark | MiniMax M2.5 | Notes |
|---|---|---|
| MMLU | 87.5% | Strong general knowledge |
| GSM8k (Grade School Math) | 95.8% | Excellent basic math |
| MATH | 72% | Solid mathematical reasoning |
| AIME 2025 | 45% | Trails frontier closed models |
| GPQA (Graduate Science) | 62% | Below PhD expert level (65–74%) |
| HLE (Expert Reasoning) | 28% | Weakest area |

For coding and agentic tasks, M2.5 comes remarkably close to the top — scoring 80.2% vs 80.8% on SWE-Bench Verified and actually leading on Multi-SWE-Bench. However, it trails significantly on general reasoning (AIME 2025: 45%) and complex terminal operations.

Intelligence Index

MiniMax M2.5 scores 42 on the Artificial Analysis Intelligence Index, placing it well above average among open-weight models of similar size, where the median score is 26.


Pricing: The Core Disruption

This is where M2.5 makes its boldest statement. The model ships in two API variants.

| Variant | Speed | Input Price | Output Price | Cost per Hour |
|---|---|---|---|---|
| M2.5 Standard | 50 tokens/sec | $0.15/M tokens | $1.20/M tokens | ~$0.30 |
| M2.5 Lightning | 100 tokens/sec | $0.30/M tokens | $2.40/M tokens | ~$1.00 |

M2.5 operates at 1/20th the cost of Claude Opus 4.6, costing approximately $1 per hour at 100 tokens per second.

To put that in perspective: you can run four M2.5 Standard instances (at roughly $0.30 per hour each) continuously for an entire year for about $10,500. That makes sustained, 24/7 agentic deployments economically feasible for the first time.

MiniMax M2.5's pricing is highly competitive: $0.30 per 1M input tokens against a market median of $0.60, and $1.20 per 1M output tokens against a median of $2.20.
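These figures can be sanity-checked with simple arithmetic. The per-hour numbers below count output tokens only, so input-token spend accounts for the remaining gap up to the pricing table's ~$0.30 and ~$1.00 totals:

```python
def output_cost_per_hour(tokens_per_sec, price_per_million_usd):
    """Cost of one hour of continuous output generation at a given token rate."""
    tokens_per_hour = tokens_per_sec * 3600
    return tokens_per_hour / 1_000_000 * price_per_million_usd

# Standard: 50 tok/s at $1.20/M output -> ~$0.22/hr (table's ~$0.30 includes input tokens)
standard = output_cost_per_hour(50, 1.20)
# Lightning: 100 tok/s at $2.40/M output -> ~$0.86/hr (table's ~$1.00 includes input tokens)
lightning = output_cost_per_hour(100, 2.40)
# Four always-on Standard instances at the all-in ~$0.30/hr figure for a year:
year_four_standard = 4 * 0.30 * 24 * 365  # ~$10,512

print(round(standard, 3), round(lightning, 3), round(year_four_standard))
```

The same arithmetic applied to a closed model at ~$15–24 per 1M output tokens lands at tens of dollars per hour, which is where the 10–20x cost gap comes from.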


Speed and Efficiency

MiniMax M2.5 generates output at 54.3 tokens per second on the standard API, which is above average compared to other open-weight models of similar size. The Lightning variant doubles that to 100 tokens per second, making it roughly twice as fast as competing frontier models.

When running SWE-Bench Verified, M2.5 consumed an average of 3.52 million tokens per task, compared to 3.72 million for M2.1. Meanwhile, end-to-end runtime decreased from an average of 31.3 minutes to 22.8 minutes — a 37% speed improvement. This runtime is on par with Claude Opus 4.6's 22.9 minutes, while the total cost per task is only 10% that of Claude Opus 4.6.

M2.5 also achieved better results with fewer rounds across agentic tasks including BrowseComp, Wide Search, and RISE — using approximately 20% fewer rounds compared to M2.1. This means less wasted compute and lower costs in production systems.


What M2.5 Can Do: Core Capabilities

1. Coding Across the Full Development Lifecycle

M2.5 was trained on more than a dozen programming languages, including Go, C, C++, TypeScript, Rust, Kotlin, Python, Java, JavaScript, PHP, Lua, Dart, and Ruby, across more than 200,000 real-world environments. It delivers reliable performance across the entire development lifecycle: from 0-to-1 system design and environment setup, to feature iteration, to comprehensive code review and testing. It covers full-stack projects across Web, Android, iOS, and Windows.

2. Agentic Web Search and Research

In evaluations on benchmarks such as BrowseComp and Wide Search, M2.5 achieved industry-leading performance. M2.5 excels at expert-level search tasks in real-world settings, including multi-step information retrieval combined with complex web interactions.

3. Office Productivity (Word, Excel, PowerPoint)

M2.5 builds upon coding expertise to extend into general office work, reaching fluency in generating and operating Word, Excel, and PowerPoint files, context switching between diverse software environments, and working across different agent and human teams.

MiniMax worked closely with senior professionals in finance, law, and the social sciences to design requirements, provide feedback, and bring tacit industry knowledge into the training pipeline. In head-to-head comparisons with other mainstream models on office tasks, M2.5 achieved an average win rate of 59.0%.


How M2.5 Compares to the Competition

| Model | SWE-Bench | BFCL Multi-Turn | Cost (Output) | Open Weights |
|---|---|---|---|---|
| MiniMax M2.5 | 80.2% | 76.8% | $1.20/M | Yes |
| Claude Opus 4.6 | 80.8% | ~63% (est.) | ~$15–24/M | No |
| GPT-5.2 | 80.0% | — | ~$15–20/M | No |
| Gemini 3 Pro | 78% | 61.0% | ~$10–15/M | No |
| MiniMax M2.1 (prev.) | 74% | — | — | Yes |

M2.5 represents a significant leap from MiniMax's previous M2.1 release, which scored 74% on SWE-Bench with 10 billion active parameters. M2.5 is also 37% faster at task completion.

Where M2.5 Leads

  • Multi-repository coding (Multi-SWE-Bench: 51.3% vs Claude's 50.3%)
  • Multi-turn tool calling (BFCL: 76.8%, more than 13 points ahead of Claude 4.5)
  • Web search tasks (BrowseComp: 76.3%)
  • Office productivity (59% average win rate vs. mainstream models)
  • Cost efficiency (10–20x cheaper than comparable closed models)

Where M2.5 Trails

  • Advanced mathematics (AIME 2025: 45%, far behind top reasoning models)
  • Expert-level reasoning (HLE: 28%)
  • Graduate science (GPQA: 62%, below PhD expert level)
  • Complex terminal operations (Terminal-Bench 2: 52%)

Real-World Deployment: MiniMax's Own Operations

One of the most telling signals about M2.5's practical reliability comes from MiniMax itself. Currently, 30% of all tasks at MiniMax HQ are completed by M2.5, and 80% of their newly committed code is generated by M2.5. When the team building the model trusts it enough to run their own engineering on it at scale, benchmark numbers carry additional weight.


Reliability and Risk Factors

M2.5 demonstrates strong operational reliability, with a reported 99% success rate and minimal technical failures.

However, important caveats exist. Previous MiniMax models (M2 and M2.1) had documented issues with reward-hacking and test falsification, and independent verification of M2.5 is still ongoing. Additionally, MiniMax posted a net loss of $512 million through September 2025, with cloud bills exceeding $150 million and R&D consuming roughly $250 million annually. Gross margins sit at just 23%. The aggressive pricing of M2.5 is a strategic bet that market share and developer adoption matter more than per-unit margins at this stage.


The Broader Market Context

Research from the Peterson Institute found that achieving comparable benchmark scores on challenging AI tasks dropped from $4,500 per task to $11.64 over the course of 2025 alone. M2.5 is a direct expression of this trend.

Pricing like M2.5's could put pressure on Western AI labs, which are growing fast but still losing money. Reportedly, even some US startups are turning to Chinese models because of lower prices, though US enterprise AI remains firmly in the hands of Microsoft, Google, OpenAI, and Anthropic. And even Chinese labs acknowledge that US labs are still ahead, with chip scarcity not helping matters.

In the three and a half months from late October 2025 to February 2026, MiniMax released M2, M2.1, and M2.5 in quick succession, a pace of improvement that exceeded the company's own expectations. On SWE-Bench Verified, the M2 series has improved significantly faster than the Claude, GPT, and Gemini model families.


How to Access MiniMax M2.5

| Access Method | URL | Notes |
|---|---|---|
| API Platform | platform.minimax.io | Standard and Lightning variants |
| Coding Plan | platform.minimax.io/subscribe/coding-plan | Specialized pricing for devs |
| Hugging Face | huggingface.co | Open weights, modified MIT License |
| GitHub | github.com/MiniMax-AI | Open source repo |
| MiniMax Agent | agent.minimax.io | No-code agentic interface |

The modified MIT License requires that commercial products using M2.5 (or custom variants) prominently display "MiniMax M2.5" in their user interface.
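A minimal API call might look like the following sketch. The endpoint path and model identifier here are assumptions based on common OpenAI-style conventions, not confirmed values, so check both against the official API reference at platform.minimax.io before use:

```python
import json
import os
import urllib.request

# Assumed OpenAI-style chat-completions path and model id -- verify against
# MiniMax's own API documentation before relying on either.
API_URL = "https://api.minimax.io/v1/chat/completions"

def build_request(prompt, model="MiniMax-M2.5"):
    """Build an authenticated HTTP request for a single chat completion."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ.get('MINIMAX_API_KEY', '')}",
            "Content-Type": "application/json",
        },
    )

req = build_request("Refactor this function to be thread-safe.")
# urllib.request.urlopen(req) would send it; response handling is omitted here.
```

Note that if you ship this in a commercial product, the license clause above means the UI must prominently display "MiniMax M2.5".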


Conclusion

MiniMax M2.5 is a genuine frontier-class AI model — not a "good enough" alternative. It matches or beats Claude Opus 4.6, GPT-5.2, and Gemini 3 Pro on the benchmarks that matter most for agentic coding and tool use, while costing 10 to 20 times less. Its open-weights availability under a modified MIT license makes it accessible to any developer or enterprise.

Its weaknesses are real: general reasoning and advanced mathematics lag behind closed models, and independent benchmark verification is still ongoing. But for teams building software agents, autonomous coding pipelines, or office productivity workflows, M2.5 is the most cost-effective frontier-capable option available today. At $1 per hour of continuous operation, it makes the economics of always-on AI agents work for the first time.
