
MiniMax-M2 reboots the large-scale Mixture-of-Experts playbook with a very specific ambition: a “deep thinking” model that is actually deployable. At minimax-m2.com we operate the M2 model behind a managed SaaS stack so teams can ship agentic workflows, <think> reasoning traces, and OpenAI-compatible endpoints without wrangling GPU orchestration. This article distils what we have learned from benchmarking, reading the MiniMax releases, and running M2 in production-like settings.
Origins and Design Philosophy
MiniMax M2 is the latest flagship from the MiniMax research group. The team optimised for two opposing forces: frontier-level reasoning quality and practical inference latency. Their answer is a 230B-parameter MoE Transformer in which only ~10B parameters are active per token, aiming for the reasoning depth of a 200B+ dense model at serving costs closer to 30B–70B-class systems.
Key design traits:
- Sparse routing – 128 experts, 2 selected per token, with load-balanced gating to avoid the “hot expert” problem (a minimal gating sketch follows this list).
- Reasoning-aware pretraining – more than 20T tokens with staged curricula emphasising math, code, and tool-use late in training.
- Trajectory supervision – streamed <think> traces paired with the final answer so downstream products can display or hide “thought” segments independently.
- Extended context – 32K tokens by default, with YaRN rotary scaling up to 128K for retrieval-heavy workflows.
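To make the sparse-routing bullet concrete, here is a minimal sketch of top-2 gating over 128 experts with a standard load-balancing penalty. It illustrates the general technique only; the scoring, renormalisation, and auxiliary loss below follow textbook MoE practice and are not MiniMax’s actual implementation.

```ts
// Illustrative top-2 MoE gating sketch (not MiniMax's code).
// A router scores all experts per token, keeps the two best, renormalises
// their weights, and an auxiliary loss discourages "hot" experts.
const NUM_EXPERTS = 128;
const TOP_K = 2;

function softmax(logits: number[]): number[] {
  const max = Math.max(...logits);
  const exps = logits.map((l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Pick the top-k experts for one token and renormalise their gate weights.
function routeToken(routerLogits: number[]): { expert: number; weight: number }[] {
  const ranked = softmax(routerLogits)
    .map((p, expert) => ({ expert, weight: p }))
    .sort((a, b) => b.weight - a.weight)
    .slice(0, TOP_K);
  const norm = ranked.reduce((acc, r) => acc + r.weight, 0);
  return ranked.map((r) => ({ expert: r.expert, weight: r.weight / norm }));
}

// Switch-Transformer-style load-balancing loss over a batch of routing
// decisions: penalises (fraction of tokens routed to expert i) times
// (mean router probability for expert i); minimised by a uniform spread.
function loadBalanceLoss(batchLogits: number[][]): number {
  const tokensPerExpert = new Array(NUM_EXPERTS).fill(0);
  const meanProb = new Array(NUM_EXPERTS).fill(0);
  for (const logits of batchLogits) {
    softmax(logits).forEach((p, i) => (meanProb[i] += p / batchLogits.length));
    for (const { expert } of routeToken(logits)) {
      tokensPerExpert[expert] += 1 / (batchLogits.length * TOP_K);
    }
  }
  return NUM_EXPERTS * tokensPerExpert.reduce((acc, f, i) => acc + f * meanProb[i], 0);
}
```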
Training Stack & Optimisation Breakthroughs
MiniMax’s public research drops highlight several ingredients that stood out in our evaluations:
- Hybrid FP8/BF16 precision schedule keeps memory usage low while preserving training stability at this scale.
- Curriculum pivot halfway through training to reasoning-focused corpora, mirroring what successful RLHF pipelines later reward.
- Evo-CoT + policy optimisation to align the reasoning style; these steps make M2’s <think> output far less repetitive than earlier sparse models.
- Sparse-aware inference kernels tuned for H100/H200 GPU clusters, with TensorRT-LLM contributions already upstreamed.
For teams fine-tuning or adapter-tuning on top of M2, respecting its multi-stage prompt style is crucial. We have seen LoRA adapters converge quickly when system prompts explicitly request structured thinking (e.g., “First list constraints, then outline a plan, then produce the answer.”).
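As an example of that prompt style, the sketch below builds a chat payload whose system message asks for constraints, a plan, and then the answer. The wording and the ChatMessage type are our own illustration, not an official MiniMax template.

```ts
// Illustrative chat payload for a structured-thinking prompt.
// The system wording is our own convention, not an official MiniMax template.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

const messages: ChatMessage[] = [
  {
    role: "system",
    content:
      "You are a careful analyst. First list the constraints, " +
      "then outline a plan, then produce the final answer.",
  },
  {
    role: "user",
    content: "Estimate the monthly GPU budget for serving 50M tokens a day.",
  },
];
```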
Benchmark Highlights
MiniMax-M2 posts competitive numbers against both proprietary and open models:
- AIME 2025 – 70%+ accuracy with short reasoning chains.
- MMLU Redux / STEM – >90% and >88% respectively, beating recent DeepSeek and Kimi releases.
- OlympiadBench – 90+, signalling contest-level math competence.
- LiveCodeBench & SWE-Bench Verified – leading scores on code generation, patching, and artifact completeness.
- BrowseBench & BFCL Tool Use – strong out-of-the-box tool orchestration with minimal hand-crafted traces.
On minimax-m2.com, customers care about the practical translation of these numbers: fewer hallucinated steps in financial modelling, better traceability in regulated workflows (thanks to <think> traces), and real-time dashboards to audit token spend per workspace.
Deployment & Inference at minimax-m2.com
Running M2 yourself requires 8× H200-class GPUs or larger pools for sustained throughput. We abstract that complexity. Our stack combines:
- Next.js 15 + OpenNext for the dashboard and developer experience.
- Cloudflare Workers for edge authentication, rate limiting, and streaming.
- Autoscaled GPU pools (TensorRT-LLM and vLLM) exposing an OpenAI-compatible /v1/chat/completions API and an Anthropic-compatible /v1/messages API.
- Reasoning split toggles that automatically separate <think> traces from the final answer so you can render them however you want.
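To show what the reasoning split looks like from a client’s perspective, here is a minimal sketch that calls the OpenAI-compatible /v1/chat/completions endpoint and separates the <think> trace from the final answer. The base URL, model id, and environment variable names are placeholders; adjust them to your workspace.

```ts
// Minimal non-streaming call to an OpenAI-compatible /v1/chat/completions
// endpoint. Base URL, model id, and API key variable are placeholders.
const BASE_URL = process.env.MINIMAX_M2_BASE_URL ?? "https://api.example.com";

async function askM2(prompt: string): Promise<{ thinking: string; answer: string }> {
  const res = await fetch(`${BASE_URL}/v1/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.MINIMAX_M2_API_KEY}`,
    },
    body: JSON.stringify({
      model: "minimax-m2", // placeholder model id
      messages: [{ role: "user", content: prompt }],
    }),
  });
  if (!res.ok) throw new Error(`Request failed: ${res.status}`);

  const data = (await res.json()) as {
    choices: { message: { content: string } }[];
  };
  const content = data.choices[0].message.content;

  // Split the <think>…</think> trace from the final answer so the UI can
  // show or hide the reasoning independently.
  const match = content.match(/<think>([\s\S]*?)<\/think>/);
  return {
    thinking: match ? match[1].trim() : "",
    answer: content.replace(/<think>[\s\S]*?<\/think>/, "").trim(),
  };
}
```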
Customers can start with the hosted chat UI, save conversations (stored only for front-end display), and graduate to API usage when privacy policies require transient processing—API traffic is never stored after a response is delivered.
Economics & Pricing Signals
MiniMax quotes reference rates of roughly $0.50 per million input tokens and $1.50 per million output tokens for self-hosted deployments. Our metered billing mirrors that structure, with a generous launch-period free tier so teams can pilot the chat UI and API without incurring costs. Because API payloads are not retained, you can meet stricter compliance postures without bolting on extra data deletion flows.
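As a back-of-the-envelope illustration of that metering model, the sketch below turns daily token counts into a monthly estimate using the quoted reference rates; treat it as a planning aid, not a quote, since hosted plans and reserved tiers are priced separately.

```ts
// Rough cost estimate from the quoted reference rates
// ($0.50 per 1M input tokens, $1.50 per 1M output tokens).
const INPUT_USD_PER_MTOK = 0.5;
const OUTPUT_USD_PER_MTOK = 1.5;

function estimateMonthlyCost(
  inputTokensPerDay: number,
  outputTokensPerDay: number,
  days = 30
): number {
  const inputPerDay = (inputTokensPerDay / 1e6) * INPUT_USD_PER_MTOK;
  const outputPerDay = (outputTokensPerDay / 1e6) * OUTPUT_USD_PER_MTOK;
  return (inputPerDay + outputPerDay) * days;
}

// Example: 20M input and 5M output tokens per day ≈ $525/month.
console.log(estimateMonthlyCost(20e6, 5e6).toFixed(2));
```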
Need predictable capacity? We offer reserved throughput tiers with dedicated clusters, the same reasoning split controls, and custom SLAs for production workloads.
Best Practices for Builders
- Prompt for structure – Ask MiniMax-M2 to reason explicitly (“Plan → Decide → Answer”). The <think> output will reflect those steps and is easy to display in the UI we ship by default.
- Use the dashboard for chat history only – Front-end conversations are stored so you can pick up where you left off. If your policy forbids retention, turn it off in workspace settings or use the API, where we don’t persist payloads.
- Monitor budget – Token-based billing is transparent; our dashboards surface spend per workspace, persona, or API key.
- Safety overlays – Pair the built-in moderation with any application-specific guardrails (PII scrubbing, compliance filters).
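As one example of an application-specific guardrail, here is a deliberately simple PII-scrubbing pass you could run on prompts before they reach the API. The regex patterns are illustrative only; a production compliance filter needs locale-aware rules, allow-lists, and audit logging.

```ts
// Minimal illustrative PII scrubber applied to prompts before sending.
// The patterns below are simple examples, not a production filter.
const PII_PATTERNS: { label: string; pattern: RegExp }[] = [
  { label: "EMAIL", pattern: /[\w.+-]+@[\w-]+\.[\w.]+/g },
  { label: "PHONE", pattern: /\+?\d[\d\s().-]{7,}\d/g },
  { label: "SSN", pattern: /\b\d{3}-\d{2}-\d{4}\b/g },
];

function scrubPII(text: string): string {
  return PII_PATTERNS.reduce(
    (acc, { label, pattern }) => acc.replace(pattern, `[${label}]`),
    text
  );
}

// Example: "Email jane.doe@example.com" -> "Email [EMAIL]"
```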
Roadmap
We are actively working on:
- Deep reasoning UX – richer <think> viewers, collapsible traces, and diff comparisons between iterations.
- Team controls – RBAC, audit logs, and SOC 2 documentation for enterprises adopting MiniMax-M2.
- Adapter hosting – managed LoRA/QLoRA adapters so you can bring narrow domain knowledge without retraining the base model.
- Observation APIs – streaming hooks that expose expert routing stats, latency, and confidence measures for observability platforms.
Final Thoughts
With MiniMax-M2 we finally have a reasoning-first MoE that is both transparent (thanks to thought-stream support) and deployable. Pair it with the managed minimax-m2.com platform and your teams can focus on building differentiated agents, copilots, or research tools instead of managing GPUs or compliance workflows.
Ready to start? Create an account, verify your email, and run your first chat in minutes—or grab an API key and integrate MiniMax-M2 into your stack today.