Models
Frontier open-weight models deployed on sovereign Australian infrastructure. Benchmarked against the best. Honest about the gaps.
DeepSeek V4-Flash
Available · MIT Licence
Mixture-of-Experts with hybrid compressed attention. Released April 24, 2026.
Architecture
V4-Flash uses a Mixture-of-Experts architecture that activates only 13B of its 284B total parameters per token via top-2 routing. The hybrid attention mechanism combines Compressed Sparse Attention (CSA) for fine-grained token dependencies with Heavily Compressed Attention (HCA) for broad document-level understanding. At 1M-token context, this requires only 27% of the inference compute and 10% of the KV cache memory of previous-generation models. The result is frontier-adjacent capability at a fraction of the infrastructure cost.
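As a rough illustration of the top-2 routing described above (a toy sketch, not the model's actual router), the following scores one token against a small pool of experts and activates only the two best-scoring ones, renormalising their gate weights:

```python
import numpy as np

def top2_route(router_logits: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Pick the two highest-scoring experts per token and softmax
    their gate weights (illustrative top-2 MoE routing)."""
    top2 = np.argsort(router_logits, axis=-1)[..., -2:]       # expert indices
    gates = np.take_along_axis(router_logits, top2, axis=-1)  # their logits
    gates = np.exp(gates - gates.max(axis=-1, keepdims=True))
    gates = gates / gates.sum(axis=-1, keepdims=True)         # softmax over the 2
    return top2, gates

# One token scored against 8 experts: only 2 are activated.
logits = np.array([[0.1, 2.0, -1.0, 0.5, 1.8, 0.0, -0.3, 0.2]])
experts, weights = top2_route(logits)
```

Only the selected experts' feed-forward blocks run for that token, which is how 284B total parameters can cost roughly 13B parameters' worth of compute per token.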
Benchmarks
Compared against frontier closed models on coding, reasoning, and long-context tasks. All figures from the DeepSeek technical report and independent evaluations, April 2026.
Coding
| Benchmark | Measures | V4-Flash | V4-Pro | Claude 4.6 | GPT-5.4 |
|---|---|---|---|---|---|
| SWE-bench Verified | Real-world software eng. | 79.0% | 80.6% | 80.8% | ~78% |
| LiveCodeBench | Code generation accuracy | 91.6% | 93.5% | 88.8% | — |
| Terminal-Bench 2.0 | Multi-step terminal | 56.9% | 67.9% | 65.4% | 82.7% |
| Codeforces Elo | Competitive programming | — | 3,206 | — | 3,168 |
V4-Flash is within 2 points of Claude 4.6 on SWE-bench Verified. The gap opens on complex multi-step terminal tasks: Terminal-Bench drops 11 points from Pro to Flash. For standard code generation, review, and refactoring, Flash is effectively equivalent to frontier models.
Reasoning and Knowledge
| Benchmark | Measures | V4-Flash | V4-Pro | Claude 4.6 | GPT-5.4 |
|---|---|---|---|---|---|
| MMLU-Pro | Broad knowledge | Strong | Top tier | Top tier | Top tier |
| GPQA Diamond | Graduate-level science | Competitive | Competitive | Strong | Strong |
| HMMT 2026 | Competition maths | — | 95.2% | 96.2% | 97.7% |
| HLE | Expert cross-domain | — | 37.7% | 40.0% | 39.8% |
| SimpleQA-Verified | Factual recall | — | 57.9% | — | — |
Closed models retain an edge on the hardest reasoning benchmarks (HLE, HMMT). For standard enterprise analytical tasks — document analysis, summarisation, extraction, report generation — V4-Flash is more than sufficient. The gap matters only at the absolute frontier of mathematical and cross-domain reasoning.
Long Context
| Benchmark | Measures | V4-Pro | Claude 4.6 | Gemini 3.1 |
|---|---|---|---|---|
| MRCR 1M | Retrieval across 1M tokens | 83.5 | 92.9 | 76.3 |
| CorpusQA 1M | Document QA at 1M tokens | 62.0% | — | 53.8% |
Retrieval accuracy is 94% at 128K tokens, 82% at 512K, and 66% at 1M. For most enterprise workloads operating at 128–256K context (annual reports, legal contracts, compliance documents), retrieval is strong. At extreme lengths, expect some degradation. V4-Pro beats Gemini 3.1 on both long-context benchmarks, while Claude 4.6 leads on MRCR.
Reasoning modes
Three levels of reasoning effort, configurable per request. Trade speed for depth depending on the task.
Non-think
Default mode
Fast, direct responses. Best for classification, extraction, routing, simple Q&A, and high-throughput batch processing.
Lowest token consumption
Think High
reasoning_effort: "high"
Extended chain-of-thought reasoning. Best for analysis, complex coding, research tasks, and multi-step problem solving.
Moderate token overhead
Think Max
reasoning_effort: "max"
Maximum reasoning budget. Best for competitive programming, deep research, and the hardest analytical tasks. Requires 384K+ context allocation.
Highest token consumption
Thinking tokens are billed at the same per-token rate as standard output tokens. Think Max generates more output tokens, so cost per request is higher even though the rate is the same.
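A request with the effort level set might look like the sketch below. The endpoint is OpenAI-compatible per the docs, but the model identifier and exact payload shape here are assumptions, not confirmed values:

```python
import json

# Hypothetical payload for an OpenAI-compatible chat completions endpoint.
# reasoning_effort values follow the modes above: omit the field for
# Non-think, "high" for Think High, "max" for Think Max (384K+ context).
payload = {
    "model": "deepseek-v4-flash",  # assumed model identifier
    "messages": [
        {"role": "user", "content": "Summarise the attached contract."}
    ],
    "reasoning_effort": "high",
    "max_tokens": 4096,
}
body = json.dumps(payload)
```

Because thinking tokens bill at the output rate, moving a batch workload from "high" to Non-think reduces cost purely by cutting output volume, with no rate change.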
Where it wins. Where it doesn't.
We are transparent about what V4-Flash does well and where frontier closed models retain an edge.
Strong fit
- High-volume document processing — summarisation, extraction, classification, RAG
- Code generation and review — 79% SWE-bench, within 2 points of frontier
- Long-context document analysis — 1M native context, 90% KV cache compression
- API-compatible drop-in replacement — one line to switch from Anthropic or OpenAI
- Cost-sensitive batch processing — classification, routing, extraction at scale
- Sovereign data processing — MIT licensed, Australian infrastructure, zero offshore transfer
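The "one line to switch" claim above can be sketched with the OpenAI Python SDK's standard environment variables, which existing integrations already read. The base URL below is a placeholder, not the real service endpoint:

```shell
# Point an existing OpenAI-SDK application at the sovereign endpoint.
# Placeholder URL — substitute the real endpoint from your account.
export OPENAI_BASE_URL="https://api.continuum.example/v1"
export OPENAI_API_KEY="your-key-here"
```

No application code changes are needed if the integration uses the SDK defaults; the same swap works in client code via the `base_url` constructor argument.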
Be aware
- Complex agentic workflows — multi-step tool chaining drops 11 points vs Pro on Terminal-Bench
- Expert-level mathematical reasoning — Claude and GPT lead on HMMT and HLE
- Creative and design tasks — formatting and output polish trails Claude
- Multimodal workloads — text only, no image or audio support at this time
- Mission-critical factual precision — knowledge recall trails Gemini on SimpleQA
For the 5–15% of workloads where frontier closed models are materially better, we recommend maintaining a small Anthropic or OpenAI allocation alongside Continuum. The savings on the remaining 85–95% more than fund it.
Agentic capabilities
Agent benchmarks (V4-Pro)
Benchmarks shown for V4-Pro. V4-Flash matches Pro on simple agent tasks but is weaker on complex multi-step workflows.
Tool calling features
- Up to 128 tools per request
- Parallel function calls
- Strict schema validation mode
- Thinking mode with tool calls
- OpenAI-compatible tool format
- $defs / $ref for reusable schemas
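The features above can be combined in a single tool definition. This is a hypothetical example in the OpenAI-compatible tool format the page describes — the tool name and schema are illustrative, not a real API:

```python
import json

# Illustrative tool using strict schema validation and a reusable
# $defs/$ref sub-schema (JSON Schema style).
tool = {
    "type": "function",
    "function": {
        "name": "lookup_entity",  # hypothetical tool name
        "description": "Look up a business by its 11-digit ABN.",
        "strict": True,           # strict schema validation mode
        "parameters": {
            "type": "object",
            "$defs": {
                "abn": {"type": "string", "pattern": "^[0-9]{11}$"}
            },
            "properties": {
                "business_number": {"$ref": "#/$defs/abn"}
            },
            "required": ["business_number"],
            "additionalProperties": False,
        },
    },
}
encoded = json.dumps(tool)
```

With strict mode on, arguments the model emits must validate against the schema exactly; the $defs block lets the same ABN pattern be referenced from multiple tools without duplication.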
Custom fine-tuning
Bring your own fine-tuned weights and run them on our sovereign infrastructure. Domain-specific models for legal, financial services, healthcare, or any vertical where a general-purpose model falls short. Your training data stays yours. Your weights run on dedicated capacity. MIT-licensed base models mean no licensing restrictions on derivative works.
Ready to try it?
See the pricing, read the docs, or get started in five minutes.