Models

Frontier open-weight models deployed on sovereign Australian infrastructure. Benchmarked against the best. Honest about the gaps.

DeepSeek V4-Flash

Available · MIT Licence

Mixture-of-Experts with hybrid compressed attention. Released April 24, 2026.

Total parameters: 284B
Active per token: 13B
Context window: 1,048,576 tokens
Checkpoint size: 160 GB
Architecture: MoE + CSA/HCA
Reasoning modes: 3 levels
Modality: Text only
Status: Preview

Architecture

V4-Flash uses a Mixture-of-Experts architecture that activates only 13B of its 284B total parameters per token via top-2 routing. The hybrid attention mechanism combines Compressed Sparse Attention (CSA) for fine-grained token dependencies with Heavily Compressed Attention (HCA) for broad document-level understanding. At 1M-token context, this requires only 27% of the inference compute and 10% of the KV cache memory compared to previous generation models. The result is frontier-adjacent capability at a fraction of the infrastructure cost.
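
To make the routing concrete, here is a minimal top-2 dispatch sketch in PyTorch. It is a generic illustration of the technique, not DeepSeek's implementation: the function name, shapes, and expert structure are all assumptions.

moe_routing_sketch.py
import torch
import torch.nn.functional as F

def top2_route(x, router_w, experts):
    # x: (tokens, d_model); router_w: (d_model, n_experts)
    # experts: one feed-forward callable per expert
    logits = x @ router_w                  # router score for every expert
    weights, idx = logits.topk(2, dim=-1)  # keep only the 2 best experts per token
    weights = F.softmax(weights, dim=-1)   # normalise the chosen pair
    out = torch.zeros_like(x)
    for slot in range(2):                  # first and second choice
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e       # tokens whose slot-th choice is expert e
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out

Only the two selected experts run for each token, which is why per-token compute tracks the 13B active parameters rather than the 284B total.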

Benchmarks

Compared against frontier closed models on coding, reasoning, and long-context tasks. All figures from the DeepSeek technical report and independent evaluations, April 2026.

Coding

Benchmark          | Measures                   | V4-Flash | V4-Pro | Claude 4.6 | GPT-5.4
SWE-bench Verified | Real-world software eng.   | 79.0%    | 80.6%  | 80.8%      | ~78%
LiveCodeBench      | Code generation accuracy   | 91.6%    | 93.5%  | 88.8%      | n/a
Terminal-Bench 2.0 | Multi-step terminal tasks  | 56.9%    | 67.9%  | 65.4%      | 82.7%
Codeforces Elo     | Competitive programming    | 3,206    | 3,168  | n/a        | n/a

V4-Flash is within 2 points of Claude Opus on SWE-bench Verified. The gap opens on complex multi-step terminal tasks (Terminal-Bench drops 11 points from Pro to Flash). For standard code generation, review, and refactoring, Flash is effectively equivalent to frontier models.

Reasoning and Knowledge

Benchmark         | Measures                | V4-Flash    | V4-Pro      | Claude 4.6 | GPT-5.4
MMLU-Pro          | Broad knowledge         | Strong      | Top tier    | Top tier   | Top tier
GPQA Diamond      | Graduate-level science  | Competitive | Competitive | Strong     | Strong
HMMT 2026         | Competition maths       | 95.2%       | 96.2%       | 97.7%      | n/a
HLE               | Expert cross-domain     | 37.7%       | 40.0%       | 39.8%      | n/a
SimpleQA-Verified | Factual recall          | 57.9%       | n/a         | n/a        | n/a

Closed models retain an edge on the hardest reasoning benchmarks (HLE, HMMT). For standard enterprise analytical tasks — document analysis, summarisation, extraction, report generation — V4-Flash is more than sufficient. The gap matters only at the absolute frontier of mathematical and cross-domain reasoning.

Long Context

Benchmark   | Measures                   | V4-Pro | Claude 4.6 | Gemini 3.1
MRCR 1M     | Retrieval across 1M tokens | 83.5   | 92.9       | 76.3
CorpusQA 1M | Document QA at 1M tokens   | 62.0%  | n/a        | 53.8%

Retrieval accuracy is 94% at 128K tokens, 82% at 512K, and 66% at 1M. For most enterprise workloads operating at 128–256K context (annual reports, legal contracts, compliance documents), retrieval is strong. At extreme lengths, expect some degradation. V4 beats Gemini 3.1 Pro on both long-context benchmarks while Claude Opus leads on MRCR.

Reasoning modes

Three levels of reasoning effort, configurable per request. Trade speed for depth depending on the task.

Non-think

Default mode

Fast, direct responses. Best for classification, extraction, routing, simple Q&A, and high-throughput batch processing.

Lowest token consumption

Think High

reasoning_effort: "high"

Extended chain-of-thought reasoning. Best for analysis, complex coding, research tasks, and multi-step problem solving.

Moderate token overhead

Think Max

reasoning_effort: "max"

Maximum reasoning budget. Best for competitive programming, deep research, and the hardest analytical tasks. Requires 384K+ context allocation.

Highest token consumption
reasoning_modes.py
# OpenAI-compatible SDK pointed at the Continuum endpoint
# (base_url and api_key are placeholders for your own values)
from openai import OpenAI

client = OpenAI(base_url="...", api_key="...")
messages = [{"role": "user", "content": "..."}]

# Non-think (default) — fast, direct
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=messages,
)

# Think High — extended reasoning
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=messages,
    extra_body={
        "thinking": {"type": "enabled"},
        "reasoning_effort": "high",  # "max" for Think Max
    },
)

# Access the reasoning chain separately from the final answer
reasoning = response.choices[0].message.reasoning_content
answer = response.choices[0].message.content

Thinking tokens are billed at the same per-token rate. Think Max generates more output tokens, so cost per request is higher even though the rate is the same.
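
One way to gauge the overhead is the usage block on each response. A small sketch: the field names follow the standard OpenAI-compatible usage object, and how thinking tokens are attributed within it is an assumption worth verifying against your own responses.

usage = response.usage
print(f"prompt: {usage.prompt_tokens}, completion: {usage.completion_tokens}")
# assumption: in thinking modes the completion count includes reasoning tokens,
# which is what drives the higher per-request cost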

Where it wins. Where it doesn't.

We are transparent about what V4-Flash does well and where frontier closed models retain an edge.

Strong fit

  • High-volume document processing — summarisation, extraction, classification, RAG
  • Code generation and review — 79% SWE-bench, within 2 points of frontier
  • Long-context document analysis — 1M native context, 90% KV cache compression
  • API-compatible drop-in replacement — one line to switch from Anthropic or OpenAI (see the sketch after this list)
  • Cost-sensitive batch processing — classification, routing, extraction at scale
  • Sovereign data processing — MIT licensed, Australian infrastructure, zero offshore transfer
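
The drop-in switch itself, sketched with the OpenAI Python SDK (the endpoint URL and key are placeholders, not real values):

from openai import OpenAI

# before: OpenAI's hosted API
client = OpenAI()
# after: same SDK, same calls; only the endpoint and key change
client = OpenAI(base_url="https://<your-continuum-endpoint>/v1", api_key="...")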

Be aware

  • Complex agentic workflows — multi-step tool chaining drops 11 points vs Pro on Terminal-Bench
  • Expert-level mathematical reasoning — Claude and GPT lead on HMMT and HLE
  • Creative and design tasks — formatting and output polish trails Claude
  • Multimodal workloads — text only, no image or audio support at this time
  • Mission-critical factual precision — knowledge recall trails Gemini on SimpleQA

For the 5–15% of workloads where frontier closed models are materially better, we recommend maintaining a small Anthropic or OpenAI allocation alongside Continuum. The savings on the other 85–95% more than fund it.

Agentic capabilities

Agent benchmarks (V4-Pro)

SWE-bench Verified (real-world software engineering): 80.6%
MCPAtlas (external tool and MCP service usage): 73.6%
Toolathlon (diverse tool generalisation): 51.8%
Terminal-Bench 2.0 (multi-step terminal automation): 67.9%

Benchmarks shown for V4-Pro. V4-Flash matches Pro on simple agent tasks but is weaker on complex multi-step workflows.

Tool calling features

  • Up to 128 tools per request
  • Parallel function calls
  • Strict schema validation mode
  • Thinking mode with tool calls
  • OpenAI-compatible tool format
  • $defs / $ref for reusable schemas (see the sketch below)
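
A minimal sketch of these features together, using the OpenAI-compatible tools format. The get_fx_rate tool, its schema, and the endpoint values are illustrative assumptions, not part of the documented API.

tool_calling_sketch.py
from openai import OpenAI

client = OpenAI(base_url="...", api_key="...")  # placeholders for your endpoint and key

# One function tool with strict validation and a reusable sub-schema
tools = [{
    "type": "function",
    "function": {
        "name": "get_fx_rate",  # illustrative tool, not a built-in
        "description": "Look up the exchange rate for a currency pair",
        "strict": True,  # strict schema validation mode
        "parameters": {
            "type": "object",
            "$defs": {  # defined once, referenced twice below
                "currency": {"type": "string", "description": "ISO 4217 code, e.g. AUD"}
            },
            "properties": {
                "base": {"$ref": "#/$defs/currency"},
                "quote": {"$ref": "#/$defs/currency"}
            },
            "required": ["base", "quote"],
            "additionalProperties": False
        }
    }
}]

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "What are AUD/USD and AUD/EUR right now?"}],
    tools=tools,
    parallel_tool_calls=True,  # allow several calls in one assistant turn
)

for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)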

Custom fine-tuning

Bring your own fine-tuned weights and run them on our sovereign infrastructure. Domain-specific models for legal, financial services, healthcare, or any vertical where a general-purpose model falls short. Your training data stays yours. Your weights run on dedicated capacity. MIT-licensed base models mean no licensing restrictions on derivative works.

  • LoRA and full fine-tuning supported
  • Dedicated GPU allocation
  • Your weights, your IP
  • vLLM-compatible deployment
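
As a rough sketch of what vLLM-compatible deployment looks like with a LoRA adapter: the model name and adapter path are illustrative, and serving details will depend on your dedicated allocation.

lora_deployment_sketch.py
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Base checkpoint plus a domain-specific adapter (names and paths illustrative)
llm = LLM(model="deepseek-v4-flash", enable_lora=True)
outputs = llm.generate(
    ["Summarise the indemnity clause in plain English: ..."],
    SamplingParams(max_tokens=256),
    lora_request=LoRARequest("legal-ft", 1, "/path/to/adapter"),
)
print(outputs[0].outputs[0].text)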
Contact us to discuss your requirements

Ready to try it?

See the pricing, read the docs, or get started in five minutes.