Models

Frontier open-weight models deployed on sovereign Australian infrastructure. Benchmarked against the best. Honest about the gaps.

DeepSeek V4-Flash

AvailableMIT Licence

Mixture-of-Experts with hybrid compressed attention. Released April 24, 2026.

284B

Total Parameters

13B

Active per Token

1,048,576

Context Window

160 GB

Checkpoint Size

MoE + CSA/HCA

Architecture

3 levels

Reasoning Modes

Text only

Modality

Preview

Status

Architecture

V4-Flash uses a Mixture-of-Experts architecture that activates only 13B of its 284B total parameters per token via top-2 routing. The hybrid attention mechanism combines Compressed Sparse Attention (CSA) for fine-grained token dependencies with Heavily Compressed Attention (HCA) for broad document-level understanding. At 1M-token context, this requires only 27% of the inference compute and 10% of the KV cache memory compared to previous generation models. The result is frontier-adjacent capability at a fraction of the infrastructure cost.

Benchmarks

Compared against frontier closed models on coding, reasoning, and long-context tasks. All figures from the DeepSeek technical report and independent evaluations, April 2026.

Coding

Benchmark	Measures	V4-Flash	V4-Pro	Claude 4.6	GPT-5.4
SWE-bench Verified	Real-world software eng.	79.0%	80.6%	80.8%	~78%
LiveCodeBench	Code generation accuracy	91.6%	93.5%	88.8%	—
Terminal-Bench 2.0	Multi-step terminal	56.9%	67.9%	65.4%	82.7%
Codeforces Elo	Competitive programming	—	3,206	—	3,168

V4-Flash is within 2 points of Claude Opus on SWE-bench Verified. The gap opens on complex multi-step terminal tasks (Terminal-Bench drops 11 points from Pro to Flash). For standard code generation, review, and refactoring, Flash is effectively equivalent to frontier models.

Reasoning and Knowledge

Benchmark	Measures	V4-Flash	V4-Pro	Claude 4.6	GPT-5.4
MMLU-Pro	Broad knowledge	Strong	Top tier	Top tier	Top tier
GPQA Diamond	Graduate-level science	Competitive	Competitive	Strong	Strong
HMMT 2026	Competition maths	—	95.2%	96.2%	97.7%
HLE	Expert cross-domain	—	37.7%	40.0%	39.8%
SimpleQA-Verified	Factual recall	—	57.9%	—	—

Closed models retain an edge on the hardest reasoning benchmarks (HLE, HMMT). For standard enterprise analytical tasks — document analysis, summarisation, extraction, report generation — V4-Flash is more than sufficient. The gap matters only at the absolute frontier of mathematical and cross-domain reasoning.

Long Context

Benchmark	Measures	V4-Pro	Claude 4.6	Gemini 3.1
MRCR 1M	Retrieval across 1M tokens	83.5	92.9	76.3
CorpusQA 1M	Document QA at 1M tokens	62.0%	—	53.8%

Retrieval accuracy is 94% at 128K tokens, 82% at 512K, and 66% at 1M. For most enterprise workloads operating at 128–256K context (annual reports, legal contracts, compliance documents), retrieval is strong. At extreme lengths, expect some degradation. V4 beats Gemini 3.1 Pro on both long-context benchmarks while Claude Opus leads on MRCR.

Reasoning modes

Three levels of reasoning effort, configurable per request. Trade speed for depth depending on the task.

Non-think

Default mode

Fast, direct responses. Best for classification, extraction, routing, simple Q&A, and high-throughput batch processing.

Lowest token consumption

Think High

reasoning_effort: "high"

Extended chain-of-thought reasoning. Best for analysis, complex coding, research tasks, and multi-step problem solving.

Moderate token overhead

Think Max

reasoning_effort: "max"

Maximum reasoning budget. Best for competitive programming, deep research, and the hardest analytical tasks. Requires 384K+ context allocation.

Highest token consumption

reasoning_modes.py

# Non-think (default) — fast, direct

response = client.chat.completions.create(

model="deepseek-v4-flash",

messages=messages

)

# Think High — extended reasoning

response = client.chat.completions.create(

model="deepseek-v4-flash",

messages=messages,

extra_body={

"thinking": {"type": "enabled"},

"reasoning_effort": "high"

}

)

# Access the reasoning chain

reasoning = response.choices[0].message.reasoning_content

answer = response.choices[0].message.content

Thinking tokens are billed at the same per-token rate. Think Max generates more output tokens, so cost per request is higher even though the rate is the same.

Where it wins. Where it doesn't.

We are transparent about what V4-Flash does well and where frontier closed models retain an edge.

Strong fit

High-volume document processing — summarisation, extraction, classification, RAG
Code generation and review — 79% SWE-bench, within 2 points of frontier
Long-context document analysis — 1M native context, 90% KV cache compression
API-compatible drop-in replacement — one line to switch from Anthropic or OpenAI
Cost-sensitive batch processing — classification, routing, extraction at scale
Sovereign data processing — MIT licensed, Australian infrastructure, zero offshore transfer

Be aware

Complex agentic workflows — multi-step tool chaining drops 11 points vs Pro on Terminal-Bench
Expert-level mathematical reasoning — Claude and GPT lead on HMMT and HLE
Creative and design tasks — formatting and output polish trails Claude
Multimodal workloads — text only, no image or audio support at this time
Mission-critical factual precision — knowledge recall trails Gemini on SimpleQA

For the 5–15% of workloads where frontier closed models are materially better, we recommend maintaining a small Anthropic or OpenAI allocation alongside Continuum. The savings on the other 85% more than fund it.

Agentic capabilities

Agent benchmarks (V4-Pro)

SWE-bench Verified— Real-world software engineering

80.6%

MCPAtlas— External tool and MCP service usage

73.6%

Toolathlon— Diverse tool generalisation

51.8%

Terminal-Bench 2.0— Multi-step terminal automation

67.9%

Benchmarks shown for V4-Pro. V4-Flash matches Pro on simple agent tasks but is weaker on complex multi-step workflows.

Tool calling features

Up to 128 tools per request
Parallel function calls
Strict schema validation mode
Thinking mode with tool calls
OpenAI-compatible tool format
$def / $ref for reusable schemas

Custom fine-tuning

Bring your own fine-tuned weights and run them on our sovereign infrastructure. Domain-specific models for legal, financial services, healthcare, or any vertical where a general-purpose model falls short. Your training data stays yours. Your weights run on dedicated capacity. MIT-licensed base models mean no licensing restrictions on derivative works.

LoRA and full fine-tuning supportedDedicated GPU allocationYour weights, your IPvLLM-compatible deployment

Ready to try it?

See the pricing, read the docs, or get started in five minutes.

View Pricing Read the Docs