Thinking modes

Three levels of reasoning effort. Trade speed for depth on a per-request basis. Same model, different reasoning budgets.

Choose your reasoning level

Non-think

Default (no extra params)

Fast, direct responses. No chain-of-thought reasoning. Best for classification, extraction, routing, simple Q&A, and high-throughput batch processing.

Use when: Speed matters more than reasoning depth

Lowest token consumption

Think High

reasoning_effort: "high"

Extended chain-of-thought reasoning before generating a response. Best for analysis, complex coding, research tasks, and multi-step problem solving.

Use when: The task requires logical analysis

Moderate token overhead

Think Max

reasoning_effort: "max"

Maximum reasoning budget. The model will think extensively before answering. Best for competitive programming, deep research, and the hardest analytical problems.

Use when: Quality matters more than speed or cost

Highest token consumption

Enabling thinking mode

Pass the thinking and reasoning_effort parameters in the request body; these are not standard arguments in the OpenAI SDK, so supply them via extra_body as shown below. When thinking is enabled, the model generates a chain-of-thought before producing the final answer.

thinking_mode.py
from openai import OpenAI

client = OpenAI(
    api_key="your_continuum_key",
    base_url="https://api.continuum.au/v1"
)

# Non-think (default) — no extra parameters needed
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "What is 2 + 2?"}]
)

# Think High — extended reasoning
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Analyse this contract..."}],
    extra_body={
        "thinking": {"type": "enabled"},
        "reasoning_effort": "high"
    }
)

# Think Max — maximum reasoning budget
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Solve this optimisation..."}],
    extra_body={
        "thinking": {"type": "enabled"},
        "reasoning_effort": "max"
    }
)

Accessing the reasoning chain

When thinking mode is enabled, the response includes a reasoning_content field alongside the standard content field. The reasoning chain shows how the model arrived at its answer.

access_reasoning.py
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Is 9.11 greater than 9.8?"}],
    extra_body={
        "thinking": {"type": "enabled"},
        "reasoning_effort": "high"
    }
)

message = response.choices[0].message

# The reasoning chain (how the model thought about it)
print("Reasoning:", message.reasoning_content)
# → "Let me compare 9.11 and 9.8. Converting to same decimal places:
#    9.11 vs 9.80. Since 9.80 > 9.11, 9.8 is greater than 9.11."

# The final answer
print("Answer:", message.content)
# → "No, 9.8 is greater than 9.11."

Multi-turn conversations

How you handle reasoning_content in multi-turn conversations depends on whether the turn involved tool calls.

Turns without tool calls

The reasoning_content from the assistant message does not need to be passed back. If you include it, the API will ignore it. You can safely strip it from the conversation history.
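
If you prefer to trim it explicitly before re-sending history, here is a minimal sketch; the strip_reasoning helper is illustrative, not part of any SDK:

strip_reasoning.py
# Illustrative helper: rebuild the assistant message without reasoning_content.
# Only safe for turns that contain no tool calls.
def strip_reasoning(message):
    return {"role": message.role, "content": message.content}

messages.append(strip_reasoning(response.choices[0].message))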

Turns with tool calls

The reasoning_content must be passed back in subsequent requests. If your code strips it, the API returns a 400 error. A tool-call sketch follows the example below.

multi_turn_thinking.py
# Turn 1: User asks a question
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=messages,
    extra_body={
        "thinking": {"type": "enabled"},
        "reasoning_effort": "high"
    }
)

# Turn 2: Continue the conversation
# reasoning_content is ignored for non-tool-call turns
messages.append(response.choices[0].message)
messages.append({
    "role": "user",
    "content": "How many Rs are there in the word 'strawberry'?"
})

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=messages,
    extra_body={
        "thinking": {"type": "enabled"},
        "reasoning_effort": "high"
    }
)
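
For turns that do involve tool calls, append the assistant message unmodified so reasoning_content survives the round trip. The sketch below assumes a hypothetical get_weather tool and a hand-rolled tool result:

tool_call_thinking.py
import json

# Hypothetical tool definition for illustration
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
        }
    }
}]

messages = [{"role": "user", "content": "What's the weather in Perth?"}]

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=messages,
    tools=tools,
    extra_body={
        "thinking": {"type": "enabled"},
        "reasoning_effort": "high"
    }
)

assistant = response.choices[0].message

# Append the message whole: stripping reasoning_content from a
# tool-call turn makes the follow-up request fail with a 400
messages.append(assistant)

# Execute the tool call and return a (hypothetical) result
call = assistant.tool_calls[0]
messages.append({
    "role": "tool",
    "tool_call_id": call.id,
    "content": json.dumps({"city": "Perth", "temp_c": 24})
})

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=messages,
    tools=tools,
    extra_body={
        "thinking": {"type": "enabled"},
        "reasoning_effort": "high"
    }
)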

Token billing

Thinking tokens are billed at the same per-token rate as output tokens. The difference is volume: Think High and Think Max generate additional reasoning tokens that increase the total output.

Approximate token multipliers

Mode         Multiplier   What is billed
Non-think    1x           Output tokens only
Think High   2-5x         Reasoning + output tokens
Think Max    5-20x        Extensive reasoning + output

Exact multipliers vary by task complexity. Monitor usage.completion_tokens in responses to track reasoning token consumption.
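
A sketch of that monitoring, assuming the standard OpenAI-style usage object on the response:

track_usage.py
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
    extra_body={
        "thinking": {"type": "enabled"},
        "reasoning_effort": "max"
    }
)

# completion_tokens covers reasoning and final-answer tokens combined,
# so it is the figure the multipliers above apply to
usage = response.usage
print("Prompt tokens:    ", usage.prompt_tokens)
print("Completion tokens:", usage.completion_tokens)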

Parameter constraints

Certain parameters are not available when thinking mode is enabled. Setting them will not produce an error but will have no effect.

Parameter           Non-think   Think High/Max
temperature         ✓           ✗ (ignored)
top_p               ✓           ✗ (ignored)
frequency_penalty   ✓           ✗ (ignored)
presence_penalty    ✓           ✗ (ignored)
max_tokens          ✓           ✓
tools               ✓           ✓
response_format     ✓           ✓
stream              ✓           ✓

Think Max context: DeepSeek recommends allocating at least 384K tokens of context window for Think Max mode, so the model has sufficient space for extended reasoning chains. The model automatically manages its reasoning budget within the available context.
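
For example, a request that sets temperature with thinking enabled succeeds, but the value is silently ignored; a sketch of the behaviour described in the table above:

ignored_params.py
# Sampling parameters are accepted but have no effect once thinking
# is enabled; no error is raised
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Summarise this report..."}],
    temperature=0.2,   # ignored in Think High/Max
    max_tokens=4096,   # still honoured (see table above)
    extra_body={
        "thinking": {"type": "enabled"},
        "reasoning_effort": "high"
    }
)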

Start reasoning

Add thinking mode to your existing requests with two parameters; no other code changes are required.