Model Module
Overview
The Model module is the large language model abstraction layer of the tRPC-Agent-Go framework, providing a unified LLM interface design that currently supports OpenAI-compatible and Anthropic-compatible API calls. Through standardized interface design, developers can flexibly switch between different model providers, achieving seamless model integration and invocation. This module has been verified to be compatible with most OpenAI-like interfaces both inside and outside the company.
The Model module has the following core features:
- Unified Interface Abstraction: Provides standardized
Modelinterface, shielding differences between model providers - Streaming Response Support: Native support for streaming output, enabling real-time interactive experience
- Multimodal Capabilities: Supports text, image, audio, and other multimodal content processing
- Complete Error Handling: Provides dual-layer error handling mechanism, distinguishing between system errors and API errors
- Extensible Configuration: Supports rich custom configuration options to meet different scenario requirements
Quick Start
Using Model in Agent
Example code is located at examples/runner
Usage Methods and Platform Integration Guide
The Model module supports multiple usage methods and platform integration. The following are common usage scenarios based on Runner examples:
Quick Start
Platform Integration Configuration
All platform integration methods follow the same pattern, only requiring configuration of different environment variables or direct setting in code:
Environment Variable Method (Recommended):
Code Method:
Supported Platforms and Their Configuration
The following are configuration examples for each platform, divided into environment variable configuration and code configuration methods:
Environment Variable Configuration
The runner example supports specifying model names through command line parameters (-model), which is actually passing the model name when calling openai.New().
Code Configuration Method
Configuration method when directly using Model in your own code:
Core Interface Design
Model Interface
Channel-based streaming typically requires a dedicated goroutine and incurs channel synchronization on each chunk. In high-frequency streaming, this overhead can become a measurable cost. IterModel is an optional iterator-style API that streams responses synchronously in the caller goroutine to reduce this overhead.
When a model implements IterModel, the framework uses GenerateContentIter; otherwise it uses GenerateContent. Implementations must call yield sequentially and stop when it returns false.
Request Structure
GenerationConfig itself is just a plain struct, and its zero value
means Stream=false. For LLMAgent, if you omit
llmagent.WithGenerationConfig(...), the framework forwards that zero
value as-is, so the default behavior is non-streaming. Higher-level
wrappers can still choose different semantics by setting their own
explicit defaults.
Response Structure
For OpenAI-compatible providers, completion_tokens_details.reasoning_tokens is mapped to Usage.CompletionTokensDetails.ReasoningTokens. The value may be 0 when the provider does not spend or report reasoning tokens; for reasoning models, set ReasoningEffort and/or ThinkingEnabled when you want to request reasoning behavior.
OpenAI Model
Model Name Parameter
When creating an OpenAI model instance using openai.New(name string, opts ...Option), the first parameter is the actual model name that gets sent to the OpenAI API, as the specific model identifier that tells the API which language model to use.
Since the framework supports different models compatible with the OpenAI API, you can obtain the base URL, API key, and model name from various model providers:
1. OpenAI Official
- Base URL:
https://api.openai.com/v1 - Model Names:
gpt-4o,gpt-4o-mini, etc.
2. DeepSeek
- Base URL:
https://api.deepseek.com - Model Names:
deepseek-v4-flash,deepseek-v4-pro
deepseek-chat and deepseek-reasoner are deprecated compatibility aliases;
prefer the explicit DeepSeek v4 model names for new code.
3. Tencent Hunyuan
- Base URL:
https://api.hunyuan.cloud.tencent.com/v1 - Model Names:
hunyuan-2.0-thinking-20251109,hunyuan-2.0-instruct-20251111, etc.
4. Other Providers
- Qwen: Base URL
https://dashscope.aliyuncs.com/compatible-mode/v1, Model Names: various qwen models
The OpenAI Model is used to interface with OpenAI and its compatible platforms. It supports streaming output, multimodal and advanced parameter configuration, and provides rich callback mechanisms, batch processing and retry capabilities. It also allows for flexible setting of custom HTTP headers.
Configuration Method
Environment Variable Method
Code Method
Direct Model Usage
Structured Output
When calling model.GenerateContent directly, use model.NewRequest and
model.WithStructuredOutputJSON to generate a JSON schema from a Go struct and
pass it to model adapters that support provider-native structured output.
WithStructuredOutputJSON only configures the model request. Direct model
callers still receive normal model.Response values and should unmarshal the
final JSON content themselves. If you already have a hand-written schema, set
request.StructuredOutput directly.
Streaming Output
Advanced Parameter Configuration
Multimodal Content
Advanced Features
1. Callback Functions
Dynamically Modifying Request Body via Callback
WithChatRequestCallback receives a *openai.ChatCompletionNewParams pointer,
allowing you to dynamically add or modify fields in the HTTP request body
before each request is sent. Use SetExtraFields on the params to inject
custom JSON fields:
Difference between request-body customization approaches:
| Aspect | WithExtraFields |
agent.WithModelRequestExtraFields |
SetExtraFields in WithChatRequestCallback |
|---|---|---|---|
| Timing | Set once at model creation | Set on one runner.Run(...) call |
Called before every model request |
| Dynamism | Static values only | Dynamic values known at run time | Dynamic values based on ctx or runtime state |
| Mechanism | Injected via openaiopt.WithJSONSet (RequestOption layer) |
Copied into model.Request.ExtraFields, then injected by supported adapters |
Set on the ChatCompletionNewParams struct (serialization layer) |
| Same key conflict | Overwritten by request-level extra fields | Takes precedence over model-level extra fields | Overwritten by RequestOption-layer extra fields if the same key exists |
When multiple approaches use different keys, all fields appear in the final
JSON body without conflict. When they set the same key, request-level extra
fields passed with agent.WithModelRequestExtraFields take precedence in
OpenAI-compatible adapters.
Request-scoped extra fields from Runner
Use agent.WithModelRequestExtraFields when a provider-specific top-level
request body field should vary per runner.Run(...) call, for example
prompt_cache_key on OpenAI-compatible endpoints:
When a provider requires both a request body cache key and a routing header for the same conversation, combine this with request-scoped headers:
This option applies to every model call created during that run, including normal LLM agents, graph LLM nodes, and requests routed through failover or hedge models. The built-in OpenAI and HuggingFace/OpenAI-compatible adapters merge these fields into the top-level JSON request body. Other provider adapters ignore the field unless they add an explicit provider-specific mapping.
2. Model Switching
Model switching allows dynamically changing the LLM model used by an Agent at runtime. This section shows static switching for OpenAI and OpenAI-compatible model instances at the agent level and per-request level. To select a model separately for each LLM call within the same runner.Run(...), use ModelSelector.
Agent-level Switching
Agent-level switching changes the Agent's default model, affecting all subsequent requests.
Approach 1: Direct Model Instance
Set the model directly by passing a model instance to SetModel:
Use Cases:
Approach 2: Switch by Name
Pre-register multiple models with WithModels, then switch by name using SetModelByName:
Use Cases:
Per-request Switching
Per-request switching allows temporarily specifying a model for a single request without affecting the Agent's default model or other requests. This is useful for scenarios where different models are needed for specific tasks.
Approach 1: Using WithModel Option
Use agent.WithModel to specify a model instance for a single request:
Approach 2: Using WithModelName Option (Recommended)
Use agent.WithModelName to specify a pre-registered model name for a single request:
Use Cases:
Configuration Details
WithModels Option:
- Accepts a
map[string]model.Modelwhere key is the model name and value is the model instance - If both
WithModelandWithModelsare set,WithModelspecifies the initial model - If only
WithModelsis set, the first model in the map will be used as the initial model (note: map iteration order is not guaranteed, so it's recommended to explicitly specify the initial model) - Reserved name:
__default__is used internally by the framework and should not be used
SetModelByName Method:
- Parameter: model name (string)
- Returns: error if the model name is not found
- The model must be pre-registered via
WithModels
Per-request Options:
agent.RunOptions.Model: Directly specify a model instanceagent.RunOptions.ModelName: Specify a pre-registered model nameagent.RunOptions.Stream: Override whether responses are streamed (useagent.WithStream(...))agent.RunOptions.Instruction: Override instruction for this request only (useagent.WithInstruction(...))agent.RunOptions.GlobalInstruction: Override global instruction (system prompt) for this request only (useagent.WithGlobalInstruction(...))- Priority:
Model>ModelName> Agent default model - If the model specified by
ModelNameis not found, it falls back to the Agent's default model
You can set streaming per request using agent.WithStream(true) or
agent.WithStream(false).
Agent-level vs Per-request Comparison
| Feature | Agent-level Switching | Per-request Switching |
|---|---|---|
| Scope | All subsequent requests | Current request only |
| Usage | SetModel/SetModelByName |
RunOptions.Model/ModelName |
| State Change | Changes Agent default model | Does not change Agent state |
| Use Case | Global strategy adjustment | Specific task temporary needs |
| Concurrency | Affects all concurrent reqs | Does not affect other requests |
| Typical Examples | User tier, time-based policy | Complex queries, reasoning |
Agent-level Approach Comparison
| Feature | SetModel | SetModelByName |
|---|---|---|
| Usage | Pass model instance | Pass model name |
| Pre-registration | Not required | Required via WithModels |
| Error Handling | None | Returns error |
| Use Case | Simple switching | Complex scenarios, multi-model management |
| Code Maintenance | Need to hold model instances | Only need to remember names |
Important Notes
Agent-level Switching:
- Immediate Effect: After calling
SetModelorSetModelByName, the next request immediately uses the new model - Session Persistence: Switching models does not clear session history
- Independent Configuration: Each model retains its own configuration (temperature, max tokens, etc.)
- Concurrency Safe: Both switching approaches are concurrency-safe
Per-request Switching:
- Temporary Override: Only affects the current request, does not change the Agent's default model
- Higher Priority: Per-request model settings take precedence over the Agent's default model
- No Side Effects: Does not affect other concurrent requests or subsequent requests
- Flexible Combination: Can be used in combination with agent-level switching
Model-specific Prompts (LLMAgent):
- Use
llmagent.WithModelInstructions/llmagent.WithModelGlobalInstructionsto override prompts bymodel.Info().Namewhen the Agent switches models; it falls back to the Agent defaults when no mapping exists. - For a runnable example, see examples/model/promptmap.
Usage Example
For a complete interactive example, see examples/model/switch, which demonstrates both agent-level and per-request switching approaches.
3. Batch Processing (Batch API)
Batch API is an asynchronous batch processing technique for efficiently handling large volumes of requests. This feature is particularly suitable for scenarios requiring large-scale data processing, significantly reducing costs and improving processing efficiency.
Core Features
- Asynchronous Processing: Batch requests are processed asynchronously without waiting for immediate responses
- Cost Optimization: Typically more cost-effective than individual requests
- Flexible Input: Supports both inline requests and file-based input
- Complete Management: Provides full operations including create, retrieve, cancel, and list
- Result Parsing: Automatically downloads and parses batch processing results
Quick Start
Creating a Batch Job:
Batch Operations
Retrieving Batch Status:
Downloading and Parsing Results:
Canceling a Batch Job:
Listing Batch Jobs:
Configuration Options
Global Configuration:
Request-level Configuration:
How It Works
Batch API execution flow:
Key design:
- CustomID Uniqueness: Each request must have a unique CustomID for matching input/output
- JSONL Format: Batch processing uses JSONL (JSON Lines) format for storing requests and responses
- Asynchronous Processing: Batch jobs execute asynchronously in the background without blocking main flow
- Completion Window: Configurable completion time window for batch processing (e.g., 24h)
Use Cases
- Large-scale Data Processing: Processing thousands or tens of thousands of requests
- Offline Analysis: Non-real-time data analysis and processing tasks
- Cost Optimization: Batch processing is typically more economical than individual requests
- Scheduled Tasks: Regularly executed batch processing jobs
Usage Example
For a complete interactive example, see examples/model/batch.
4. Retry Mechanism
The retry mechanism is an automatic error recovery technique that automatically retries failed requests. This feature is provided by the underlying OpenAI SDK, with the framework passing retry parameters to the SDK through configuration options.
Timeouts and deadlines
Request lifecycle is bounded by two independent limits:
- The caller context deadline (for example, Runner max duration, or
context.WithTimeout). - The OpenAI request timeout configured by
openaiopt.WithRequestTimeout.
The effective budget is the earlier one:
- effective_deadline = min(ctx_deadline, request_timeout)
Important notes:
github.com/openai/openai-godoes not hardcode timeout by default. If you observe timeout in logs, it typically comes from an upstream deadline (gateway/caller context) or from your ownWithRequestTimeoutconfiguration.- If you expect long-running calls (streaming, large prompts, tools, or reasoning models), configure
WithRequestTimeoutto match your service deadline and service level objective (SLO).
Core Features
- Automatic Retry: SDK automatically handles retryable errors
- Smart Backoff: Follows API's
Retry-Afterheaders or uses exponential backoff - Configurable: Supports custom maximum retry count and timeout duration
- Zero Maintenance: No custom retry logic needed, handled by mature SDK
Quick Start
Basic Configuration:
Retryable Errors
The OpenAI SDK automatically retries the following errors:
- 408 Request Timeout: Request timeout
- 409 Conflict: Conflict error
- 429 Too Many Requests: Rate limiting
- 500+ Server Errors: Internal server errors (5xx)
- Network Connection Errors: No response or connection failure
Note: SDK default maximum retry count is 2.
Retry Strategies
Standard Retry:
Rate Limiting Optimization:
Fast Fail:
How It Works
Retry mechanism execution flow:
Key design:
- SDK-level Implementation: Retry logic is completely handled by OpenAI SDK
- Configuration Pass-through: Framework passes configuration via
WithOpenAIOptions - Smart Backoff: Prioritizes using
Retry-Afterheader returned by API - Transparent Handling: Transparent to application layer, no additional code needed
Use Cases
- Production Environment: Improve service reliability and fault tolerance
- Rate Limiting: Automatically handle 429 errors
- Network Instability: Handle temporary network failures
- Server Errors: Handle temporary server-side issues
Important Notes
- No Framework Retry: Framework itself does not implement retry logic
- Client-level Retry: All retry is handled by OpenAI client
- Configuration Pass-through: Use
WithOpenAIOptionsto configure retry behavior - Automatic Handling: Rate limiting (429) is automatically handled without additional code
Usage Example
For a complete interactive example, see examples/model/retry.
5. Custom HTTP Headers
In some enterprise or proxy scenarios, the model provider requires additional HTTP headers (for example, organization ID, tenant routing, or custom authentication). The Model module supports setting headers in several reliable ways.
Recommended order:
- Global header via
openai.WithHeaders(simplest for static headers) - Request-scoped header via
agent.WithModelRequestHeaders(forrunner.Run(...)calls whose headers vary by user, session, or tenant) - Global header via OpenAI RequestOption (flexible, middleware-friendly)
- Custom
http.RoundTripper(advanced, cross-cutting)
All methods affect streaming too because the same client is used for
New and NewStreaming calls.
1. Using openai.WithHeaders for headers
2. Request-scoped headers from Runner
Use agent.WithModelRequestHeaders when a header should vary per
runner.Run(...) call, for example X-Session-ID on providers that route
a conversation to the same inference instance. The OpenAI adapter merges
these headers into the HTTP request after model-level headers, so request-level
values take precedence on the same key.
3. Global headers using OpenAI RequestOption
Use WithOpenAIOptions with openaiopt.WithHeader or
openaiopt.WithMiddleware to inject headers for every request created
by the underlying OpenAI client.
For complex logic, middleware lets you modify headers conditionally (for example, by URL path or context values):
Notes for authentication variants:
- OpenAI style: keep
openai.WithAPIKey("sk-...")which setsAuthorization: Bearer ...under the hood. - Azure/OpenAI‑compatible that use
api-key: omitWithAPIKeyand setopenaiopt.WithHeader("api-key", "<key>")instead.
Logging raw HTTP request and response
You can use openaiopt.WithMiddleware to log the underlying HTTP request and
response. Be careful about secrets (API keys, Authorization headers) and body
consumption.
Key points:
- Reading
req.Bodyorresp.Bodyconsumes the stream, so you must restore it. - Do not read
resp.Bodyfor streaming responses (for example,Content-Type: text/event-stream); skip body logging to avoid blocking or breaking the stream.
4. Custom http.RoundTripper (advanced)
Inject headers across all requests at the HTTP layer by wrapping the transport. This is useful when you also need custom proxy, TLS, or metrics logic.
Per-request headers
- Agent/Runner passes
ctxthrough to the model call; middleware can read values fromreq.Context()to inject per-invocation headers. - Chat completion per-request base URL override is not exposed; create a
second model with a different base URL or alter
r.URLin middleware.
6. Token Counter
The Token Counter is used to estimate the token count of text content. The framework provides SimpleTokenCounter as the default implementation, supporting custom token counting logic to meet different scenario requirements.
SimpleTokenCounter
SimpleTokenCounter provides a very rough token estimation based on UTF-8 character count. Its features include:
Core Features:
- Lightweight: Does not depend on external tokenizers, only uses character length estimation
- Model-agnostic: Suitable for all OpenAI-compatible models
- Multimodal Support: Supports message content, reasoning content (ReasoningContent), and tool calls
Estimation Principle:
Uses heuristic rules: approximately N UTF-8 characters per token, where N can be configured via WithApproxRunesPerToken. N is a divisor, not a multiplier:
Therefore, WithApproxRunesPerToken(1.5) means approximately 1.5 characters per token. Passing 2.0/3.0 means approximately 0.67 characters per token, which is about 1.5 tokens per character.
Usage:
Parameter Description:
WithApproxRunesPerToken(v float64)
- Purpose: Set the approximate number of characters per token
- Type: float64
- Default Value: 4.0 (approx. 4 characters per token, suitable for English scenarios)
- Value Constraint: Values <= 0 will be ignored, keeping the default value
- Formula: estimated tokens = counted UTF-8 characters /
v; for example,v=1.5means approximately1.5characters per token
Recommended Values for Common Languages:
| Language/Scenario | Recommended Value | Description |
|---|---|---|
| English text | 4.0 | Default value, suitable for English words and common Latin characters |
| Chinese text | 1.6-1.8 | Chinese characters have higher token density than English, use smaller values |
| Japanese text | 2.0-2.5 | Japanese character token density is between Chinese and English |
| Mixed text | 3.0 | Trade-off value for mixed Chinese/English scenarios |
Important Notes:
- Empirical Values: The recommended values above are empirical estimates. Actual token count depends on the specific model's tokenizer implementation.
- Uncertainty: If you're uncertain about language mix or model characteristics, keep the default value (4.0).
- Model Differences: Different model vendors (OpenAI, DeepSeek, Hunyuan) have different tokenizer implementations, which may require adjusting this parameter.
- Accuracy Consideration:
SimpleTokenCounteronly provides rough estimation. If you need accurate token counting, consider using the model provider's tokenizer API (e.g., OpenAI's tiktoken).
Custom Token Counter
If SimpleTokenCounter's rough estimation doesn't meet your requirements, you can implement a custom TokenCounter interface:
Using Token Counter in Summarizer
SimpleTokenCounter can also be used with the session/summary module to control summary triggering:
7. Token Tailoring
Token Tailoring is an intelligent message management technique designed to automatically trim messages when they exceed the model's context window limits, ensuring requests can be successfully sent to the LLM API. This feature is particularly useful for long conversation scenarios, allowing you to keep the message list within the model's token limits while preserving key context.
Automatic Mode (Recommended):
Advanced Mode:
Token Calculation Formula:
The framework automatically calculates maxInputTokens based on the model's
context window. For OpenAI-compatible models, the automatic budget also accounts
for request-scoped output limits and tool definitions:
Context Window Registration
Token Tailoring and session summary
WithContextThresholdboth need a model context window. Built-in model names are resolved automatically. Token Tailoring uses a 128000-token fallback for unknown model names. For private deployments, tenant-provided models, endpoint IDs, or smaller unknown models, prefer model-instance configuration such asopenai.WithContextWindow(32768)or the unifiedprovider.WithContextWindow(32768). For one-off runs, useagent.WithModelContextWindow(32768). Usemodel.RegisterModelContextWindow("my-model", 32768)only when the name has a stable process-wide meaning. See the Session Summary documentation for a full example.
For example, "gpt-4o" (contextWindow = 128000):
If WithMaxInputTokens is set explicitly, the framework keeps that value as the
configured message budget and does not subtract the estimated Tools schema
budget from it. The explicit value is still clamped to the context-safe hard
budget after output reserves and safety margin are applied.
Default Budget Parameters:
The framework uses the following default values for token allocation (it is recommended to keep the defaults):
- Protocol Overhead (ProtocolOverheadTokens): 512 tokens - reserved for request/response formatting
- Output Reserve (ReserveOutputTokens): 2048 tokens - minimum reserve for output generation; OpenAI-compatible requests reserve the larger value among this setting,
GenerationConfig.MaxTokens, andGenerationConfig.ThinkingTokens - Input Floor (InputTokensFloor): 1024 tokens - ensures proper model processing
- Output Floor (OutputTokensFloor): deprecated and no longer used to auto-calculate output
MaxTokens - Safety Margin Ratio (SafetyMarginRatio): 10% - buffer for token counting inaccuracies
- Max Input Ratio (MaxInputTokensRatio): 100% - maximum input ratio of context window
Tailoring Strategy:
The framework provides a default tailoring strategy that preserves messages according to the following priorities:
- System Messages: Highest priority, always preserved
- Latest User Message: Ensures the current conversation turn is complete
- Tool Call Related Messages: Maintains tool call context integrity
- Historical Messages: Retains as much conversation history as possible based on remaining space
Custom Tailoring Strategy:
You can implement the TailoringStrategy interface to customize the trimming logic:
Advanced Configuration (Custom Budget Parameters):
If the default token allocation strategy does not meet your needs, you can customize the budget parameters using WithTokenTailoringConfig. Note: It is recommended to keep the default values unless you have specific requirements.
For Anthropic models, you can use the same configuration:
7. Variant Optimization: Adapting to Platform-Specific Behaviors
The Variant mechanism is an important optimization in the Model module, used to handle platform-specific behavioral differences across OpenAI-compatible providers. By specifying different Variants, the framework can automatically adapt to API differences between platforms, especially for file upload, deletion, and processing logic.
7.1. Supported Variant Types
The framework currently supports the following Variants:
1. VariantOpenAI(default)
- Standard OpenAI API-compatible behavior
- File upload path:
/openapi/v1/files - File purpose:
user_data - File deletion Http method::
DELETE
2. VariantHunyuan(hunyuan)
- Tencent Hunyuan platform-specific adaptation
- File upload path::
/openapi/v1/files/uploads - File purpose:
file-extract - File deletion Http Method:
POST
3. VariantDeepSeek
- DeepSeek platform adaptation
- Default BaseURL:
https://api.deepseek.com - API Key environment variable name:
DEEPSEEK_API_KEY - DeepSeek-specific behavior is enabled when you explicitly set
WithVariant(openai.VariantDeepSeek)or use the official DeepSeek API BaseURL - Other behaviors are consistent with standard OpenAI
4. VariantQwen(Qwen)
- Qwen platform adaptation
- Default BaseURL:
https://dashscope.aliyuncs.com/compatible-mode/v1 - API Key environment variable name:
DASHSCOPE_API_KEY - Other behaviors are consistent with standard OpenAI
7.2. Usage
Usage Example:
7.3. Behavioral Differences of Variants Examples
Message content handling differences:
Environment variable auto-configuration
For certain Variants, the framework supports reading configuration from environment variables automatically:
8. Streaming Tool Call Deltas: ShowToolCallDelta
By default, the OpenAI adapter suppresses raw tool_calls chunks in streaming
responses. Tool calls are accumulated internally and only exposed once in the
final aggregated response via Response.Choices[0].Message.ToolCalls. This
keeps the stream clean for typical chat UIs that only render assistant text.
For advanced use cases (for example, when the model streams document content
inside tool arguments and you need to display it incrementally), you can turn
on raw tool call deltas with WithShowToolCallDelta:
When WithShowToolCallDelta(true) is enabled:
- Streaming chunks that contain
tool_callsare no longer suppressed by the adapter. - Each chunk is converted into a partial
model.Responsewhere:Response.IsPartial == trueResponse.Choices[0].Delta.ToolCallscontains the provider’s rawtool_callsdelta mapped tomodel.ToolCall:Typecomes from the providertypefield (for example,"function").Function.NameandFunction.Argumentsmirror the original tool name and JSON-encoded arguments string.IDandIndexpreserve the tool call identity so callers can stitch fragments together.
- The final aggregated response still exposes the merged tool calls in
Response.Choices[0].Message.ToolCalls, so existing tool execution logic (for example,FunctionCallResponseProcessor) continues to work unchanged.
Typical integration pattern when this flag is enabled:
- Read
Response.Choices[0].Delta.ToolCalls[*].Function.Argumentson each partial response. - Group chunks by tool call
IDand append theArgumentsfragments in order. - Once the accumulated string forms valid JSON, unmarshal it into your
business struct (for example,
{ "content": "..." }) and use it for progressive UI rendering.
If you do not need to inspect tool arguments during streaming, keep
WithShowToolCallDelta disabled to avoid handling partial JSON fragments and
to preserve the default clean text-streaming behavior.
Anthropic Model
Anthropic Model is used to interface with Claude models and compatible platforms, supporting streaming output, thought modes and tool calls, and providing a rich callback mechanism, while also allowing for flexible configuration of custom HTTP headers.
Configuration Method
Environment Variable Method
Code Method
Using the Model Directly
Streaming Output
Advanced Parameter Configuration
Multimodal Input
The Anthropic Model supports images and files in user messages:
- Images: Provide an image URL or raw image data. Supported MIME types are
image/jpeg,image/png,image/gif, andimage/webp. Images are sent as Anthropic image blocks. - PDFs: Provide a URL that Claude can access or raw data, using
application/pdf. PDFs are sent as Anthropic document blocks. - Text content: Use raw data only.
text/plainis sent as an Anthropic document block; text-based MIME types such astext/csvortext/htmlare sent as Anthropic text blocks. - Other formats: Office documents, JSON, and other formats are not parsed automatically. Convert them to text or PDF first. Audio and
FileIDare not supported. Non-PDF file URLs are sent as URL text only.
Advanced features
1. Callback Functions
2. Model Switching
Model switching allows dynamically changing the LLM model used by an Agent at runtime. This section shows static switching for Anthropic model instances at the agent level and per-request level. To select a model separately for each LLM call within the same runner.Run(...), use ModelSelector.
Agent-level Switching
Agent-level switching changes the Agent's default model, affecting all subsequent requests.
Approach 1: Direct Model Instance
Set the model directly by passing a model instance to SetModel:
Use Cases:
Approach 2: Switch by Name
Pre-register multiple models with WithModels, then switch by name using SetModelByName:
Use Cases:
Per-request Switching
Per-request switching allows temporarily specifying a model for a single request without affecting the Agent's default model or other requests. This is useful for scenarios where different models are needed for specific tasks.
Approach 1: Using WithModel Option
Use agent.WithModel to specify a model instance for a single request:
Approach 2: Using WithModelName Option (Recommended)
Use agent.WithModelName to specify a pre-registered model name for a single request:
Use Cases:
Configuration Details
WithModels Option:
- Accepts a
map[string]model.Modelwhere key is the model name and value is the model instance - If both
WithModelandWithModelsare set,WithModelspecifies the initial model - If only
WithModelsis set, the first model in the map will be used as the initial model (note: map iteration order is not guaranteed, so it's recommended to explicitly specify the initial model) - Reserved name:
__default__is used internally by the framework and should not be used
SetModelByName Method:
- Parameter: model name (string)
- Returns: error if the model name is not found
- The model must be pre-registered via
WithModels
Per-request Options:
agent.RunOptions.Model: Directly specify a model instanceagent.RunOptions.ModelName: Specify a pre-registered model name- Priority:
Model>ModelName> Agent default model - If the model specified by
ModelNameis not found, it falls back to the Agent's default model
Agent-level vs Per-request Comparison
| Feature | Agent-level Switching | Per-request Switching |
|---|---|---|
| Scope | All subsequent requests | Current request only |
| Usage | SetModel/SetModelByName |
RunOptions.Model/ModelName |
| State Change | Changes Agent default model | Does not change Agent state |
| Use Case | Global strategy adjustment | Specific task temporary needs |
| Concurrency | Affects all concurrent reqs | Does not affect other requests |
| Typical Examples | User tier, time-based policy | Complex queries, reasoning |
Agent-level Approach Comparison
| Feature | SetModel | SetModelByName |
|---|---|---|
| Usage | Pass model instance | Pass model name |
| Pre-registration | Not required | Required via WithModels |
| Error Handling | None | Returns error |
| Use Case | Simple switching | Complex scenarios, multi-model management |
| Code Maintenance | Need to hold model instances | Only need to remember names |
Important Notes
Agent-level Switching:
- Immediate Effect: After calling
SetModelorSetModelByName, the next request immediately uses the new model - Session Persistence: Switching models does not clear session history
- Independent Configuration: Each model retains its own configuration (temperature, max tokens, etc.)
- Concurrency Safe: Both switching approaches are concurrency-safe
Per-request Switching:
- Temporary Override: Only affects the current request, does not change the Agent's default model
- Higher Priority: Per-request model settings take precedence over the Agent's default model
- No Side Effects: Does not affect other concurrent requests or subsequent requests
- Flexible Combination: Can be used in combination with agent-level switching
Model-specific Prompts (LLMAgent):
- Use
llmagent.WithModelInstructions/llmagent.WithModelGlobalInstructionsto override prompts bymodel.Info().Namewhen the Agent switches models; it falls back to the Agent defaults when no mapping exists. - For a runnable example, see examples/model/promptmap.
Usage Example
For a complete interactive example, see examples/model/switch, which demonstrates both agent-level and per-request switching approaches.
3. Custom HTTP Headers
In environments like gateways, proprietary platforms, or proxy setups, model API requests often require additional HTTP headers (e.g., organization/tenant identifiers, grayscale routing, custom authentication, etc.). The Model module provides three reliable ways to add headers for "all model requests," including standard requests, streaming, file uploads, batch processing, etc.
Recommended order:
- Global header via
anthropic.WithHeaders(simplest for static headers) - Use Anthropic RequestOption to set global headers (flexible, middleware-friendly)
- Use a custom
http.RoundTripperinjection (advanced, more cross-cutting capabilities)
All methods affect streaming requests, as they use the same underlying client.
1. Using anthropic.WithHeaders for headers
2. Using Anthropic RequestOption to Set Global Headers
By using WithAnthropicClientOptions combined with anthropicopt.WithHeader or anthropicopt.WithMiddleware, you can inject headers into every request made by the underlying Anthropic client.
If you need to set headers conditionally (e.g., only for certain paths or depending on context values), you can use middleware:
Regarding "per-request" headers:
- The Agent/Runner will propagate
ctxto the model call; middleware can read the value fromreq.Context()to inject headers for "this call." - For message completion, the current API doesn't expose per-call BaseURL overrides; if switching is needed, create a model using a different BaseURL or modify the
r.URLin middleware.
4. Token Tailoring
Anthropic models also support Token Tailoring functionality, designed to automatically trim messages when they exceed the model's context window limits, ensuring requests can be successfully sent to the LLM API.
Automatic Mode (Recommended):
Advanced Mode:
For detailed explanations of the token calculation formula, tailoring strategy, and custom strategy implementation, please refer to Token Tailoring under OpenAI Model.
5. Streaming Tool Call Deltas: ShowToolCallDelta
By default, the Anthropic adapter accumulates tool-call arguments internally and returns the complete arguments once the model finishes the tool call, through Response.Choices[0].Message.ToolCalls in the final response.
WithShowToolCallDelta exposes Anthropic input_json_delta chunks during streaming. Enable this option when tool arguments are long, or when the frontend needs to display argument generation progress:
When WithShowToolCallDelta(true) is enabled:
- Streaming responses may include
Response.Choices[0].Delta.ToolCalls. Delta.ToolCalls[*].Function.Argumentsis the newly produced argument string fragment, and is usually not complete JSON.Delta.ToolCalls[*].IDandIndexcan be used to join fragments for the same tool call.- The final response still returns the complete tool call in
Response.Choices[0].Message.ToolCalls, so existing tool execution logic can continue to use the final response unchanged.
Typical handling pattern:
- Read
Response.Choices[0].Delta.ToolCalls[*].Function.Argumentson each partial response. - Group chunks by tool call
IDorIndex, and append theArgumentsfragments in order. - Treat the accumulated value as the tool argument JSON. For progressive rendering, render the string as it grows and unmarshal it only after it becomes valid JSON.
Provider
With the emergence of multiple large model providers, some have defined their own API specifications. Currently, the framework has integrated the APIs of OpenAI and Anthropic, and exposes them as models. Users can access different provider models through openai.New and anthropic.New.
However, there are differences in instantiation and configuration between providers, which often requires developers to modify a significant amount of code when switching between providers, increasing the cost of switching.
To solve this problem, the Provider offers a unified model instantiation entry point. Developers only need to specify the provider and model name, and other configuration options are managed through the unified Option, simplifying the complexity of switching between providers.
The Provider supports the following Option:
| Option | Description |
|---|---|
WithAPIKey / WithBaseURL |
Set the API Key and Base URL for the model |
WithHTTPClientName / WithHTTPClientTransport |
Configure HTTP client properties |
WithHeaders |
Append static HTTP headers across requests |
WithChannelBufferSize |
Adjust the response channel buffer size |
WithCallbacks |
Configure OpenAI / Anthropic request, response, and streaming callbacks |
WithExtraFields |
Configure custom fields in the request body |
WithEnableTokenTailoring / WithMaxInputTokensWithTokenCounter / WithTailoringStrategy |
Token trimming related parameters |
WithTokenTailoringConfig |
Custom token tailoring budget parameters for advanced configuration |
WithOpenAIOption / WithAnthropicOption |
Pass-through native options for the respective providers |
Usage Example
Advanced Configuration with TokenTailoringConfig:
For advanced users who need to fine-tune token allocation strategy, you can use WithTokenTailoringConfig:
Full code can be found in examples/provider.
Registering a Custom Provider
The framework supports registering custom providers to integrate other large model providers or custom model implementations.
Using provider.Register, you can define a method to create custom model instances based on the options.
Model Failover
model/failover provides a priority-ordered fallback wrapper for models. It can compose multiple model.Model instances into a primary/backup chain and automatically switch to the next candidate when the current one is unavailable. This is useful for scenarios such as primary/backup domains, gateways, or model providers.
Quick Start:
failover.New(...) returns a regular model.Model, so it can be passed directly to places that accept model.Model, such as llmagent.WithModel(...). For a complete example, see examples/model/failover.
Switching Rules:
- Candidates are tried in the order provided to
WithCandidates(...). - Switching is allowed only before the first non-error chunk is received.
- If the current candidate returns an
errordirectly before that point, or returns an error response withResponse.Error, the wrapper continues with the next candidate. - Once any non-error chunk has been returned to the caller, later streaming failures are surfaced directly instead of replaying the request on the backup candidate.
Model Hedge
model/hedge provides a wrapper for hedging the time to first meaningful response. It starts the primary candidate immediately, then launches later candidates according to the configured hedge delay. If all currently active candidates have already failed, it launches the next candidate early. The first candidate that produces a meaningful response becomes the winner, and the remaining candidates are canceled.
Quick Start:
hedge.New(...) also returns a regular model.Model, so it can be passed directly to places that accept model.Model, such as llmagent.WithModel(...). This quick start uses the package default delay. Use WithDelay(...) or WithDelays(...) when you need explicit launch scheduling. If the hedge wrapper is used with context-threshold summary or token tailoring and the candidates have different or unknown context windows, set a stable wrapper window with hedge.WithContextWindow(...); otherwise the wrapper reports the shared candidate window only when all candidates expose the same positive value. For a complete example, see examples/model/hedge.
Scheduling And Commit Rules:
- The first candidate always starts immediately when the request begins.
WithDelay(...)defines one fixed interval between successive hedge launches, whileWithDelays(...)can specify explicit launch offsets for candidates1..n.- If all currently active candidates have already failed and there are still candidates that have not started, the wrapper launches the next candidate immediately instead of waiting for the timer.
- Only the first candidate that starts producing meaningful content becomes the winner. Empty or metadata-only responses do not win.
- Once a winner has been committed, the other candidates are canceled and only the winner's stream is forwarded to the caller.
Configuration Examples:
The following snippets show only the scheduling differences.
This configuration means:
candidate[0]starts at0ms.candidate[1]starts at100ms.candidate[2]starts at200ms.
This configuration means:
candidate[0]starts at0ms.candidate[1]starts at80ms.candidate[2]starts at250ms.
WithDelays(...) uses absolute launch offsets from request start, not incremental gaps relative to the previous candidate.
This configuration means:
candidate[0]starts at0ms.candidate[1]starts at0ms.candidate[2]starts at0ms.
This is the all-at-once case where every candidate launches immediately when the request begins. In fixed-interval form, the same setup can be written as WithDelay(0).
ModelSelector
ModelSelector dynamically selects a model for each framework-managed LLM call within the same runner.Run(...).
If an LLMAgent has its own fixed model selection policy, configure it when creating the agent:
Notes:
- When both are configured,
agent.WithModelSelector(...)takes precedence overllmagent.WithModelSelector(...). - If selector returns
nil, nil, the model is not switched and the currentinv.Modelis kept; returning an error terminates the current LLM call.
For a complete example, see examples/model/selector.