Session Summary
Overview
As conversations grow, maintaining the complete event history can consume significant memory and may exceed the LLM's context window. The session summary feature uses an LLM to automatically compress historical conversations into concise summaries, significantly reducing memory usage and token consumption while preserving important context.
Key Features
- Auto-trigger: During summary checks, automatically generates summaries based on event count, token count, or time thresholds
- Incremental processing: Only processes new events since the last summary, avoiding redundant computation
- LLM-driven: Uses any configured LLM model to generate high-quality, context-aware summaries
- Non-destructive: Original events are fully preserved; summaries are stored separately
- Async processing: Executes asynchronously in the background without blocking conversation flow
- Flexible configuration: Supports custom trigger conditions, prompts, and word limits
Basic Configuration
Step 1: Create Summarizer
Create a summarizer with an LLM model and configure trigger conditions:
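A minimal sketch of this step. The package paths and constructor names (`summary.NewSummarizer`, `openai.New`) are assumptions about the framework's layout; only the `With*` options come from the reference tables below:

```go
// Assumed constructors; adjust imports to your framework version.
summaryModel := openai.New("gpt-4o-mini") // any configured model works

summarizer := summary.NewSummarizer(
    summaryModel,
    summary.WithEventThreshold(20),   // summarize after 20 new events
    summary.WithMaxSummaryWords(200), // keep summaries concise
)
```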
Step 2: Configure Session Service
Integrate the summarizer into a session service:
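A hedged sketch, assuming an in-memory session service whose constructor accepts the summarizer as an option (the constructor and option names are illustrative; the worker and queue options are described later in this document):

```go
// Assumed session service constructor and option names.
sessionService := inmemory.NewSessionService(
    inmemory.WithSummarizer(summarizer),
    inmemory.WithAsyncSummaryNum(2),    // background summary workers
    inmemory.WithSummaryQueueSize(100), // async job queue capacity
)
```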
Step 3: Configure Agent and Runner
Create an Agent and configure summary injection behavior:
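A sketch of the wiring, assuming `llmagent`/`runner`-style constructors; `WithAddSessionSummary` controls summary injection (see "Context Injection Mechanism" below):

```go
// Assumed agent/runner constructor names.
agent := llmagent.New(
    "assistant",
    llmagent.WithModel(chatModel),
    llmagent.WithAddSessionSummary(true), // inject session summary into context
    llmagent.WithMaxHistoryRuns(10),      // used only when summary injection is off
)

r := runner.NewRunner("myapp", agent, runner.WithSessionService(sessionService))
```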
After completing the above configuration, the summary feature runs automatically.
SessionSummarizer Interface
Context-Aware Summary Checks
The released `SessionSummarizer` interface is unchanged. When summary gating depends on request context, use a `ContextChecker` with the context-aware check options:
The framework does not reserve any context keys for summary triggering. If your
application needs to distinguish different summary entry points, annotate the
context before calling the session APIs and read the value inside your
ContextChecker.
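A sketch of this pattern: the application tags the context with its own key before calling the session APIs, and the `ContextChecker` (signature from the options table below) reads the value back. The key type and wiring are illustrative:

```go
// Application-defined context key; the framework reserves none.
type entryPointKey struct{}

// Annotate the context before calling the session APIs.
ctx = context.WithValue(ctx, entryPointKey{}, "batch")

summarizer := summary.NewSummarizer(
    summaryModel,
    summary.WithChecksAnyContext(
        func(ctx context.Context, sess *session.Session) bool {
            // Gate summarization on the application-defined entry point.
            v, _ := ctx.Value(entryPointKey{}).(string)
            return v == "batch"
        },
    ),
)
```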
Summarizer Options
Trigger Conditions
| Option | Description |
|---|---|
| `WithEventThreshold(eventCount int)` | Trigger when the event count since the last summary exceeds the threshold |
| `WithTokenThreshold(tokenCount int)` | Trigger when the token count since the last summary exceeds the threshold |
| `WithTimeThreshold(interval time.Duration)` | Evaluated during summary checks; wraps `CheckTimeThreshold` and triggers when the checked session's last event is older than the interval |
Combined Conditions
| Option | Description |
|---|---|
| `WithChecksAll(checks ...Checker)` | All conditions must be met (AND logic); use `Check*` functions |
| `WithChecksAny(checks ...Checker)` | Any condition triggers (OR logic); use `Check*` functions |
| `WithChecksAllContext(checks ...ContextChecker)` | All request-scoped conditions must be met (AND logic) |
| `WithChecksAnyContext(checks ...ContextChecker)` | Any request-scoped condition triggers (OR logic) |

`ContextChecker` receives `(ctx context.Context, sess *session.Session)`.

Note: Use `Check*` functions (for example `CheckEventThreshold`) inside `WithChecksAll` and `WithChecksAny`, not `With*` options.
Summary Generation
| Option | Description |
|---|---|
| `WithMaxSummaryWords(maxWords int)` | Limit the summary word count; included in the prompt to guide the model |
| `WithPrompt(prompt string)` | Custom summary prompt; must contain the `{conversation_text}` placeholder |
| `WithSystemPrompt(prompt string)` | Add a separate system message for summarization instructions; must not contain `{conversation_text}` |
| `WithSkipRecent(skipFunc SkipRecentFunc)` | Custom function to skip recent events |
Hook Options
| Option | Description |
|---|---|
| `WithPreSummaryHook(h PreSummaryHook)` | Pre-summary hook; can modify input text |
| `WithPostSummaryHook(h PostSummaryHook)` | Post-summary hook; can modify the output summary |
| `WithSummaryHookAbortOnError(abort bool)` | Whether to abort on hook error; default false (ignore errors) |
Tool Call Formatting
By default, the summarizer includes tool calls and tool results in the conversation text sent to the LLM for summarization. The default format is:
- Tool calls: `[Called tool: toolName with args: {"arg": "value"}]`
- Tool results: `[toolName returned: result content]`
| Option | Description |
|---|---|
| `WithToolCallFormatter(f ToolCallFormatter)` | Customize how tool calls are formatted in summary input; return an empty string to exclude |
| `WithToolResultFormatter(f ToolResultFormatter)` | Customize how tool results are formatted in summary input; return an empty string to exclude |
Model Callbacks (Before/After Model)
The summarizer supports model callbacks around the underlying model.GenerateContent call, useful for modifying requests, short-circuiting with custom responses, or instrumentation.
| Option | Description |
|---|---|
| `WithModelCallbacks(callbacks *model.Callbacks)` | Register Before/After callbacks for the summarizer's underlying model calls |
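As a hedged sketch (only `WithModelCallbacks` and `*model.Callbacks` come from the table above; the registration helpers shown are assumptions):

```go
// Assumed callback registration helpers.
callbacks := model.NewCallbacks().
    RegisterBeforeModel(func(ctx context.Context, req *model.Request) (*model.Response, error) {
        // Returning a non-nil *model.Response here would short-circuit the call.
        log.Printf("summarizer request with %d messages", len(req.Messages))
        return nil, nil
    })

summarizer := summary.NewSummarizer(
    summaryModel,
    summary.WithModelCallbacks(callbacks),
)
```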
Checker Functions
Checker is a function type for determining whether to trigger summarization:
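Its assumed shape, consistent with the combined-condition options above:

```go
// Checker decides, for a given session, whether summarization should trigger.
type Checker func(sess *session.Session) bool
```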
Built-in Checkers
| Checker | Description |
|---|---|
| `CheckEventThreshold(eventCount int)` | Returns true when the number of delta events since the last summary exceeds the threshold |
| `CheckTimeThreshold(interval time.Duration)` | Returns true when the checked session's last event is older than the interval |
| `CheckTokenThreshold(tokenCount int)` | Returns true when the estimated token count of delta events since the last summary exceeds the threshold (estimated via `TokenCounter` from extracted conversation text, not `event.Response.Usage.TotalTokens`) |
| `ChecksAll(checks []Checker)` | Combines multiple Checkers; returns true only when all return true (AND) |
| `ChecksAny(checks []Checker)` | Combines multiple Checkers; returns true when any returns true (OR) |
Custom Prompt
Required placeholders:
- `{conversation_text}`: Must be included; replaced with the conversation content
- `{max_summary_words}`: Must be included in either `WithPrompt(...)` or `WithSystemPrompt(...)` when `maxSummaryWords > 0`
If you want to keep summarization instructions in a dedicated system message,
combine WithSystemPrompt with a lighter user prompt that only carries the
conversation payload:
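For example (this sketch only shows the placeholder split described above; the default prompts themselves are framework-defined):

```go
summarizer := summary.NewSummarizer(
    summaryModel,
    // Instructions live in a dedicated system message; no {conversation_text} here.
    summary.WithSystemPrompt(
        "You are a conversation summarizer. Reply with a summary of at most {max_summary_words} words."),
    // The user prompt carries only the conversation payload.
    summary.WithPrompt("Conversation:\n{conversation_text}"),
    summary.WithMaxSummaryWords(200),
)
```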
Notes:
- `WithPrompt` still renders into the user message
- `WithSystemPrompt` renders into a dedicated system message
- `WithSystemPrompt` must not include `{conversation_text}`; keep conversation content in the user prompt
Token Counter Configuration
By default, `CheckTokenThreshold` uses a built-in `SimpleTokenCounter` that estimates token count based on text length. To customize token counting behavior, use `summary.SetTokenCounter` to set a global token counter:
Notes:
- Global effect: `SetTokenCounter` affects all `CheckTokenThreshold` evaluations in the current process; set it once during application initialization
- Default counter: If not set, the default `SimpleTokenCounter` is used (approximately 4 characters per token)
Skip Recent Events
Use `WithSkipRecent` to skip recent events during summarization:
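A sketch; the `SkipRecentFunc` signature used here (candidate events in, number of trailing events to skip out) is an assumption:

```go
summarizer := summary.NewSummarizer(
    summaryModel,
    // Hypothetical SkipRecentFunc signature.
    summary.WithSkipRecent(func(events []event.Event) int {
        // Keep the 5 most recent events out of the summary so the live
        // conversation tail stays verbatim in context.
        if len(events) > 5 {
            return 5
        }
        return 0
    }),
)
```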
Summary Hooks
PreSummaryHook
Called before summary generation; can modify input text or events:
PostSummaryHook
Called after summary generation; can modify the output summary:
Usage Example
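A hedged sketch of both hooks together; the hook signatures are assumptions, and `redact` is a hypothetical helper:

```go
summarizer := summary.NewSummarizer(
    summaryModel,
    // Hypothetical hook signatures for illustration.
    summary.WithPreSummaryHook(func(ctx context.Context, input string) (string, error) {
        // Scrub sensitive values before the text reaches the LLM.
        return redact(input), nil
    }),
    summary.WithPostSummaryHook(func(ctx context.Context, generated string) (string, error) {
        // Tag the stored summary so it is recognizable downstream.
        return "[auto-summary] " + generated, nil
    }),
    summary.WithSummaryHookAbortOnError(true), // fail loudly instead of ignoring hook errors
)
```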
Summary Trigger Mechanism
Automatic Trigger (Recommended)
The Runner automatically checks trigger conditions after each conversation completes, generating summaries asynchronously in the background when conditions are met.
Trigger timing:
- Event count exceeds threshold (`WithEventThreshold`)
- Token count exceeds threshold (`WithTokenThreshold`)
- On a summary check, the checked session's last event is older than the interval (`WithTimeThreshold`)
- Custom combined conditions met (`WithChecksAny` / `WithChecksAll`)
`WithTimeThreshold` is not a standalone background timer. The condition is only evaluated when a summary check runs, typically after a conversation turn completes or when you call summary APIs manually. It checks the last event of the session being evaluated; in the Runner's normal delta-summary flow, that session contains only pending events, so this effectively means the latest unsummarized event. For example, `5*time.Minute` means "on the next summary check, if the checked session's last event is already older than 5 minutes, summarize now."
Manual Trigger
In some scenarios, you may need to manually trigger summarization:
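A sketch of both calls (the API names and `force` semantics come from the description below; the exact parameter order is an assumption):

```go
// Async (recommended): enqueue a background job; falls back to sync on failure.
if err := sessionService.EnqueueSummaryJob(
    ctx, sess, session.SummaryFilterKeyAllContents, false); err != nil {
    log.Printf("enqueue summary: %v", err)
}

// Sync: generate immediately, bypassing trigger conditions (force=true).
if err := sessionService.CreateSessionSummary(
    ctx, sess, session.SummaryFilterKeyAllContents, true); err != nil {
    log.Printf("create summary: %v", err)
}
```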
API description:
- `EnqueueSummaryJob`: Async summary (recommended)
  - Background processing, non-blocking
  - Auto-fallback to sync on failure
  - Suitable for production
- `CreateSessionSummary`: Sync summary
  - Immediate processing, blocks the current operation
  - Returns the result directly
  - Suitable for debugging or when immediate results are needed
Parameter description:
- `filterKey`: `session.SummaryFilterKeyAllContents` generates a summary for the full session
- `force`:
  - `false`: Respects configured trigger conditions; only generates a summary when conditions are met
  - `true`: Forces summary generation, completely ignoring all trigger condition checks
Use cases:
| Scenario | Recommended API | force |
|---|---|---|
| Normal conversation flow | Auto-trigger (no call needed) | - |
| Background batch processing | `EnqueueSummaryJob` | `false` |
| User-initiated request | `EnqueueSummaryJob` | `true` |
| Debug/Test | `CreateSessionSummary` | `true` |
| Session end | `EnqueueSummaryJob` | `true` |
Context Injection Mechanism
The framework provides two modes for managing conversation context sent to the LLM:
Mode 1: Enable Summary Injection (Recommended)
How it works:
- Session summary is merged into the existing system message if one exists, or prepended as a new system message if none exists
- This ensures compatibility with models that require a single system message at the beginning (e.g., Qwen3.5 series)
- Includes all incremental events after the summary point (no truncation)
- Guarantees complete context: compressed history + full new conversation
- `WithMaxHistoryRuns` parameter is ignored
Summary Injection Mode
By default, the summary is injected as a system message (merged into the existing system prompt). In this mode, the summary is protected by token tailoring's preserved head and will not be trimmed by the sliding window.
To allow the summary to participate in token-budget trimming for a true sliding-window experience, switch the injection mode to user:
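A sketch, assuming an agent-level option for the injection mode (the option name `WithSessionSummaryInjection` is illustrative; the constant names appear in the comparison table below):

```go
agent := llmagent.New(
    "assistant",
    llmagent.WithModel(chatModel),
    llmagent.WithAddSessionSummary(true),
    // Hypothetical option name: let the summary participate in sliding-window trimming.
    llmagent.WithSessionSummaryInjection(llmagent.SessionSummaryInjectionUser),
)
```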
Injection mode comparison:
| Mode | Injection Position | Token Tailoring Behavior | Use Case |
|---|---|---|---|
| `SessionSummaryInjectionSystem` (default) | Merged into system message | Summary in preserved head, never trimmed | Summary must always be present |
| `SessionSummaryInjectionUser` | Merged into the first user history/current message when possible; otherwise inserted near history | Summary participates in round trimming, can be evicted | Sliding window for very long conversations |
User mode message structure:
When the first history message is a user role, the summary is merged into it:
When the first history message is not a user role, the summary is a standalone user message:
Notes:
- In user mode, the processor first tries to merge the summary into the first user history/current message so it stays attached to the live user turn
- If there is no user history/current message and the prompt prefix already ends with a user message (for example, injected context), the summary falls back to that trailing user message instead of adding another adjacent user block
- User mode uses a more neutral default template ("Context from previous interactions") to avoid system-instruction tone in a user role message
- Custom `WithSummaryFormatter` also applies to user mode
- The summary generation pipeline is unaffected: injection mode only changes prompt assembly, not the summarizer itself
Tip: For very long conversations (hundreds of turns) where you want old summaries to naturally age out (replaced by newer summaries), use `SessionSummaryInjectionUser` mode.
Optional: Prompt-side context compaction
When WithEnableContextCompaction(true) is enabled, the framework adds two compaction passes before the LLM call:
Pass 1 — Historical tool result placeholder (ContextCompactionToolResultMaxTokens, default 1024 tokens):
- Tool results from older requests that exceed the threshold are replaced entirely with a short placeholder while keeping `ToolID` and `ToolName`
- The current request and the latest `ContextCompactionKeepRecentRequests` completed requests are never affected
- This cleans up accumulated long tool outputs from earlier conversation turns
Pass 2 — Oversized tool result truncation (ContextCompactionOversizedToolResultMaxTokens, default 0 / disabled):
- Applies to all tool results including the current request
- Tool results exceeding this threshold are truncated using head+tail preservation: the beginning and end of the content are kept, with a `[...N characters truncated...]` marker in the middle
- This is the safety net for single tool results large enough to overflow the context window on their own (e.g. `web_fetch` returning 800K+ chars of HTML)
The two passes have different roles: Pass 1 aggressively cleans old history (low threshold, full replacement); Pass 2 is a high-threshold guard that only kicks in for extreme cases but protects the current request too.
Pass 2 is disabled by default (0). It only fires when both (1) WithEnableContextCompaction(true) is set and (2) ContextCompactionOversizedToolResultMaxTokens > 0 (recommended opt-in value: 8192, exposed as the constant processor.DefaultContextCompactionOversizedToolResultMaxTokens). This guarantees that EnableContextCompaction=false always means "the framework will not modify any tool result".
Additionally:
- If `WithAddSessionSummary(true)` is also enabled and the rebuilt request still approaches the model context window, the framework performs one synchronous `CreateSessionSummary(...)` retry before calling the model
- Model-layer token tailoring remains the final fallback
Context structure:
Model Compatibility:
Some LLM providers have strict requirements for system message placement and count:
- Qwen3.5 series and similar models require the system message to be at the beginning and do not support multiple system messages
- The default merging behavior prevents errors like
System message must be at the beginning - Preloaded memory content is also merged into the system message using the same mechanism
Mode 2: Without Summary
How it works:
- No summary message added
- Only includes the most recent `MaxHistoryRuns` conversation turns
- `MaxHistoryRuns=0` means no limit; includes all history
- If `WithEnableContextCompaction(true)` is enabled, oversized tool results in older retained requests can still be compacted during request projection (Pass 1). If you also explicitly set `WithContextCompactionOversizedToolResultMaxTokens(8192)` (or another positive value), extremely large tool results in any request (including the current one) will be head+tail truncated (Pass 2). Both passes require the `EnableContextCompaction=true` master switch.
- The pre-LLM synchronous summary retry is disabled in this mode
Context structure:
Mode Selection Guide
| Scenario | Recommended Config | Description |
|---|---|---|
| Long sessions (support, assistant) | `AddSessionSummary=true` | Maintain full context, optimize tokens |
| Short sessions (single consultation) | `AddSessionSummary=false`, `MaxHistoryRuns=10` | Simple and direct, no summary overhead |
| Debug/Test | `AddSessionSummary=false`, `MaxHistoryRuns=5` | Quick validation, reduce noise |
| High concurrency | `AddSessionSummary=true`, increase worker count | Async processing, no impact on response speed |
If your long sessions frequently contain large tool outputs such as search results, logs, or code scan output, enable EnableContextCompaction=true. Pair it with AddSessionSummary=true when you also want the pre-LLM synchronous summary retry.
Tip: If your agent uses tools like `web_fetch` that can return extremely large results in a single call, `ContextCompactionOversizedToolResultMaxTokens` is particularly valuable: it prevents a single tool result from consuming the entire context window, even when that result belongs to the current (protected) request. It is disabled by default; opt in by enabling `WithEnableContextCompaction(true)` and passing a positive threshold (recommended: `8192`).
Summary Format Customization
By default, session summaries are formatted with context tags and a note about prioritizing current conversation information:
Default format:
You can use WithSummaryFormatter to customize the summary format:
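A sketch with a hypothetical formatter signature (`func(string) string`); the tag name and wording are illustrative, not the framework's defaults:

```go
agent := llmagent.New(
    "assistant",
    llmagent.WithAddSessionSummary(true),
    llmagent.WithSummaryFormatter(func(summaryText string) string {
        return "<history_summary>\n" + summaryText + "\n</history_summary>\n" +
            "Prefer information from the current conversation when it conflicts with this summary."
    }),
)
```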
Use cases:
- Simplified format: Use concise titles and minimal context hints to reduce token consumption
- Language localization: Translate context hints to the target language
- Role-specific format: Provide different formats for different Agent roles
- Model optimization: Adjust format based on specific model preferences
Retrieving Summaries
Filter Key support:
- When no option is provided, returns the full session summary (`SummaryFilterKeyAllContents`)
- When a specific filter key is provided but not found, falls back to the full session summary
- If neither exists, falls back to any available summary
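A sketch of retrieval, assuming an accessor like `GetSessionSummaryText` with an optional filter-key option (both names are illustrative, as is the filter key):

```go
// Full session summary (SummaryFilterKeyAllContents).
if text, ok := sessionService.GetSessionSummaryText(ctx, sess); ok {
    fmt.Println("summary:", text)
}

// Hypothetical option for a specific filter key; falls back as described above.
text, _ := sessionService.GetSessionSummaryText(
    ctx, sess, session.WithSummaryFilterKey("myapp/tool-events"))
```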
Summary by Event Type
In practice, you may want to generate independent summaries for different types of events.
Setting FilterKey with AppendEventHook
FilterKey Prefix Convention
Important: FilterKey must include the `appName + "/"` prefix.
Reason: The Runner uses `appName + "/"` as the filter prefix when filtering events. If the FilterKey doesn't have this prefix, events will be filtered out.
Generating Summaries by Type
How It Works
- Incremental processing: The summarizer tracks the last summary time for each session; subsequent runs only process events after the last summary
- Incremental summary: New events are combined with the previous summary to generate an updated summary containing both old context and new information
- Trigger condition evaluation: Before generating a summary, configured trigger conditions are evaluated. If conditions are not met and `force=false`, summarization is skipped
- Async workers: Summary tasks are distributed to multiple worker goroutines using a hash-based distribution strategy, ensuring tasks for the same session are processed in order
- Fallback mechanism: If async enqueue fails (queue full, context cancelled, or workers not initialized), the system automatically falls back to synchronous processing
Best Practices
- Choose appropriate thresholds: Set event/token thresholds based on the LLM's context window and conversation patterns. For GPT-4 (8K context), consider `WithTokenThreshold(4000)` to leave room for responses
- Use async processing: Always use `EnqueueSummaryJob` instead of `CreateSessionSummary` in production to avoid blocking conversation flow
- Monitor queue size: If you frequently see "queue is full" warnings, increase `WithSummaryQueueSize` or `WithAsyncSummaryNum`
- Customize prompts: Tailor summary prompts to your application needs. For example, if building a customer support Agent, focus on key issues and solutions
- Balance word limits: Set `WithMaxSummaryWords` to balance context preservation and token usage. A typical range is 100-300 words
- Test trigger conditions: Experiment with different `WithChecksAny` and `WithChecksAll` combinations to find the optimal balance between summary frequency and cost
Performance Considerations
- LLM cost: Each summary generation calls the LLM. Monitor trigger conditions to balance cost and context preservation
- Memory usage: Summaries are stored alongside events. Configure appropriate TTL to manage memory in long-running sessions
- Async workers: More workers increase throughput but consume more resources. Start with 2-4 workers and scale based on load
- Queue capacity: Adjust queue size based on expected concurrency and summary generation time
Complete Example
Here is a complete example demonstrating how all components work together:
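A hedged end-to-end sketch. Constructor names and package wiring are assumptions consistent with the earlier snippets; the `With*` and `Check*` options come from this document:

```go
func main() {
    ctx := context.Background()

    // Step 1: summarizer with combined trigger conditions (OR logic).
    summarizer := summary.NewSummarizer(
        openai.New("gpt-4o-mini"),
        summary.WithChecksAny(
            summary.CheckEventThreshold(20),
            summary.CheckTokenThreshold(4000),
            summary.CheckTimeThreshold(5*time.Minute),
        ),
        summary.WithMaxSummaryWords(200),
    )

    // Step 2: session service with async summary workers.
    sessionService := inmemory.NewSessionService(
        inmemory.WithSummarizer(summarizer),
        inmemory.WithAsyncSummaryNum(2),
    )

    // Step 3: agent + runner with summary injection enabled.
    agent := llmagent.New(
        "assistant",
        llmagent.WithModel(openai.New("gpt-4o")),
        llmagent.WithAddSessionSummary(true),
    )
    r := runner.NewRunner("myapp", agent, runner.WithSessionService(sessionService))

    // From here, runs through r trigger summary checks automatically
    // after each conversation turn completes.
    _ = ctx
    _ = r
}
```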