Session Summary
Overview
As conversations grow, maintaining complete event history can consume significant memory and may exceed the LLM's context window limit. The session summary feature uses an LLM to automatically compress historical conversations into concise summaries, significantly reducing memory usage and token consumption while preserving important context.
Key Features
- Auto-trigger: During summary checks, automatically generates summaries based on event count, token count, or time thresholds
- Incremental processing: Only processes new events since the last summary, avoiding redundant computation
- LLM-driven: Uses any configured LLM model to generate high-quality, context-aware summaries
- Non-destructive: Original events are fully preserved; summaries are stored separately
- Async processing: Executes asynchronously in the background without blocking conversation flow
- Flexible configuration: Supports custom trigger conditions, prompts, and word limits
Basic Configuration
Step 1: Create Summarizer
Create a summarizer with an LLM model and configure trigger conditions:
Step 2: Configure Session Service
Integrate the summarizer into a session service:
Step 3: Configure Agent and Runner
Create an Agent and configure summary injection behavior:
After completing the above configuration, the summary feature runs automatically.
Cache-Safe Summary Forking
By default, the summarizer sends a standalone summary request: an optional summary system prompt plus a user prompt containing the extracted conversation text. This is simple and remains the default behavior.
For long sessions where prompt-cache reuse matters, you can opt in to cache-safe forking:
When enabled and the framework has the parent model request available, the summarizer clones that parent request and appends one compacting user message at the end. This preserves the parent request prefix, including system context, history, and tools, so providers with prompt caching can reuse more cached input. If no parent request is available, for example in asynchronous or manual summary calls, the summarizer falls back to the default standalone request.
The appended compacting prompt is separate from WithPrompt(...) because it
does not embed {conversation_text}; the parent request already contains the
conversation. Use WithCacheSafeForkPrompt(...) only when you need to customize
that appended user message.
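The forking behavior described above can be sketched as follows. This is a minimal illustration, not the framework's implementation: `Message`, `Request`, and `buildSummaryRequest` are hypothetical names, and the real request type carries more fields (tools, generation settings) than shown here.

```go
package main

import "fmt"

// Message and Request are hypothetical stand-ins for the framework's model types.
type Message struct {
	Role    string
	Content string
}

type Request struct {
	Messages []Message
}

// buildSummaryRequest returns a forked request when a parent request is
// available, otherwise a standalone request built from the summary prompt.
func buildSummaryRequest(parent *Request, forkPrompt, standalonePrompt string) Request {
	if parent == nil {
		// Fallback: standalone request that carries the conversation text itself.
		return Request{Messages: []Message{{Role: "user", Content: standalonePrompt}}}
	}
	// Copy the parent messages so the prefix stays identical and the parent
	// is not mutated, then append exactly one compacting user message.
	msgs := make([]Message, len(parent.Messages), len(parent.Messages)+1)
	copy(msgs, parent.Messages)
	msgs = append(msgs, Message{Role: "user", Content: forkPrompt})
	return Request{Messages: msgs}
}

func main() {
	parent := &Request{Messages: []Message{
		{Role: "system", Content: "You are a helpful assistant."},
		{Role: "user", Content: "Plan my trip."},
	}}
	forked := buildSummaryRequest(parent, "Summarize the conversation so far.", "")
	fmt.Println(len(forked.Messages), len(parent.Messages))
}
```

Because the forked request shares its prefix with the parent request, a provider-side prompt cache keyed on that prefix has a chance to hit; only the appended message is new input.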
Cache-safe forking controls the request used to generate the summary. To make the next normal conversation request more cache friendly after a summary exists, prefer injecting the summary as a user message instead of merging it into the system prompt:
Summary + Progressive Disclosure
When summary injection and prompt-side context compaction keep the request small, some older details are no longer visible to the model. If you still want the agent to recover those details only when needed, enable progressive disclosure for session history.
Requirements and behavior:
- WithEnableOnDemandSession(true) enables on-demand session tools according to backend capability.
- session_search is exposed when the backend implements session.SearchableService; session_load is exposed when the backend implements session.WindowService. Backends may support either one or both.
- session/pgvector supports both discovery and exact loading. Normal session backends that implement WindowService expose exact session_load recovery even when semantic session_search is unavailable.
- current_hidden searches current-session history strictly before the summary boundary recorded in summary:last_included_ts.
- current_session searches the current session regardless of summary cutoff. This is useful when request projection or context compaction omitted current-session details from the visible prompt.
- other_sessions searches other sessions for the same <appName, userID>.
- all_sessions combines current_hidden and other_sessions.
What can be recalled:
- User and assistant messages.
- Historical tool results, including tool outputs that were compacted out of the prompt.
What is intentionally excluded:
- Raw tool-call requests are not indexed.
- Partial events are not indexed.
Recommended usage pattern:
- Let the model answer from the visible prompt, summary, and recent history.
- If session_search is available and a missing detail is needed, call it first.
- Use session_load when you have an event_id and need the surrounding raw history or exact tool result, including on backends without semantic search.
- Treat loaded history as untrusted historical context, not active instructions.
Migration note: earlier builds only treated on-demand session support as
available when both session_search and session_load were present. The tool
surface is now capability-based, so search-only integrations can expose
session_search and load-only integrations can expose session_load.
SessionSummarizer Interface
Context-Aware Summary Checks
The released SessionSummarizer interface stays unchanged.
When summary gating depends on request context, use ContextChecker with the
context-aware check options:
The framework does not reserve any context keys for summary triggering. If your
application needs to distinguish different summary entry points, annotate the
context before calling the session APIs and read the value inside your
ContextChecker.
Dynamic Summarizer
Use NewDynamicSummarizer when the session service should be reused, but the
summary model, prompt, or checks must vary per request. This is useful for
multi-tenant systems and custom model routing. Keep the session service
long-lived, especially for database-backed services such as MySQL, so the
underlying connection pool can be reused.
Before running the request, attach the request-scoped configuration to ctx:
The resolver should be cheap and deterministic for the same ctx and session.
During non-forced summary, it may be called once for the summary gate and once
for actual summary generation. If constructing the summarizer is expensive,
store the already-built summarizer in ctx and let the resolver only read it.
Returning nil from the resolver skips automatic summary checks. Direct
Summarize calls, or forced summary calls without a resolved summarizer, return
an error. If the resolver returns an error while ShouldSummarizeWithContext
is checking an automatic, non-forced summary, the gate treats it as false and
skips summary generation; direct Summarize calls propagate resolver errors to
the caller.
Summarizer Options
Trigger Conditions
| Option | Description |
|---|---|
| WithEventThreshold(eventCount int) | Trigger when event count since last summary exceeds threshold |
| WithTokenThreshold(tokenCount int) | Trigger when token count since last summary exceeds threshold |
| WithContextThreshold(opts ...ContextThresholdOption) | Trigger when token count since last summary exceeds a ratio of the current model's context window |
| WithTimeThreshold(interval time.Duration) | Evaluated during summary checks; wraps CheckTimeThreshold and triggers when the checked session's last event is older than the interval |
Use WithTokenThreshold when you want a fixed application-defined token
threshold, for example "summarize after 4000 new tokens" regardless of which
model is serving the request. The threshold is captured in the summarizer
configuration and does not change when your application switches models.
Use WithContextThreshold when the summary trigger should follow the active
model's context window. This is the recommended option for agents that can
switch models within a session. At summary-check time, the framework resolves
the context window in this order:
- Per-run override from agent.WithModelContextWindow(tokens)
- Model instance configuration from providers such as openai.WithContextWindow(tokens) or provider.WithContextWindow(tokens)
- Process-wide model-name registry from model.RegisterModelContextWindow(name, tokens)
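The resolution order above, combined with the ratio-based threshold (contextWindow * ratio, default 50%), can be sketched as follows. The function names and the "first positive value wins" rule are assumptions made for illustration; how the framework behaves when no source provides a window is not covered here.

```go
package main

import "fmt"

// resolveContextWindow returns the first positive value in priority order:
// per-run override, model instance configuration, process-wide registry.
func resolveContextWindow(perRun, instance int, registry map[string]int, modelName string) int {
	if perRun > 0 {
		return perRun
	}
	if instance > 0 {
		return instance
	}
	return registry[modelName] // 0 when the model name is not registered
}

// contextThreshold computes contextWindow * ratio, truncated to whole tokens.
func contextThreshold(contextWindow int, ratio float64) int {
	return int(float64(contextWindow) * ratio)
}

func main() {
	registry := map[string]int{"shared-model": 128000}

	// No per-run override: the instance configuration wins over the registry.
	window := resolveContextWindow(0, 32000, registry, "shared-model")
	fmt.Println(window, contextThreshold(window, 0.5))

	// Neither override nor instance value: fall back to the registry.
	window = resolveContextWindow(0, 0, registry, "shared-model")
	fmt.Println(window, contextThreshold(window, 0.5))
}
```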
The threshold is then computed as contextWindow * ratio (default 50%). For
private deployments, endpoint IDs, fine-tuned models, newly released models, or
multi-tenant custom model configuration, prefer the instance or per-run option
so different users do not overwrite a process-wide registry entry:
Use global registration only when the model name has a stable process-wide meaning:
Combined Conditions
| Option | Description |
|---|---|
| WithChecksAll(checks ...Checker) | All conditions must be met (AND logic), use Check* functions |
| WithChecksAny(checks ...Checker) | Any condition triggers (OR logic), use Check* functions |
| WithChecksAllContext(checks ...ContextChecker) | All request-scoped conditions must be met (AND logic) |
| WithChecksAnyContext(checks ...ContextChecker) | Any request-scoped condition triggers (OR logic) |
ContextChecker receives (ctx context.Context, sess *session.Session).
Note: Use Check* functions (for example CheckEventThreshold) inside
WithChecksAll and WithChecksAny, not With* functions.
Summary Generation
| Option | Description |
|---|---|
| WithMaxSummaryWords(maxWords int) | Limit summary word count; included in prompt to guide model |
| WithPrompt(prompt string) | Custom summary prompt; must contain {conversation_text} placeholder |
| WithSystemPrompt(prompt string) | Add a separate system message for summarization instructions; must not contain {conversation_text} |
| WithCacheSafeForking(enable bool) | Opt in to cache-safe summary request forking when a parent request is available. Disabled by default |
| WithCacheSafeForkPrompt(prompt string) | Customize the compacting user message appended in cache-safe fork mode. May include {max_summary_words}, but not {conversation_text} |
| WithSkipRecent(skipFunc SkipRecentFunc) | Custom function to skip recent events |
Hook Options
| Option | Description |
|---|---|
| WithPreSummaryHook(h PreSummaryHook) | Pre-summary hook; can modify input text |
| WithPostSummaryHook(h PostSummaryHook) | Post-summary hook; can modify output summary |
| WithSummaryHookAbortOnError(abort bool) | Whether to abort on hook error; default false (ignore errors) |
Tool Call Formatting
By default, the summarizer includes tool calls and tool results in the conversation text sent to the LLM for summarization. The default format is:
- Tool calls: [Called tool: toolName with args: {"arg": "value"}]
- Tool results: [toolName returned: result content]
| Option | Description |
|---|---|
| WithToolCallFormatter(f ToolCallFormatter) | Customize how tool calls are formatted in summary input. Return empty string to exclude |
| WithToolResultFormatter(f ToolResultFormatter) | Customize how tool results are formatted in summary input. Return empty string to exclude |
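As a rough illustration of the default format and of the "return empty string to exclude" convention, the sketch below formats tool calls and results with plain strings. The function names and the `debug_dump` tool are hypothetical; the real ToolCallFormatter and ToolResultFormatter types take framework-specific arguments that are not reproduced here.

```go
package main

import "fmt"

// formatToolCall mirrors the default tool-call format described above.
func formatToolCall(name, args string) string {
	return fmt.Sprintf("[Called tool: %s with args: %s]", name, args)
}

// formatToolResult mirrors the default tool-result format described above.
func formatToolResult(name, content string) string {
	return fmt.Sprintf("[%s returned: %s]", name, content)
}

// quietToolResult excludes a noisy (hypothetical) tool from the summary input
// by returning an empty string, and uses the default format for everything else.
func quietToolResult(name, content string) string {
	if name == "debug_dump" {
		return ""
	}
	return formatToolResult(name, content)
}

func main() {
	fmt.Println(formatToolCall("search", `{"q": "golang"}`))
	fmt.Println(quietToolResult("search", "3 results"))
	fmt.Println(quietToolResult("debug_dump", "very long output") == "")
}
```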
Model Callbacks (Before/After Model)
The summarizer supports model callbacks around the underlying model.GenerateContent call, useful for modifying requests, short-circuiting with custom responses, or instrumentation.
| Option | Description |
|---|---|
| WithModelCallbacks(callbacks *model.Callbacks) | Register Before/After callbacks for the summarizer's underlying model calls |
Checker Functions
Checker is a function type for determining whether to trigger summarization:
Built-in Checkers
| Checker | Description |
|---|---|
| CheckEventThreshold(eventCount int) | Returns true when the number of delta events since the last summary exceeds the threshold |
| CheckTimeThreshold(interval time.Duration) | Returns true when the checked session's last event is older than the interval |
| CheckTokenThreshold(tokenCount int) | Returns true when the estimated token count of delta events since the last summary exceeds the threshold (estimated via TokenCounter from extracted conversation text, not event.Response.Usage.TotalTokens) |
| ChecksAll(checks []Checker) | Combines multiple Checkers; returns true only when all return true (AND) |
| ChecksAny(checks []Checker) | Combines multiple Checkers; returns true when any returns true (OR) |
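The AND/OR composition in the table can be illustrated with a simplified, self-contained model. `Session`, `Checker`, and the lowercase helper names below are hypothetical stand-ins, not the framework's types; in particular, the result for an empty check list is an assumption of this sketch.

```go
package main

import "fmt"

// Session is a simplified stand-in holding only the counters these checks need.
type Session struct {
	DeltaEvents int // events since the last summary
	DeltaTokens int // estimated tokens since the last summary
}

// Checker decides whether summarization should be triggered for a session.
type Checker func(s *Session) bool

func eventThreshold(n int) Checker {
	return func(s *Session) bool { return s.DeltaEvents > n }
}

func tokenThreshold(n int) Checker {
	return func(s *Session) bool { return s.DeltaTokens > n }
}

// checksAll returns true only when every checker returns true (AND).
// With no checkers it returns true, which is an assumption of this sketch.
func checksAll(checks []Checker) Checker {
	return func(s *Session) bool {
		for _, c := range checks {
			if !c(s) {
				return false
			}
		}
		return true
	}
}

// checksAny returns true as soon as one checker returns true (OR).
func checksAny(checks []Checker) Checker {
	return func(s *Session) bool {
		for _, c := range checks {
			if c(s) {
				return true
			}
		}
		return false
	}
}

func main() {
	s := &Session{DeltaEvents: 30, DeltaTokens: 1000}
	checks := []Checker{eventThreshold(20), tokenThreshold(4000)}
	fmt.Println(checksAll(checks)(s)) // false: the token check fails
	fmt.Println(checksAny(checks)(s)) // true: the event check passes
}
```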
Custom Prompt
Required placeholders:
- {conversation_text}: Must be included; replaced with conversation content
- {max_summary_words}: Must be included in either WithPrompt(...) or WithSystemPrompt(...) when maxSummaryWords > 0
If you want to keep summarization instructions in a dedicated system message,
combine WithSystemPrompt with a lighter user prompt that only carries the
conversation payload:
Notes:
- WithPrompt still renders into the user message
- WithSystemPrompt renders into a dedicated system message
- WithSystemPrompt must not include {conversation_text}; keep conversation content in the user prompt
Token Counter Configuration
By default, CheckTokenThreshold uses a built-in SimpleTokenCounter that estimates token count based on text length. To customize token counting behavior, use summary.SetTokenCounter to set a global token counter:
For SimpleTokenCounter, WithApproxRunesPerToken(v) means roughly v UTF-8 characters per token. The formula is estimatedTokens = countedUTF8Runes / v. For example, v=1.5 means about 1.5 characters per token; do not treat it as a token multiplier.
Notes:
- Global effect: SetTokenCounter affects all CheckTokenThreshold evaluations in the current process; set it once during application initialization
- Default counter: If not set, the default SimpleTokenCounter is used (approximately 4 characters per token)
- Parameter meaning: v in WithApproxRunesPerToken(v) is characters per token. Passing 2.0/3.0 means about 0.67 characters per token, which is about 1.5 tokens per character
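To make the formula estimatedTokens = countedUTF8Runes / v concrete, here is a small standalone sketch. `estimateTokens` is a hypothetical helper, not SimpleTokenCounter itself; truncating toward zero and returning 0 for a non-positive v are assumptions of this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// estimateTokens divides the rune count by runesPerToken.
// Rounding (truncation here) is an assumption of this sketch.
func estimateTokens(text string, runesPerToken float64) int {
	if runesPerToken <= 0 {
		return 0
	}
	return int(float64(utf8.RuneCountInString(text)) / runesPerToken)
}

func main() {
	// 8 runes at about 4 runes per token.
	fmt.Println(estimateTokens("abcdefgh", 4)) // 2

	// A smaller v yields MORE estimated tokens for the same text:
	// 6 runes at 1.5 runes per token.
	fmt.Println(estimateTokens("abcdef", 1.5)) // 4

	// Runes, not bytes, are counted: 4 CJK runes occupy 12 bytes in UTF-8.
	fmt.Println(estimateTokens("你好世界", 2)) // 2
}
```

The second call shows why v should not be read as a multiplier: lowering v from 4 to 1.5 raises the estimate for the same text.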
Skip Recent Events
Use WithSkipRecent to skip recent events during summarization:
Summary Hooks
PreSummaryHook
Called before summary generation; can modify input text or events:
PostSummaryHook
Called after summary generation; can modify the output summary:
Usage Example
Summary Trigger Mechanism
Automatic Trigger (Recommended)
The Runner automatically checks trigger conditions after each conversation completes, generating summaries asynchronously in the background when conditions are met.
Trigger timing:
- Event count exceeds threshold (WithEventThreshold)
- Token count exceeds threshold (WithTokenThreshold)
- Token count exceeds the configured ratio of the active model's context window (WithContextThreshold)
- On a summary check, the checked session's last event is older than the interval (WithTimeThreshold)
- Custom combined conditions met (WithChecksAny / WithChecksAll)
WithTimeThreshold is not a standalone background timer. The condition is only evaluated when a summary check runs, typically after a conversation turn completes or when you call summary APIs manually. It checks the last event of the session being evaluated; in the Runner's normal delta-summary flow, that session contains only pending events, so this effectively means the latest unsummarized event. For example, 5*time.Minute means "on the next summary check, if the checked session's last event is already older than 5 minutes, summarize now."
Manual Trigger
In some scenarios, you may need to manually trigger summarization:
API description:
- EnqueueSummaryJob: Async summary (recommended)
  - Background processing, non-blocking
  - Auto-fallback to sync on failure
  - Suitable for production
- CreateSessionSummary: Sync summary
  - Immediate processing, blocks current operation
  - Returns result directly
  - Suitable for debugging or when immediate results are needed
Parameter description:
- filterKey: session.SummaryFilterKeyAllContents generates a summary for the full session
- force parameter:
  - false: Respects configured trigger conditions; only generates summary when conditions are met
  - true: Forces summary generation, completely ignoring all trigger condition checks
Use cases:
| Scenario | Recommended API | force |
|---|---|---|
| Normal conversation flow | Auto-trigger (no call needed) | - |
| Background batch processing | EnqueueSummaryJob | false |
| User-initiated request | EnqueueSummaryJob | true |
| Debug/Test | CreateSessionSummary | true |
| Session end | EnqueueSummaryJob | true |
Context Injection Mechanism
The framework provides two modes for managing conversation context sent to the LLM:
Before choosing a mode, distinguish the three context-reduction mechanisms:
| Mechanism | Layer | What changes | Typical use |
|---|---|---|---|
| Summary | Session Service + prompt assembly | Uses an LLM to create a persisted summary of historical events. With WithAddSessionSummary(true), the request injects that summary and appends only incremental events after the summary point | Preserve semantic continuity in long sessions while avoiding repeated full-history prompts |
| Context compaction | Agent prompt assembly | Does not call an LLM and does not drop whole turns. It only rewrites tool result content during request projection, such as replacing old results with placeholders or truncating oversized results with head+tail preservation | Keep the conversation structure and active tool chain while shrinking large tool outputs |
| Token tailoring | Model provider | Drops or keeps message rounds according to a token budget right before the provider call. The default strategy tries to preserve system messages and the latest turn, but preservation is still limited by the available budget | Final fallback to keep the request within the model context window |
The normal call path is: the agent assembles the prompt, injects the summary when
configured, and optionally compacts tool result content. If summary injection is
enabled and the compacted request still approaches the context window, the flow
may synchronously refresh the summary once and rebuild the request before the LLM
call. Finally, model-layer token tailoring trims the message list by budget. In
short, context compaction and token tailoring can both reduce prompt size, but
compaction shrinks tool-output payloads inside messages, while tailoring drops
message rounds. Summary is different again: it creates a semantic replacement for
historical context.
Mode 1: Enable Summary Injection (Recommended)
How it works:
- Session summary is merged into the existing system message if one exists, or prepended as a new system message if none exists
- This ensures compatibility with models that require a single system message at the beginning (e.g., Qwen3.5 series)
- Includes all incremental events after the summary point (no truncation)
- Guarantees complete context: compressed history + full new conversation
- WithMaxHistoryRuns parameter is ignored
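The merge-or-prepend rule for system-mode injection can be sketched as below. `Message` and `injectSummarySystem` are hypothetical names, and the blank-line separator between the existing system prompt and the summary is an assumption of this example.

```go
package main

import "fmt"

// Message is a hypothetical stand-in for the framework's message type.
type Message struct {
	Role    string
	Content string
}

// injectSummarySystem merges the summary into the first system message, or
// prepends a new system message when none exists, so the request never ends
// up with more than one system message because of the summary.
func injectSummarySystem(msgs []Message, summary string) []Message {
	for i, m := range msgs {
		if m.Role == "system" {
			out := make([]Message, len(msgs))
			copy(out, msgs)
			out[i].Content = m.Content + "\n\n" + summary
			return out
		}
	}
	return append([]Message{{Role: "system", Content: summary}}, msgs...)
}

func main() {
	withSystem := []Message{{Role: "system", Content: "Be concise."}, {Role: "user", Content: "Hi"}}
	merged := injectSummarySystem(withSystem, "Summary: user prefers Go.")
	fmt.Println(len(merged), merged[0].Role)

	noSystem := []Message{{Role: "user", Content: "Hi"}}
	prepended := injectSummarySystem(noSystem, "Summary: user prefers Go.")
	fmt.Println(len(prepended), prepended[0].Role)
}
```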
Summary Injection Mode
By default, the summary is injected as a system message (merged into the existing system prompt). In this mode, the summary is protected by token tailoring's preserved head and will not be trimmed by the sliding window.
To allow the summary to participate in token-budget trimming for a true sliding-window experience, switch the injection mode to user:
Injection mode comparison:
| Mode | Injection Position | Token Tailoring Behavior | Use Case |
|---|---|---|---|
| SessionSummaryInjectionSystem (default) | Merged into system message | Summary in preserved head, never trimmed | Summary must always be present |
| SessionSummaryInjectionUser | Merged into the first user history/current message when possible; otherwise inserted near history | Summary participates in round trimming, can be evicted | Sliding window for very long conversations |
User mode message structure:
When the first history message is a user role, the summary is merged into it:
When the first history message is not a user role, the summary is a standalone user message:
Notes:
- In user mode, the processor first tries to merge the summary into the first user history/current message so it stays attached to the live user turn
- If there is no user history/current message and the prompt prefix already ends with a user message (for example, injected context), the summary falls back to that trailing user message instead of adding another adjacent user block
- User mode uses a more neutral default template ("Context from previous interactions") to avoid system-instruction tone in a user role message
- Custom WithSummaryFormatter also applies to user mode
- The summary generation pipeline is unaffected: injection mode only changes prompt assembly, not the summarizer itself
Tip: For very long conversations (hundreds of turns) where you want old summaries to naturally age out (replaced by newer summaries), use SessionSummaryInjectionUser mode.
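The two user-mode cases (merge into a leading user message, or add a standalone user message) can be sketched as follows. This is a simplified illustration: `Message` and `injectSummaryUser` are hypothetical, placing the summary before the user text is an assumption, and the trailing-user-message fallback mentioned in the notes is not modeled.

```go
package main

import "fmt"

// Message is a hypothetical stand-in for the framework's message type.
type Message struct {
	Role    string
	Content string
}

// injectSummaryUser merges the summary into the first history message when it
// has the user role; otherwise it inserts a standalone user message in front.
func injectSummaryUser(history []Message, summary string) []Message {
	if len(history) > 0 && history[0].Role == "user" {
		out := make([]Message, len(history))
		copy(out, history)
		out[0].Content = summary + "\n\n" + history[0].Content
		return out
	}
	return append([]Message{{Role: "user", Content: summary}}, history...)
}

func main() {
	summary := "Context from previous interactions: user is planning a trip."

	userFirst := []Message{{Role: "user", Content: "What about hotels?"}}
	fmt.Println(len(injectSummaryUser(userFirst, summary)))

	assistantFirst := []Message{{Role: "assistant", Content: "Here are some flights."}}
	fmt.Println(len(injectSummaryUser(assistantFirst, summary)))
}
```

Because the summary lives in an ordinary user message in both cases, round-based token tailoring can evict it together with the round it belongs to.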
Context Compaction Details
Context compaction is not another name for summary, and it is not token
tailoring. It only targets tool result content, which is the part most likely
to grow unexpectedly. It does not summarize ordinary user/assistant messages
with an LLM, and it does not discard complete message rounds the way token
tailoring may.
Naming note: "compaction" in WithEnableContextCompaction(true) means prompt-side tool result compaction/pruning. Semantic summaries are still controlled by WithAddSessionSummary(true) and the configured session summarizer.
When WithEnableContextCompaction(true) is enabled, the framework adds two compaction passes before the LLM call:
Pass 1 — Historical tool result placeholder (ContextCompactionToolResultMaxTokens, default 1024 tokens):
- Tool results from older requests that exceed the threshold are replaced entirely with a short placeholder while keeping ToolID and ToolName
- The current request and the latest ContextCompactionKeepRecentRequests completed requests are never affected
- If ToolResultCompactionConfig.SkipRecentFunc returns a positive number, the request/invocation units that own those tail events are also treated as recent and skipped by Pass 1
- This cleans up accumulated long tool outputs from earlier conversation turns
Pass 2 — Oversized tool result truncation (ContextCompactionOversizedToolResultMaxTokens, default 0 / disabled):
- Applies to nearly all tool results including the current request. Tool results returned by session_load itself are skipped so recovered slices are not compacted again
- Tool results exceeding this threshold are truncated using head+tail preservation: the beginning and end of the content are kept, with a [...N characters truncated...] marker in the middle
- This is the safety net for single tool results large enough to overflow the context window on their own (e.g. web_fetch returning 800K+ chars of HTML)
The two passes have different roles: Pass 1 aggressively cleans old history (low threshold, full replacement); Pass 2 is a high-threshold guard that only kicks in for extreme cases but protects the current request too.
Pass 2 is disabled by default (0). It only fires when both (1) WithEnableContextCompaction(true) is set and (2) ContextCompactionOversizedToolResultMaxTokens > 0 (recommended opt-in value: 8192, exposed as the constant processor.DefaultContextCompactionOversizedToolResultMaxTokens). This guarantees that EnableContextCompaction=false always means "the framework will not modify any tool result".
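Head+tail preservation can be illustrated with a small standalone function. This is only a sketch: `truncateHeadTail` is a hypothetical helper that works on character counts chosen by the caller, whereas the framework decides based on a token threshold and picks its own head and tail sizes.

```go
package main

import "fmt"

// truncateHeadTail keeps the first keepHead and last keepTail runes of s and
// replaces the middle with a marker that records how many runes were dropped.
// Content that already fits is returned unchanged.
func truncateHeadTail(s string, keepHead, keepTail int) string {
	r := []rune(s)
	if keepHead < 0 || keepTail < 0 || len(r) <= keepHead+keepTail {
		return s
	}
	dropped := len(r) - keepHead - keepTail
	marker := fmt.Sprintf("[...%d characters truncated...]", dropped)
	return string(r[:keepHead]) + marker + string(r[len(r)-keepTail:])
}

func main() {
	fmt.Println(truncateHeadTail("abcdefghij", 3, 2)) // abc[...5 characters truncated...]ij
	fmt.Println(truncateHeadTail("short", 3, 2))      // short
}
```

Keeping both ends is a reasonable fit for tool output, where the start often carries headers or status and the end often carries the final result or error.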
Use WithToolResultCompactionConfig(...) when you need tool-name or recency policy:
- ForceCleanToolNames: historical results from these tools are replaced with a policy placeholder whenever context compaction is enabled, after current/recent protection is applied. This is useful for noisy tools such as shell, grep, or log dump tools.
- KeepToolNames: results from these tools are left untouched by context compaction. This is useful for recovery tools such as session_load and session_search when the model may need to read the exact payload.
- SkipRecentFunc: customizes how many tail events are considered recent. It affects Pass 0 force-clean and Pass 1 historical classification; Pass 2 can still truncate oversized recent/current tool results.
If the same tool name appears in both ForceCleanToolNames and KeepToolNames, KeepToolNames wins.
When the compacted event has an event_id, placeholders and truncation markers include recovery hints such as event_id, tool_call_id, and tool_name. With WithEnableOnDemandSession(true) and a session backend that implements session.WindowService, the model can call session_load with content_offset / content_limit to reload a precise slice of the original tool result. session_load output size is controlled by its own window parameters and content_limit; reload very large results in slices instead of requesting the full payload at once.
Additionally:
- If WithAddSessionSummary(true) is also enabled and the rebuilt request still approaches the model context window, the framework performs one synchronous CreateSessionSummary(...) retry before calling the model
- Model-layer token tailoring remains the final fallback. It trims whole message rounds, so keep recovered slices small enough that they still fit in the final provider request
- Context compaction uses SimpleTokenCounter by default. If your application uses a custom counter for CJK-heavy prompts or provider-specific tokenization, pass the same counter with WithContextCompactionTokenCounter(...) so Pass 1 decisions and Pass 2 truncation use the same estimate as token tailoring.
See examples/context_compaction for a full example. It calls a real model and, with -debug=true (the default), prints the exact request sent to the model, which makes it easy to verify whether large historical tool result payloads were replaced with placeholders.
Context structure:
Model Compatibility:
Some LLM providers have strict requirements for system message placement and count:
- Qwen3.5 series and similar models require the system message to be at the beginning and do not support multiple system messages
- The default merging behavior prevents errors like "System message must be at the beginning"
- Preloaded memory content is also merged into the system message using the same mechanism
Mode 2: Without Summary
How it works:
- No summary message added
- Only includes the most recent MaxHistoryRuns conversation turns
- MaxHistoryRuns=0 means no limit, includes all history
- If WithEnableContextCompaction(true) is enabled, oversized tool results in older retained requests can still be compacted during request projection (Pass 1). If you also explicitly set WithContextCompactionOversizedToolResultMaxTokens(8192) (or another positive value), extremely large tool results in any request (including the current one) will be head+tail truncated (Pass 2). Both passes require the EnableContextCompaction=true master switch.
- The pre-LLM synchronous summary retry is disabled in this mode
Context structure:
Mode Selection Guide
| Scenario | Recommended Config | Description |
|---|---|---|
| Long sessions (support, assistant) | AddSessionSummary=true | Maintain full context, optimize tokens |
| Short sessions (single consultation) | AddSessionSummary=false, MaxHistoryRuns=10 | Simple and direct, no summary overhead |
| Debug/Test | AddSessionSummary=false, MaxHistoryRuns=5 | Quick validation, reduce noise |
| High concurrency | AddSessionSummary=true, increase worker count | Async processing, no impact on response speed |
If your long sessions frequently contain large tool outputs such as search results, logs, or code scan output, enable EnableContextCompaction=true. Pair it with AddSessionSummary=true when you also want the pre-LLM synchronous summary retry.
Tip: If your agent uses tools like web_fetch that can return extremely large results in a single call, ContextCompactionOversizedToolResultMaxTokens is particularly valuable: it prevents a single tool result from consuming the entire context window, even when that result belongs to the current (protected) request. It is disabled by default; opt in by enabling WithEnableContextCompaction(true) and passing a positive threshold (recommended: 8192).
Summary Format Customization
By default, session summaries are formatted with context tags and a note about prioritizing current conversation information:
Default format:
You can use WithSummaryFormatter to customize the summary format:
Use cases:
- Simplified format: Use concise titles and minimal context hints to reduce token consumption
- Language localization: Translate context hints to the target language
- Role-specific format: Provide different formats for different Agent roles
- Model optimization: Adjust format based on specific model preferences
Retrieving Summaries
Filter Key support:
- When no option is provided, returns the full session summary (SummaryFilterKeyAllContents)
- When a specific filter key is provided but not found, falls back to the full session summary
- If neither exists, falls back to any available summary
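The fallback chain above can be sketched over a plain map of filter key to summary text. `pickSummary` and the `fullSessionKey` value are hypothetical and chosen only for illustration; the actual value of session.SummaryFilterKeyAllContents and the storage layout are not reproduced here.

```go
package main

import "fmt"

// fullSessionKey is a hypothetical stand-in for session.SummaryFilterKeyAllContents;
// its value here is chosen only for this sketch.
const fullSessionKey = "<all-contents>"

// pickSummary looks up key, then the full-session summary, then any summary.
func pickSummary(summaries map[string]string, key string) (string, bool) {
	if s, ok := summaries[key]; ok {
		return s, true
	}
	if s, ok := summaries[fullSessionKey]; ok {
		return s, true
	}
	for _, s := range summaries {
		return s, true // any available summary; map iteration order is unspecified
	}
	return "", false
}

func main() {
	summaries := map[string]string{
		"my-app/tool":  "tool summary",
		fullSessionKey: "full summary",
	}
	s, _ := pickSummary(summaries, "my-app/tool")
	fmt.Println(s) // tool summary
	s, _ = pickSummary(summaries, "my-app/missing")
	fmt.Println(s) // full summary
}
```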
Summary by Event Type
In practice, you may want to generate independent summaries for different types of events.
Setting FilterKey with AppendEventHook
FilterKey Prefix Convention
Important: FilterKey must include the appName + "/" prefix.
Reason: The Runner uses appName + "/" as the filter prefix when filtering events. If the FilterKey doesn't have this prefix, events will be filtered out.
Generating Summaries by Type
Restricting Summary Targets
By default, when a non-empty branch FilterKey triggers summarization, the
session service refreshes both that branch summary and the full-session summary
(SummaryFilterKeyAllContents). If some branches do not need summaries, you can
reduce LLM usage with an allowlist and optionally disable the full-session
cascade:
Behavior notes:
- WithSummaryFilterAllowlist(...) only controls non-empty branch summary targets. It does not block session.SummaryFilterKeyAllContents.
- WithCascadeFullSessionSummary(...) controls whether a non-empty branch trigger also refreshes the full-session summary.
- To keep only full-session summaries from branch-triggered automatic summary, pass an explicit empty allowlist and leave cascade enabled:
  - mysql.WithSummaryFilterAllowlist("") and mysql.WithSummaryFilterAllowlist() both mean "no branch keys are allowed"; with the default cascade behavior, the full-session summary still refreshes.
  - If you also set mysql.WithCascadeFullSessionSummary(false), non-empty branch triggers have no summary target and no summary is generated.
- Allowlist matching is hierarchical and segment-aware, not a raw string prefix check. Internally the framework appends the filter-key delimiter ("/") to both sides and then checks whether either key is an ancestor/descendant of the other.
- Examples:
  - Allowing my-app/tool matches my-app/tool and my-app/tool/search.
  - Allowing my-app/tool/search also matches my-app/tool.
  - Allowing my-app/tool does not match my-app/toolbox.
  - Allowing my-app/tool does not match other-app/tool.
- session.SummaryFilterKeyAllContents remains available for direct full-session summaries even when an allowlist is configured.
- Leaving the allowlist unset preserves the legacy behavior and allows every branch FilterKey to trigger summaries.
- Passing an explicit empty allowlist blocks branch summary targets; with cascade enabled, branch triggers still refresh the full-session summary.
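The segment-aware matching rule described above (append "/" to both sides, then test the ancestor/descendant relation in either direction) can be reproduced in a few lines. `allowlistMatches` is a hypothetical helper written from that description, not the framework's function.

```go
package main

import (
	"fmt"
	"strings"
)

// allowlistMatches appends the delimiter to both keys and reports whether one
// is an ancestor (or equal) of the other. The trailing "/" is what prevents
// "my-app/tool" from matching "my-app/toolbox".
func allowlistMatches(allowed, key string) bool {
	a, k := allowed+"/", key+"/"
	return strings.HasPrefix(k, a) || strings.HasPrefix(a, k)
}

func main() {
	fmt.Println(allowlistMatches("my-app/tool", "my-app/tool"))        // true
	fmt.Println(allowlistMatches("my-app/tool", "my-app/tool/search")) // true
	fmt.Println(allowlistMatches("my-app/tool/search", "my-app/tool")) // true
	fmt.Println(allowlistMatches("my-app/tool", "my-app/toolbox"))     // false
	fmt.Println(allowlistMatches("my-app/tool", "other-app/tool"))     // false
}
```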
How It Works
- Incremental processing: The summarizer tracks the last summary time for each session; subsequent runs only process events after the last summary
- Incremental summary: New events are combined with the previous summary to generate an updated summary containing both old context and new information
- Trigger condition evaluation: Before generating a summary, configured trigger conditions are evaluated. If conditions are not met and force=false, summarization is skipped
- Async workers: Summary tasks are distributed to multiple worker goroutines using a hash-based distribution strategy, ensuring tasks for the same session are processed in order
- Fallback mechanism: If async enqueue fails (queue full, context cancelled, or workers not initialized), the system automatically falls back to synchronous processing
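The last two points can be sketched together: hashing the session key to pick a worker keeps one session's jobs on one queue, and a non-blocking send lets the caller fall back to synchronous processing when that queue is full. The FNV-1a hash, the function names, and the channel-of-funcs queue are assumptions of this sketch; the framework's actual hash and job types may differ.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// workerIndex maps a session key to a stable worker index, so jobs for the
// same session always land on the same worker and keep their order.
func workerIndex(sessionKey string, workers int) int {
	if workers <= 0 {
		return 0
	}
	h := fnv.New32a()
	h.Write([]byte(sessionKey))
	return int(h.Sum32() % uint32(workers))
}

// enqueueOrRunSync tries a non-blocking send. If the queue is full, it runs
// the job inline and reports false, modeling the synchronous fallback.
func enqueueOrRunSync(queue chan func(), job func()) (queued bool) {
	select {
	case queue <- job:
		return true
	default:
		job()
		return false
	}
}

func main() {
	const workers = 4
	a := workerIndex("my-app/user-1/session-1", workers)
	b := workerIndex("my-app/user-1/session-1", workers)
	fmt.Println(a == b) // true: same session, same worker

	queue := make(chan func(), 1)
	fmt.Println(enqueueOrRunSync(queue, func() {})) // true: queued
	fmt.Println(enqueueOrRunSync(queue, func() {})) // false: queue full, ran inline
}
```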
Best Practices
- Choose appropriate thresholds: Use WithContextThreshold for agents whose model can change at runtime, and use WithTokenThreshold when you intentionally want a fixed token budget. For custom or tenant-provided models, prefer per-model WithContextWindow or per-run agent.WithModelContextWindow; use global registration only for stable process-wide model names
- Use async processing: Always use EnqueueSummaryJob instead of CreateSessionSummary in production to avoid blocking conversation flow
- Monitor queue size: If you frequently see "queue is full" warnings, increase WithSummaryQueueSize or WithAsyncSummaryNum
- Customize prompts: Tailor summary prompts to your application needs. For example, if building a customer support Agent, focus on key issues and solutions
- Balance word limits: Set WithMaxSummaryWords to balance context preservation and token usage. Typical range is 100-300 words
- Test trigger conditions: Experiment with different WithChecksAny and WithChecksAll combinations to find the optimal balance between summary frequency and cost
Performance Considerations
- LLM cost: Each summary generation calls the LLM. Monitor trigger conditions to balance cost and context preservation
- Memory usage: Summaries are stored alongside events. Configure appropriate TTL to manage memory in long-running sessions
- Async workers: More workers increase throughput but consume more resources. Start with 2-4 workers and scale based on load
- Queue capacity: Adjust queue size based on expected concurrency and summary generation time
Complete Example
Here is a complete example demonstrating how all components work together: