Error Handling
Overview
This document defines the standard error-handling model for graph workflows, Runner completion, subgraph propagation, and A2A transport.
Agent applications usually need two things at the same time:
- A machine-readable error signal for branching, retries, and reporting
- A stable way to keep useful business error details after the run has ended
In practice, those details may come from:
- A local node error
- A child subgraph or sub-agent
- A remote A2A agent
tRPC-Agent-Go provides one standard path for all three.
Design goals
The framework design follows four rules:
- Keep Response.Error as the transport-level failure signal, and use event.IsTerminalError() to decide whether that failure is terminal.
- Keep business-visible error collections in graph state.
- Let recoverable errors continue execution without losing the record.
- Let fatal errors still publish fallback business state before the run stops.
Covered Scenarios
This design covers the requirements that often pushed business teams to keep separate node-error helpers:
- collect local node errors, including recoverable ones
- keep business-visible error details after the run ends
- propagate child subgraph or sub-agent errors back to the parent
- let Runner expose fatal-path fallback state on runner.completion
- carry structured A2A task failures back into Response.Error
If an existing implementation stores node errors in graph state and reads them
back after completion, graph.ExecutionErrorCollector is the framework
equivalent of that pattern.
Document Structure
The main sections are:
- "Managing business error codes" explains the error-code model and ownership.
- "Recommended graph usage" shows framework integration in graph workflows.
- "Reading errors after the run" defines the Runner-side consumption pattern.
- "Subgraphs and sub-agents" and "A2A structured errors" matter only if your system crosses those boundaries.
Responsibility Split
The framework owns transport, propagation, and collection mechanics.
Framework responsibilities:
- where transport failures live: Response.Error
- when a transport failure is terminal: event.IsTerminalError()
- where business-visible records live: graph state
- how fatal fallback state reaches runner.completion
- how child fallback state is separated from normal child completion
- how structured A2A task failures become Response.Error again
Business code decides:
- which error codes exist
- which codes are recoverable
- which fallback route should run
- whether multiple records should be deduplicated or aggregated
- how errors should be persisted, alerted, or reported
The framework does not define a global business error-code registry. It standardizes how structured codes are carried and collected. The code namespace itself remains application-specific.
Managing business error codes
The framework supports error codes as a transport and normalization mechanism. It does not define a centralized business registry.
Summary:
- the framework does not own your business error-code catalog
- model.ResponseError.Code is a *string, so collected and transported codes are represented as strings
- existing integer-style codes are still supported and converted into decimal strings automatically
- for new business errors, prefer stable string code constants
Code representation
model.ResponseError.Code is defined as *string.
This representation is intentional:
- event streams and A2A metadata are easier to keep stable with string values
- string codes support namespaced business identifiers such as ORDER_INVENTORY_SOFT_TIMEOUT
- cross-language or cross-service integrations do not need to guess numeric ranges or enum ownership
If a system already uses numeric codes, they remain supported. The framework converts them into strings at the transport boundary.
Supported error conventions
By default, graph.NewExecutionError(...) uses
model.ResponseErrorFromError(err, model.ErrorTypeFlowError).
Go does not support overloaded methods. The table below describes alternative conventions across different error types. A single concrete error type would normally implement one code convention, not all of them.
| Optional method on your error type | Framework behavior |
|---|---|
| ErrorType() string | fills ResponseError.Type |
| ErrorCode() string | fills ResponseError.Code directly |
| Code() string | fills ResponseError.Code directly |
| Code() int | converts the value to a decimal string |
| Code() int32 | converts the value to a decimal string |
| Code() int64 | converts the value to a decimal string |
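The conventions in the table can be pictured as a series of interface checks. The sketch below is only an illustration of that idea, not the framework's implementation: the interface names and the codeOf helper are hypothetical, and the precedence between ErrorCode() and Code() is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// Hypothetical local interfaces, one per code convention in the table.
type errorCoder interface{ ErrorCode() string }
type stringCoder interface{ Code() string }
type intCoder interface{ Code() int }
type int32Coder interface{ Code() int32 }
type int64Coder interface{ Code() int64 }

// codeOf returns the string form of err's code, or false when err exposes
// none of the conventions. The case order here is an assumption.
func codeOf(err error) (string, bool) {
	switch e := err.(type) {
	case errorCoder:
		return e.ErrorCode(), true
	case stringCoder:
		return e.Code(), true
	case intCoder:
		return strconv.Itoa(e.Code()), true
	case int32Coder:
		return strconv.FormatInt(int64(e.Code()), 10), true
	case int64Coder:
		return strconv.FormatInt(e.Code(), 10), true
	}
	return "", false
}

// namedErr follows the string convention.
type namedErr struct{}

func (namedErr) Error() string     { return "inventory soft timeout" }
func (namedErr) ErrorCode() string { return "ORDER_INVENTORY_SOFT_TIMEOUT" }

// legacyErr follows the integer convention.
type legacyErr struct{}

func (legacyErr) Error() string { return "order not found" }
func (legacyErr) Code() int     { return 40401 }

func main() {
	for _, err := range []error{namedErr{}, legacyErr{}, errors.New("plain")} {
		code, ok := codeOf(err)
		fmt.Printf("code=%q found=%v\n", code, ok)
	}
}
```

Each concrete error type satisfies only one of these interfaces, which is why a single type normally implements one convention rather than several.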
Recommended default for new business code
The recommended pattern is:
- Keep stable string error codes in a small domain package.
- Return typed business errors from nodes, tools, or agents.
- Let the collector record those codes automatically.
- Let the default collector policy recover errors whose Recoverable() bool method returns true.
- Use WithExecutionErrorPolicy(...) only for custom fallback routing or optional normalization.
Example business error package:
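A minimal sketch of such a package, written as a single runnable file for convenience (in a real project it would live in its own package, such as ordererrors). The constant names, the constructor arguments, the ErrorType() value, and the second error are illustrative assumptions; only the method conventions come from the table above.

```go
package main

import "fmt"

// Stable string codes owned by the business domain (illustrative names).
const (
	CodeInventorySoftTimeout = "ORDER_INVENTORY_SOFT_TIMEOUT"
	CodePaymentDeclined      = "ORDER_PAYMENT_DECLINED"
)

// Error is a typed business error. It follows the ErrorCode() string,
// ErrorType() string, and Recoverable() bool conventions.
type Error struct {
	code        string
	message     string
	recoverable bool
}

func (e *Error) Error() string     { return e.code + ": " + e.message }
func (e *Error) ErrorCode() string { return e.code }
func (e *Error) ErrorType() string { return "order_error" } // assumed type label
func (e *Error) Recoverable() bool { return e.recoverable }

// NewInventorySoftTimeout builds a recoverable error; the sku parameter is
// a hypothetical argument chosen for this sketch.
func NewInventorySoftTimeout(sku string) *Error {
	return &Error{
		code:        CodeInventorySoftTimeout,
		message:     "inventory lookup timed out for " + sku,
		recoverable: true,
	}
}

// NewPaymentDeclined builds a fatal (non-recoverable) error.
func NewPaymentDeclined(reason string) *Error {
	return &Error{code: CodePaymentDeclined, message: reason}
}

func main() {
	var err error = NewInventorySoftTimeout("sku-42")
	fmt.Println(err)
}
```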
If you already have a legacy numeric-code system, it still works:
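As a sketch, assuming a pre-existing error type that exposes its code through Code() int (the LegacyError type, the NewOrderNotFound constructor, and the 40401 value are hypothetical):

```go
package main

import (
	"fmt"
	"strconv"
)

// LegacyError is a hypothetical pre-existing error type with an integer code.
type LegacyError struct {
	code int
	msg  string
}

func (e *LegacyError) Error() string { return fmt.Sprintf("[%d] %s", e.code, e.msg) }

// Code follows the Code() int convention.
func (e *LegacyError) Code() int { return e.code }

// NewOrderNotFound is an illustrative constructor.
func NewOrderNotFound(orderID string) *LegacyError {
	return &LegacyError{code: 40401, msg: "order " + orderID + " not found"}
}

func main() {
	err := NewOrderNotFound("o-1001")
	// The decimal-string form of the integer code.
	fmt.Println(strconv.Itoa(err.Code()))
}
```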
That error will be stored as a string code such as "40401" inside
ResponseError.Code.
Because ordererrors.Error implements Recoverable() bool, the default
collector policy already treats NewInventorySoftTimeout(...) as recoverable.
Example collector policy that keeps the default judgment and adds a custom fallback route:
If your internal errors are messy or wrapped by third-party libraries, use
ExecutionErrorPolicy.ResponseError to normalize them into one business-facing
shape before the record is stored.
Core building blocks
graph.ExecutionError
graph.ExecutionError is the normalized business record stored in state.
It contains:
- Severity: recoverable or fatal
- NodeID / NodeName / NodeType
- StepNumber
- Timestamp
- Error: a structured *model.ResponseError
graph.ExecutionErrorCollector
graph.ExecutionErrorCollector is the recommended framework helper.
It gives you:
- A ready-to-use state field and reducer
- A node callback that records recoverable and fatal errors
- A subgraph output mapper for propagating child errors back to the parent
graph.EmitCustomStateDelta
Fatal errors have a special problem: the graph may stop before it emits the
normal final graph.execution snapshot.
graph.EmitCustomStateDelta(...) solves that problem. It emits a custom event
with business state delta immediately, so downstream consumers can still see the
fallback state on the error path.
The execution error collector uses this helper automatically for fatal errors.
Recommended graph usage
1. Add a state field
This adds the default key graph.StateKeyExecutionErrors.
If you want a custom key:
2. Register the collector callbacks
This is the simplest framework-level setup. Any node error that reaches
AfterNode will now be recorded in the collector field.
3. Decide which errors are recoverable
graph.NewExecutionErrorCollector() now ships with a conservative default
policy:
- errors that implement Recoverable() bool and return true are recoverable
- errors wrapped with graph.MarkRecoverable(err) or created with graph.NewRecoverableError(...) are also recoverable
Example:
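The first rule can be illustrated without the framework. In the sketch below, isRecoverable is a local stand-in for the default judgment, not the collector's own code, and looking through wrapped errors with errors.As is an assumption about how such a check would typically be written.

```go
package main

import (
	"errors"
	"fmt"
)

// recoverable is the optional method an error type can implement.
type recoverable interface{ Recoverable() bool }

// isRecoverable is a local stand-in for the default rule: an error is
// recoverable when it (or an error it wraps) reports Recoverable() == true.
func isRecoverable(err error) bool {
	var r recoverable
	return errors.As(err, &r) && r.Recoverable()
}

// softTimeout is a hypothetical business error that opts in.
type softTimeout struct{ op string }

func (e softTimeout) Error() string   { return e.op + ": soft timeout" }
func (softTimeout) Recoverable() bool { return true }

func main() {
	wrapped := fmt.Errorf("check inventory: %w", softTimeout{op: "lookup"})
	fmt.Println(isRecoverable(wrapped))            // true
	fmt.Println(isRecoverable(errors.New("boom"))) // false
}
```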
If you want to extend that default rule, use
graph.WithRecoverableExecutionErrors(...):
When Recover is true, the collector writes a recoverable record into
state and keeps the graph running.
4. Optionally provide a replacement result
If a recoverable error should continue with a custom state update or route, use
ExecutionErrorPolicy.Replacement.
Preferred replacement types:
- graph.State
- *graph.Command
If Replacement is nil, the collector keeps the original graph.State or
*graph.Command result and merges execution_errors into it automatically.
If you need a custom replacement and still want the default recoverable
judgment, start from graph.DefaultExecutionErrorPolicy(...):
5. Complete graph setup example
If you want one copy-pasteable reference for a normal graph integration, start with this shape:
Reading errors after the run
Graph-only consumers
If the run reaches its normal end, read the collector key from the final
graph.execution event.
Runner consumers
If a fatal error stops the graph before graph.execution, Runner now copies
the fallback business state onto the final runner.completion event and also
attaches the terminal Response.Error there.
That means application code can use one simple rule:
- keep consuming until runner.completion
- read the collector key from its StateDelta
- use event.IsTerminalError() to find terminal failures, then read Response.Error
Complete Runner-side pattern:
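A minimal sketch of that consumer loop, using local stand-in types rather than the real runner and event packages. The runEvent, responseError, and executionError shapes, the JSON field names, the JSON-encoded StateDelta values, and the "execution_errors" key are all assumptions made so the example runs on its own.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local stand-ins; these are not the framework's event or model types.
type responseError struct{ Type, Message, Code string }

type runEvent struct {
	Object     string            // for example "runner.completion"
	StateDelta map[string][]byte // assumed: JSON-encoded value per state key
	Err        *responseError    // stand-in for Response.Error
	Terminal   bool              // stand-in for event.IsTerminalError()
}

type executionError struct {
	Severity string `json:"severity"`
	NodeID   string `json:"node_id"`
	Code     string `json:"code"`
}

const errorsKey = "execution_errors" // assumed collector key

// drain consumes events until runner.completion, then returns the collected
// records and the terminal error (nil if the run succeeded).
func drain(events <-chan runEvent) ([]executionError, *responseError, error) {
	var terminal *responseError
	for evt := range events {
		if evt.Terminal && evt.Err != nil {
			terminal = evt.Err
		}
		if evt.Object != "runner.completion" {
			continue
		}
		var records []executionError
		if raw, ok := evt.StateDelta[errorsKey]; ok {
			if err := json.Unmarshal(raw, &records); err != nil {
				return nil, terminal, err
			}
		}
		return records, terminal, nil
	}
	return nil, terminal, fmt.Errorf("stream closed before runner.completion")
}

func main() {
	events := make(chan runEvent, 2)
	events <- runEvent{Object: "graph.node.error"} // intermediate, not terminal
	events <- runEvent{
		Object: "runner.completion",
		StateDelta: map[string][]byte{
			errorsKey: []byte(`[{"severity":"fatal","node_id":"pay","code":"ORDER_PAYMENT_DECLINED"}]`),
		},
		Err:      &responseError{Type: "flow_error", Message: "payment declined", Code: "ORDER_PAYMENT_DECLINED"},
		Terminal: true,
	}
	close(events)

	records, terminal, err := drain(events)
	fmt.Println(len(records), terminal != nil, err)
}
```

The point of the design is that the consumer needs only one exit condition: both the business records and the terminal failure are available on the same final event.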
Subgraphs and sub-agents
There are two different needs here.
Live observation during child execution
Use graph.WithAgentNodeEventCallback(...) or graph-level node callbacks with
RegisterAgentEvent(...) when you want streaming observation of child events.
This is for:
- live SSE dashboards
- logging
- metrics
It is observational. It is not the recommended place to persist final state.
Final child-to-parent propagation
Use the collector's subgraph mapper:
This works for both:
- normal child completion (graph.execution)
- fatal child fallback state emitted before the child stops
For custom mappers, the child result now keeps those two cases separate:
- SubgraphResult.FinalState and SubgraphResult.RawStateDelta are only for the normal terminal graph.execution snapshot
- SubgraphResult.FallbackState and SubgraphResult.FallbackStateDelta are only for fatal child fallback state
If you intentionally want one code path for both, use:
- SubgraphResult.EffectiveState()
- SubgraphResult.EffectiveStateDelta()
ExecutionErrorCollector.SubgraphOutputMapper() already does this for you.
If you need custom parent-side state in addition to child error propagation,
compose your own mapper around collector.SubgraphStateUpdate(result):
A2A structured errors
Server side
If your A2A server should expose agent business errors as structured task failures, enable:
With this option enabled:
- unary A2A responses return a failed Task
- streaming A2A responses emit a failed TaskStatusUpdateEvent
- structured error fields are preserved in task metadata
- only terminal errors become failed tasks; intermediate graph events such as graph.node.error continue to flow as graph observability events
The payload is intentionally split into two layers:
- outer Task.Metadata or TaskStatusUpdateEvent.Metadata: the preferred machine-readable error fields such as error_type, error_code, error_message, task_state, and llm_response_id
- Status.Message.Metadata: the same machine-readable fields mirrored for A2A interaction spec 0.1 compatibility
- Status.Message.Parts: user-facing text to display directly
For example, a streaming terminal failure is shaped like this:
That leads to a simple business-side rule:
- branch on status.state
- read structured fields from outer metadata first
- accept status.message.metadata as a compatibility mirror for legacy consumers
- treat status.message.parts as display text, not as the primary source for machine branching
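The rule above can be sketched as a small parser. This is an illustration for consumers that read the raw payload themselves, not the code A2AAgent runs: the struct shapes, the JSON layout of the sample event, and the parseFailure and field helpers are assumptions. Only the metadata key names and the state values come from this document.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Assumed, simplified shape of a streaming status update.
type part struct {
	Text string `json:"text"`
}
type message struct {
	Metadata map[string]any `json:"metadata"`
	Parts    []part         `json:"parts"`
}
type status struct {
	State   string   `json:"state"`
	Message *message `json:"message"`
}
type statusUpdate struct {
	Metadata map[string]any `json:"metadata"`
	Status   status         `json:"status"`
}

type taskFailure struct{ Type, Code, Message, Display string }

// field prefers outer metadata and falls back to the compatibility mirror.
func field(outer, mirror map[string]any, key string) string {
	if v, ok := outer[key].(string); ok && v != "" {
		return v
	}
	v, _ := mirror[key].(string)
	return v
}

// parseFailure returns nil when the event is not a terminal failure.
func parseFailure(raw []byte) (*taskFailure, error) {
	var evt statusUpdate
	if err := json.Unmarshal(raw, &evt); err != nil {
		return nil, err
	}
	switch evt.Status.State { // branch on status.state first
	case "failed", "rejected", "canceled":
	default:
		return nil, nil
	}
	f := &taskFailure{}
	var mirror map[string]any
	if m := evt.Status.Message; m != nil {
		mirror = m.Metadata
		if len(m.Parts) > 0 {
			f.Display = m.Parts[0].Text // display text only, never branched on
		}
	}
	f.Type = field(evt.Metadata, mirror, "error_type")
	f.Code = field(evt.Metadata, mirror, "error_code")
	f.Message = field(evt.Metadata, mirror, "error_message")
	return f, nil
}

func main() {
	raw := []byte(`{
	  "metadata": {"error_type": "flow_error", "error_code": "ORDER_PAYMENT_DECLINED"},
	  "status": {
	    "state": "failed",
	    "message": {
	      "metadata": {"error_type": "flow_error", "error_message": "payment declined"},
	      "parts": [{"text": "Payment was declined."}]
	    }
	  }
	}`)
	f, err := parseFailure(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", *f)
}
```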
Client side
A2AAgent recognizes those structured task failures automatically.
For failed, rejected, or canceled remote tasks, it now emits a normal
event.Event with:
- Response.Object = "error"
- Response.Error.Type
- Response.Error.Message
- Response.Error.Code when available
In streaming mode, A2AAgent also stops emitting the synthetic final assistant
message after a terminal task error. This avoids the ambiguous pattern of
"error first, then normal final message".
In other words, the default client path already follows the same rule:
- outer metadata is the preferred source to rebuild Response.Error
- status.message.metadata remains a compatibility mirror for 0.1
- status.message.parts is only a human-readable fallback channel
- business code should keep branching on evt.Response.Error, not on parsed text content
Complete server and client setup:
If you integrate a third-party A2A provider with a different metadata convention, business code should extend the framework at the converter layer:
- keep the framework error model as Response.Error
- implement a2aagent.A2AEventConverter
- register it with a2aagent.WithCustomEventConverter(...)
That is the correct place for provider-specific adaptation. The business code should not re-invent a second error transport format after the conversion step.
Recommended ownership model
The cleanest production split is:
- framework: collect, propagate, serialize, and expose structured errors
- business error package: define code constants and typed errors
- graph policy: decide recoverable versus fatal behavior
- runner consumer: persist ExecutionErrors from runner.completion
- transport consumer: use event.IsTerminalError() together with Response.Error for terminal failure handling
That split is broad enough to replace an older business-side node-error helper without taking ownership of your domain-specific error taxonomy.
Example code
See these runnable examples:
- examples/graph/error_handling
- examples/a2aagent/error_handling
The graph example shows recoverable and fatal node errors with final state reading.
The A2A example shows server-side structured task errors and client-side
reconstruction into Response.Error.