Evaluation provides a comprehensive framework for agent assessment, supporting evaluation data management in both local file and in-memory modes, and offering multi-dimensional evaluation capabilities for agents.
Quick Start
This section describes how to run the Agent evaluation process in local file system or in-memory mode.
Local File System
The local mode maintains evaluation sets, evaluation metrics, and evaluation results on the local file system.
The EvalSet provides the dataset required for evaluation, including user input and its corresponding expected agent output.
The Metric defines the metric used to measure model performance, including the metric name and corresponding score threshold.
The Evaluator compares the actual session results with the expected session results, calculates the specific score, and determines the evaluation status based on the metric threshold.
The Evaluator Registry maintains the mapping between metric names and corresponding evaluators and supports dynamic registration and search of evaluators.
The Evaluation Service, as a core component, integrates the Agent to be evaluated, the EvalSet, the Metric, the Evaluator Registry, and the EvalResult Registry. The evaluation process is divided into two phases:
Inference phase: extracts the user input from the EvalSet, invokes the Agent to perform inference, and combines the Agent's actual output with the expected output to form the inference result.
Evaluation phase: Evaluate retrieves the corresponding evaluator from the registry based on the metric name, uses multiple evaluators to perform a multi-dimensional evaluation of the inference results, and finally generates the evaluation result, EvalResult.
Agent Evaluator: to reduce the randomness of the agent's output, it invokes the evaluation service NumRuns times and aggregates the results to obtain a more stable evaluation result.
EvalSet
An EvalSet is a collection of EvalCase instances, identified by a unique EvalSetID, serving as session data within the evaluation process.
An EvalCase represents a set of evaluation cases within the same Session and includes a unique identifier (EvalID), the conversation content, and session initialization information.
Conversation data includes three types of content: the user input (UserContent), the agent's final response (FinalResponse), and the agent's intermediate data such as tool calls and tool responses (IntermediateData).
```go
import (
    "google.golang.org/genai"

    "trpc.group/trpc-go/trpc-agent-go/evaluation/internal/epochtime"
)

// EvalSet represents an evaluation set.
type EvalSet struct {
    EvalSetID         string               // Unique identifier of the evaluation set.
    Name              string               // Evaluation set name.
    Description       string               // Evaluation set description.
    EvalCases         []*EvalCase          // All evaluation cases.
    CreationTimestamp *epochtime.EpochTime // Creation time.
}

// EvalCase represents a single evaluation case.
type EvalCase struct {
    EvalID            string               // Unique identifier of the case.
    Conversation      []*Invocation        // Conversation sequence.
    SessionInput      *SessionInput        // Session initialization data.
    CreationTimestamp *epochtime.EpochTime // Creation time.
}

// Invocation represents a user-agent interaction.
type Invocation struct {
    InvocationID      string
    UserContent       *genai.Content       // User input.
    FinalResponse     *genai.Content       // Agent final response.
    IntermediateData  *IntermediateData    // Agent intermediate response data.
    CreationTimestamp *epochtime.EpochTime // Creation time.
}

// IntermediateData represents intermediate data during execution.
type IntermediateData struct {
    ToolUses              []*genai.FunctionCall     // Tool calls.
    ToolResponses         []*genai.FunctionResponse // Tool responses.
    IntermediateResponses [][]any                   // Intermediate responses, including source and content.
}

// SessionInput represents session initialization input.
type SessionInput struct {
    AppName string                 // Application name.
    UserID  string                 // User ID.
    State   map[string]interface{} // Initial state.
}
```
The EvalSet Manager is responsible for performing operations such as adding, deleting, modifying, and querying evaluation sets. The interface definition is as follows:
```go
type Manager interface {
    // Get the specified EvalSet.
    Get(ctx context.Context, appName, evalSetID string) (*EvalSet, error)
    // Create a new EvalSet.
    Create(ctx context.Context, appName, evalSetID string) (*EvalSet, error)
    // List all EvalSet IDs.
    List(ctx context.Context, appName string) ([]string, error)
    // Delete the specified EvalSet.
    Delete(ctx context.Context, appName, evalSetID string) error
    // Get the specified case.
    GetCase(ctx context.Context, appName, evalSetID, evalCaseID string) (*EvalCase, error)
    // Add a case to the evaluation set.
    AddCase(ctx context.Context, appName, evalSetID string, evalCase *EvalCase) error
    // Update a case.
    UpdateCase(ctx context.Context, appName, evalSetID string, evalCase *EvalCase) error
    // Delete a case.
    DeleteCase(ctx context.Context, appName, evalSetID, evalCaseID string) error
}
```
The framework provides two implementations of the EvalSet Manager:
local: Stores the evaluation set in the local file system, with a file name format of <EvalSetID>.evalset.json.
inmemory: Stores the evaluation set in memory, ensuring a deep copy of all operations. This is suitable for temporary testing scenarios.
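The following is a minimal sketch of creating a local EvalSet Manager and adding a case. The import paths follow the Debug Server example later in this document; the application, evaluation set, and case names are illustrative:

```go
import (
    "context"

    "trpc.group/trpc-go/trpc-agent-go/evaluation/evalset"
    evalsetlocal "trpc.group/trpc-go/trpc-agent-go/evaluation/evalset/local"
)

func buildEvalSet(ctx context.Context) error {
    // Store evaluation sets under ./evaldata.
    mgr := evalsetlocal.New(evalset.WithBaseDir("./evaldata"))

    // Create a new evaluation set, then add a case to it.
    if _, err := mgr.Create(ctx, "math-app", "basic-math"); err != nil {
        return err
    }
    return mgr.AddCase(ctx, "math-app", "basic-math", &evalset.EvalCase{
        EvalID: "case-1",
        // Conversation and SessionInput are filled in with the expected
        // user inputs and agent outputs for this case.
    })
}
```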
Metric
A Metric is an evaluation indicator used to measure one aspect of an agent's performance on an EvalSet. Each metric includes the metric name, the evaluation criterion, and a score threshold.
During the evaluation process, the evaluator compares the actual conversation with the expected conversation according to the configured evaluation criterion, calculates the evaluation score for this metric, and compares it with the threshold:
When the evaluation score is lower than the threshold, the metric is determined as not passed.
When the evaluation score reaches or exceeds the threshold, the metric is determined as passed.
```go
import (
    "trpc.group/trpc-go/trpc-agent-go/evaluation/metric/criterion"
    "trpc.group/trpc-go/trpc-agent-go/evaluation/metric/criterion/tooltrajectory"
)

// EvalMetric represents a single metric used to evaluate an EvalCase.
type EvalMetric struct {
    MetricName string               // Metric name.
    Threshold  float64              // Score threshold.
    Criterion  *criterion.Criterion // Evaluation criterion.
}

// Criterion aggregates various evaluation criteria.
type Criterion struct {
    ToolTrajectory *tooltrajectory.ToolTrajectoryCriterion // Tool trajectory evaluation criterion.
}
```
The Metric Manager is responsible for managing evaluation metrics.
Each EvalSet can have multiple evaluation metrics, identified by MetricName.
```go
type Manager interface {
    // List returns all metric names for a specified EvalSet.
    List(ctx context.Context, appName, evalSetID string) ([]string, error)
    // Get gets a single metric from a specified EvalSet.
    Get(ctx context.Context, appName, evalSetID, metricName string) (*EvalMetric, error)
    // Add adds the metric to a specified EvalSet.
    Add(ctx context.Context, appName, evalSetID string, metric *EvalMetric) error
    // Delete deletes the specified metric.
    Delete(ctx context.Context, appName, evalSetID, metricName string) error
    // Update updates the specified metric.
    Update(ctx context.Context, appName, evalSetID string, metric *EvalMetric) error
}
```
The framework provides two implementations of the Metric Manager:
local: Stores evaluation metrics in the local file system, with file names in the format <EvalSetID>.metric.json.
inmemory: Stores evaluation metrics in memory, ensuring a deep copy for all operations. Suitable for temporary testing or quick verification scenarios.
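The following is a minimal sketch of registering a tool trajectory metric with a local Metric Manager. The import paths follow the Debug Server example later in this document; the application and evaluation set names are illustrative:

```go
import (
    "context"

    "trpc.group/trpc-go/trpc-agent-go/evaluation/metric"
    metriclocal "trpc.group/trpc-go/trpc-agent-go/evaluation/metric/local"
)

func addMetric(ctx context.Context) error {
    mgr := metriclocal.New(metric.WithBaseDir("./evaldata"))

    // Require a full score of 1.0 for the tool trajectory metric to pass.
    return mgr.Add(ctx, "math-app", "basic-math", &metric.EvalMetric{
        MetricName: "tool_trajectory_avg_score",
        Threshold:  1.0,
        // Criterion is left unset here; without it the framework falls back to
        // the default tool trajectory criterion (exact matching), as described
        // in the Evaluation Criterion section below.
    })
}
```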
Evaluator
The Evaluator calculates the final evaluation result based on actual sessions, expected sessions, and the evaluation metric.
The Evaluator outputs the following:
Overall evaluation score
Overall evaluation status
A list of session-by-session evaluation results
The evaluation results for a single session include:
Actual session
Expected session
Evaluation score
Evaluation status
The evaluation status is typically determined by both the score and the metric threshold:
If the evaluation score ≥ the metric threshold, the status is Passed
If the evaluation score < the metric threshold, the status is Failed
Note: The evaluator name Evaluator.Name() must match the evaluation metric name metric.MetricName.
```go
import (
    "context"

    "trpc.group/trpc-go/trpc-agent-go/evaluation/evalset"
    "trpc.group/trpc-go/trpc-agent-go/evaluation/metric"
    "trpc.group/trpc-go/trpc-agent-go/evaluation/status"
)

// Evaluator defines the general interface for evaluators.
type Evaluator interface {
    // Name returns the evaluator name.
    Name() string
    // Description returns the evaluator description.
    Description() string
    // Evaluate executes the evaluation logic, compares the actual and expected sessions, and returns the result.
    Evaluate(ctx context.Context, actuals, expecteds []*evalset.Invocation, evalMetric *metric.EvalMetric) (*EvaluateResult, error)
}

// EvaluateResult represents the aggregated results of the evaluator across multiple sessions.
type EvaluateResult struct {
    OverallScore         float64               // Overall score.
    OverallStatus        status.EvalStatus     // Overall status: passed, failed, or not evaluated.
    PerInvocationResults []PerInvocationResult // Evaluation results for each session.
}

// PerInvocationResult represents the evaluation result for a single session.
type PerInvocationResult struct {
    ActualInvocation   *evalset.Invocation // Actual session.
    ExpectedInvocation *evalset.Invocation // Expected session.
    Score              float64             // Current session score.
    Status             status.EvalStatus   // Current session status.
}
```
Registry
Registry is used to centrally manage and access various evaluators.
Methods include:
Register(name string, e Evaluator): Registers an evaluator with a specified name.
Get(name string): Gets an evaluator instance by name.
```go
import "trpc.group/trpc-go/trpc-agent-go/evaluation/evaluator"

// Registry defines the evaluator registry interface.
type Registry interface {
    // Register registers an evaluator with the global registry.
    Register(name string, e evaluator.Evaluator) error
    // Get gets an instance by evaluator name.
    Get(name string) (evaluator.Evaluator, error)
}
```
The framework registers the tool trajectory evaluator (metric name tool_trajectory_avg_score, described in detail later) by default. Its scoring rules are:
If the actual tool call sequence is exactly the same as the expected one, a score of 1 is assigned;
If not, a score of 0 is assigned.
For multiple sessions: The final score is calculated by averaging the scores from each session.
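For custom metrics, you can implement the Evaluator interface and register the implementation under the same name as the metric via Registry.Register. The following is a minimal sketch, assuming EvaluateResult and PerInvocationResult live in the evaluator package shown above; the metric name, the per-invocation scoring rule, and the omission of the status fields are illustrative simplifications:

```go
import (
    "context"

    "trpc.group/trpc-go/trpc-agent-go/evaluation/evalset"
    "trpc.group/trpc-go/trpc-agent-go/evaluation/evaluator"
    "trpc.group/trpc-go/trpc-agent-go/evaluation/metric"
)

// customEvaluator is a sketch of a user-defined evaluator. Its Name must match
// the metric name it is registered for.
type customEvaluator struct{}

func (e *customEvaluator) Name() string        { return "custom_score" }
func (e *customEvaluator) Description() string { return "Example custom evaluator." }

func (e *customEvaluator) Evaluate(
    ctx context.Context,
    actuals, expecteds []*evalset.Invocation,
    evalMetric *metric.EvalMetric,
) (*evaluator.EvaluateResult, error) {
    result := &evaluator.EvaluateResult{}
    var total float64
    for i, expected := range expecteds {
        var actual *evalset.Invocation
        if i < len(actuals) {
            actual = actuals[i]
        }
        // Score this invocation with your own comparison logic; here any
        // invocation that produced a final response scores 1, otherwise 0.
        score := 0.0
        if actual != nil && actual.FinalResponse != nil {
            score = 1.0
        }
        total += score
        result.PerInvocationResults = append(result.PerInvocationResults, evaluator.PerInvocationResult{
            ActualInvocation:   actual,
            ExpectedInvocation: expected,
            Score:              score,
            // Status should be filled in with the constants from the status
            // package by comparing score against evalMetric.Threshold.
        })
    }
    if len(expecteds) > 0 {
        result.OverallScore = total / float64(len(expecteds))
    }
    return result, nil
}
```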
EvalResult
The EvalResult module is used to record and manage evaluation result data.
EvalSetResult records the evaluation results of an evaluation set (EvalSetID) and contains multiple EvalCaseResults, which display the execution status and score details of each evaluation case.
```go
import "trpc.group/trpc-go/trpc-agent-go/evaluation/internal/epochtime"

// EvalSetResult represents the overall evaluation result of the evaluation set.
type EvalSetResult struct {
    EvalSetResultID   string               // Unique identifier of the evaluation result.
    EvalSetResultName string               // Evaluation result name.
    EvalSetID         string               // Corresponding evaluation set ID.
    EvalCaseResults   []*EvalCaseResult    // Results of each evaluation case.
    CreationTimestamp *epochtime.EpochTime // Result creation time.
}
```
EvalCaseResult represents the evaluation result of a single evaluation case, including the overall evaluation status, scores for each indicator, and evaluation details for each round of dialogue.
```go
import "trpc.group/trpc-go/trpc-agent-go/evaluation/status"

// EvalCaseResult represents the evaluation result of a single evaluation case.
type EvalCaseResult struct {
    EvalSetID                     string                           // Evaluation set ID.
    EvalID                        string                           // Unique identifier of the case.
    FinalEvalStatus               status.EvalStatus                // Final evaluation status of the case.
    OverallEvalMetricResults      []*EvalMetricResult              // Overall score for each metric.
    EvalMetricResultPerInvocation []*EvalMetricResultPerInvocation // Metric evaluation results per invocation.
    SessionID                     string                           // Session ID generated during the inference phase.
    UserID                        string                           // User ID used during the inference phase.
}
```
EvalMetricResult represents the evaluation result of a specific metric, including the score, status, threshold, and additional information.
```go
import (
    "trpc.group/trpc-go/trpc-agent-go/evaluation/metric/criterion"
    "trpc.group/trpc-go/trpc-agent-go/evaluation/status"
)

// EvalMetricResult represents the evaluation result of a single metric.
type EvalMetricResult struct {
    MetricName string               // Metric name.
    Score      float64              // Actual score.
    EvalStatus status.EvalStatus    // Evaluation status.
    Threshold  float64              // Score threshold.
    Criterion  *criterion.Criterion // Evaluation criterion.
    Details    map[string]any       // Additional information, such as the scoring process or error description.
}
```
EvalMetricResultPerInvocation represents the metric-by-metric evaluation result of a single conversation turn, used to analyze the performance differences of a specific conversation under different metrics.
```go
import "trpc.group/trpc-go/trpc-agent-go/evaluation/evalset"

// EvalMetricResultPerInvocation represents the metric-by-metric evaluation results for a single conversation.
type EvalMetricResultPerInvocation struct {
    ActualInvocation   *evalset.Invocation // Actual conversation executed.
    ExpectedInvocation *evalset.Invocation // Expected conversation result.
    EvalMetricResults  []*EvalMetricResult // Evaluation results for each metric.
}
```
The EvalResult Manager manages the storage, query, and list operations of evaluation results. The interface definition is as follows:
```go
// Manager defines the management interface for evaluation results.
type Manager interface {
    // Save saves the evaluation result and returns the EvalSetResultID.
    Save(ctx context.Context, appName string, evalSetResult *EvalSetResult) (string, error)
    // Get retrieves the specified evaluation result based on the evalSetResultID.
    Get(ctx context.Context, appName, evalSetResultID string) (*EvalSetResult, error)
    // List returns all evaluation result IDs for the specified application.
    List(ctx context.Context, appName string) ([]string, error)
}
```
The framework provides two implementations of the EvalResult Manager:
local: Stores the evaluation results in the local file system. The default file name format is <EvalSetResultID>.evalset_result.json. The default naming convention for EvalSetResultID is <appName>_<EvalSetID>_<UUID>.
inmemory: Stores the evaluation results in memory. All operations ensure a deep copy, which is suitable for debugging and quick verification scenarios.
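The following is a minimal sketch of reading stored results back through a local EvalResult Manager. The import paths follow the Debug Server example later in this document; the application name is illustrative:

```go
import (
    "context"
    "fmt"

    "trpc.group/trpc-go/trpc-agent-go/evaluation/evalresult"
    evalresultlocal "trpc.group/trpc-go/trpc-agent-go/evaluation/evalresult/local"
)

func listResults(ctx context.Context) error {
    mgr := evalresultlocal.New(evalresult.WithBaseDir("./evaldata"))

    // List all result IDs for the application, then load each result.
    ids, err := mgr.List(ctx, "math-app")
    if err != nil {
        return err
    }
    for _, id := range ids {
        res, err := mgr.Get(ctx, "math-app", id)
        if err != nil {
            return err
        }
        fmt.Printf("result %s: %d case(s)\n", res.EvalSetResultID, len(res.EvalCaseResults))
    }
    return nil
}
```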
Service
Service is an evaluation service that integrates the following modules:
EvalSet
Metric
Registry
Evaluator
EvalSetResult
The Service interface defines the complete evaluation process, including the inference and evaluation phases. The interface definition is as follows:
```go
import (
    "context"

    "trpc.group/trpc-go/trpc-agent-go/evaluation/evalresult"
    "trpc.group/trpc-go/trpc-agent-go/evaluation/evalset"
)

// Service defines the core interface of the evaluation service.
type Service interface {
    // Inference performs inference, calls the Agent to process the specified evaluation cases,
    // and returns the inference results.
    Inference(ctx context.Context, request *InferenceRequest) ([]*InferenceResult, error)
    // Evaluate evaluates the inference results, then generates and persists the evaluation result.
    Evaluate(ctx context.Context, request *EvaluateRequest) (*evalresult.EvalSetResult, error)
}
```
The framework provides a default local implementation of the Service interface: it invokes the Agent locally to perform both inference and evaluation.
Inference
The inference phase is responsible for running the agent and capturing the actual responses to the test cases.
The input is InferenceRequest, and the output is a list of InferenceResult.
```go
// InferenceRequest represents an inference request.
type InferenceRequest struct {
    AppName     string   // Application name.
    EvalSetID   string   // Evaluation set ID.
    EvalCaseIDs []string // List of evaluation case IDs to be inferred.
}
```
Description:
AppName specifies the application name.
EvalSetID specifies the evaluation set.
EvalCaseIDs specifies the list of evaluation cases to run. If left empty, all cases in the evaluation set are processed by default.
During the inference phase, the system sequentially reads the Invocation of each evaluation case, uses the UserContent as user input to invoke the agent, and records the agent's response.
```go
import (
    "trpc.group/trpc-go/trpc-agent-go/evaluation/evalset"
    "trpc.group/trpc-go/trpc-agent-go/evaluation/status"
)

// InferenceResult represents the inference result of a single evaluation case.
type InferenceResult struct {
    AppName      string                // Application name.
    EvalSetID    string                // Evaluation set ID.
    EvalCaseID   string                // Evaluation case ID.
    Inferences   []*evalset.Invocation // Invocations produced by the actual inference.
    SessionID    string                // Session ID for the inference phase.
    Status       status.EvalStatus     // Inference status.
    ErrorMessage string                // Error message if inference fails.
}
```
Note:
Each InferenceResult corresponds to an EvalCase.
Since an evaluation set may contain multiple evaluation cases, Inference returns a list of InferenceResults.
Evaluate
The evaluation phase evaluates inference results. Its input is EvaluateRequest, and its output is the evaluation result EvalSetResult.
```go
import "trpc.group/trpc-go/trpc-agent-go/evaluation/metric"

// EvaluateRequest represents an evaluation request.
type EvaluateRequest struct {
    AppName          string             // Application name.
    EvalSetID        string             // Evaluation set ID.
    InferenceResults []*InferenceResult // Inference phase results.
    EvaluateConfig   *EvaluateConfig    // Evaluation configuration.
}

// EvaluateConfig represents the configuration for the evaluation phase.
type EvaluateConfig struct {
    EvalMetrics []*metric.EvalMetric // Metric set to be evaluated.
}
```
Description:
The framework will call the corresponding evaluator based on the configured EvalMetrics to perform evaluation and scoring.
Each metric result will be aggregated into the final EvalSetResult.
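The following is a minimal sketch of driving both phases through the Service interface. How the concrete local Service is constructed is not shown here, so it is taken as a parameter; the snippet assumes it lives alongside the request types shown above (otherwise qualify them with the appropriate package prefix), and the names and metric are illustrative:

```go
import (
    "context"

    "trpc.group/trpc-go/trpc-agent-go/evaluation/evalresult"
    "trpc.group/trpc-go/trpc-agent-go/evaluation/metric"
)

func runEvaluation(ctx context.Context, svc Service) (*evalresult.EvalSetResult, error) {
    // Phase 1: run inference for every case in the evaluation set.
    inferences, err := svc.Inference(ctx, &InferenceRequest{
        AppName:   "math-app",
        EvalSetID: "basic-math",
        // EvalCaseIDs left empty: all cases in the set are inferred.
    })
    if err != nil {
        return nil, err
    }

    // Phase 2: score the inference results against the configured metrics.
    return svc.Evaluate(ctx, &EvaluateRequest{
        AppName:          "math-app",
        EvalSetID:        "basic-math",
        InferenceResults: inferences,
        EvaluateConfig: &EvaluateConfig{
            EvalMetrics: []*metric.EvalMetric{{
                MetricName: "tool_trajectory_avg_score",
                Threshold:  1.0,
            }},
        },
    })
}
```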
AgentEvaluator
AgentEvaluator evaluates an agent based on the configured evaluation set EvalSetID.
```go
// AgentEvaluator evaluates an agent based on an evaluation set.
type AgentEvaluator interface {
    // Evaluate evaluates the specified evaluation set.
    Evaluate(ctx context.Context, evalSetID string) (*EvaluationResult, error)
}
```
EvaluationResult represents the final result of a complete evaluation task, including the overall evaluation status, execution time, and a summary of the results of all evaluation cases.
```go
import (
    "time"

    "trpc.group/trpc-go/trpc-agent-go/evaluation/status"
)

// EvaluationResult contains the aggregated results of multiple evaluation runs.
type EvaluationResult struct {
    AppName       string                  // Application name.
    EvalSetID     string                  // Corresponding evaluation set ID.
    OverallStatus status.EvalStatus       // Overall evaluation status.
    ExecutionTime time.Duration           // Execution duration.
    EvalCases     []*EvaluationCaseResult // Results of each evaluation case.
}
```
EvaluationCaseResult aggregates the results of multiple runs of a single evaluation case, including the overall evaluation status, detailed results of each run, and metric-level statistics.
```go
import (
    "trpc.group/trpc-go/trpc-agent-go/evaluation/evalresult"
    "trpc.group/trpc-go/trpc-agent-go/evaluation/status"
)

// EvaluationCaseResult summarizes the results of a single evaluation case across multiple executions.
type EvaluationCaseResult struct {
    EvalCaseID      string                         // Evaluation case ID.
    OverallStatus   status.EvalStatus              // Overall evaluation status.
    EvalCaseResults []*evalresult.EvalCaseResult   // Individual run results.
    MetricResults   []*evalresult.EvalMetricResult // Metric-level results.
}
```
An AgentEvaluator instance can be created using evaluation.New. By default, it uses the local implementations of the EvalSet Manager, Metric Manager, and EvalResult Manager.
Because the Agent's execution process may be uncertain, evaluation.WithNumRuns provides a mechanism for multiple evaluation runs to reduce the randomness of a single run.
The default number of runs is 1;
By specifying evaluation.WithNumRuns(n), each evaluation case can be run multiple times;
The final result aggregates the statistics of all runs; by default, the evaluation scores of the runs are averaged, as in the following sketch.
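A minimal sketch of creating and running an AgentEvaluator. The application and evaluation set names are illustrative, and the runner parameter is assumed to be the framework Runner that drives the agent (its construction is omitted):

```go
import (
    "context"
    "fmt"

    "trpc.group/trpc-go/trpc-agent-go/evaluation"
    "trpc.group/trpc-go/trpc-agent-go/runner"
)

func evaluateAgent(ctx context.Context, r runner.Runner) error {
    // Use the default local managers; run each case 3 times to smooth out randomness.
    agentEvaluator, err := evaluation.New("math-app", r, evaluation.WithNumRuns(3))
    if err != nil {
        return err
    }

    result, err := agentEvaluator.Evaluate(ctx, "basic-math")
    if err != nil {
        return err
    }
    fmt.Printf("overall status: %v, evaluated cases: %d\n", result.OverallStatus, len(result.EvalCases))
    return nil
}
```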
Usage Guide
Debug Server Integration
Debug Server bundles evaluation management and run endpoints so you can drive visual evaluation flows from ADK Web/AG UI.
```go
import (
    "net/http"

    "trpc.group/trpc-go/trpc-agent-go/agent"
    "trpc.group/trpc-go/trpc-agent-go/evaluation/evalresult"
    evalresultlocal "trpc.group/trpc-go/trpc-agent-go/evaluation/evalresult/local"
    "trpc.group/trpc-go/trpc-agent-go/evaluation/evalset"
    evalsetlocal "trpc.group/trpc-go/trpc-agent-go/evaluation/evalset/local"
    "trpc.group/trpc-go/trpc-agent-go/evaluation/metric"
    metriclocal "trpc.group/trpc-go/trpc-agent-go/evaluation/metric/local"
    debugserver "trpc.group/trpc-go/trpc-agent-go/server/debug"
)

// myAgent is the agent to evaluate (construction omitted).
agents := map[string]agent.Agent{
    "math-app": myAgent,
}

srv := debugserver.New(
    agents,
    debugserver.WithEvalSetManager(evalsetlocal.New(evalset.WithBaseDir("./evaldata"))),
    debugserver.WithEvalResultManager(evalresultlocal.New(evalresult.WithBaseDir("./evaldata"))),
    debugserver.WithMetricManager(metriclocal.New(metric.WithBaseDir("./evaldata"))),
)

// Debug Server returns an http.Handler; register it on your HTTP server.
_ = http.ListenAndServe(":8000", srv.Handler())
```
EvalSet File
The default path for the evaluation set file is ./<AppName>/<EvalSetID>.evalset.json. You can use WithBaseDir to set a custom BaseDir, in which case the file path becomes <BaseDir>/<AppName>/<EvalSetID>.evalset.json.
In addition, if the default path structure does not meet your requirements, you can customize the file path rules by implementing the Locator interface. The interface definition is as follows:
```go
// Locator defines the path generation and enumeration logic for evaluation set files.
type Locator interface {
    // Build builds the evaluation set file path for the specified appName and evalSetID.
    Build(baseDir, appName, evalSetID string) string
    // List lists all evaluation set IDs under the specified appName.
    List(baseDir, appName string) ([]string, error)
}
```
For example, set the evaluation set file format to custom-<EvalSetID>.evalset.json.
```go
import (
    "errors"
    "os"
    "path/filepath"
    "strings"

    "trpc.group/trpc-go/trpc-agent-go/evaluation"
    "trpc.group/trpc-go/trpc-agent-go/evaluation/evalset"
    evalsetlocal "trpc.group/trpc-go/trpc-agent-go/evaluation/evalset/local"
)

evalSetManager := evalsetlocal.New(evalset.WithLocator(&customLocator{}))
agentEvaluator, err := evaluation.New(appName, runner, evaluation.WithEvalSetManager(evalSetManager))

type customLocator struct{}

// Build returns the custom file path format: <BaseDir>/<AppName>/custom-<EvalSetID>.evalset.json.
func (l *customLocator) Build(baseDir, appName, evalSetID string) string {
    return filepath.Join(baseDir, appName, "custom-"+evalSetID+".evalset.json")
}

// List lists all evaluation set IDs under the specified app.
func (l *customLocator) List(baseDir, appName string) ([]string, error) {
    dir := filepath.Join(baseDir, appName)
    entries, err := os.ReadDir(dir)
    if err != nil {
        if errors.Is(err, os.ErrNotExist) {
            return []string{}, nil
        }
        return nil, err
    }
    var results []string
    for _, entry := range entries {
        if entry.IsDir() {
            continue
        }
        if strings.HasSuffix(entry.Name(), ".evalset.json") {
            name := strings.TrimPrefix(entry.Name(), "custom-")
            name = strings.TrimSuffix(name, ".evalset.json")
            results = append(results, name)
        }
    }
    return results, nil
}
```
Metric File
The default path for the metrics file is ./<AppName>/<EvalSetID>.metrics.json.
You can use WithBaseDir to set a custom BaseDir, meaning the file path will be <BaseDir>/<AppName>/<EvalSetID>.metrics.json.
In addition, if the default path structure does not meet your requirements, you can customize the file path rules by implementing the Locator interface. The interface definition is as follows:
```go
// Locator defines the path generation logic for evaluation metric files.
type Locator interface {
    // Build builds the evaluation metric file path for the specified appName and evalSetID.
    Build(baseDir, appName, evalSetID string) string
}
```
For example, set the evaluation metric file name format to custom-<EvalSetID>.metrics.json.
```go
import (
    "path/filepath"

    "trpc.group/trpc-go/trpc-agent-go/evaluation"
    "trpc.group/trpc-go/trpc-agent-go/evaluation/metric"
    metriclocal "trpc.group/trpc-go/trpc-agent-go/evaluation/metric/local"
)

metricManager := metriclocal.New(metric.WithLocator(&customLocator{}))
agentEvaluator, err := evaluation.New(appName, runner, evaluation.WithMetricManager(metricManager))

type customLocator struct{}

// Build returns the custom file path format: <BaseDir>/<AppName>/custom-<EvalSetID>.metrics.json.
func (l *customLocator) Build(baseDir, appName, evalSetID string) string {
    return filepath.Join(baseDir, appName, "custom-"+evalSetID+".metrics.json")
}
```
EvalResult File
The default path for the evaluation result file is ./<AppName>/<EvalSetResultID>.evalresult.json.
You can set a custom BaseDir using WithBaseDir. For example, the file path will be <BaseDir>/<AppName>/<EvalSetResultID>.evalresult.json. The default naming convention for EvalSetResultID is <appName>_<EvalSetID>_<UUID>.
In addition, if the default path structure does not meet your requirements, you can customize the file path rules by implementing the Locator interface. The interface definition is as follows:
```go
// Locator defines the path generation and enumeration logic for evaluation result files.
type Locator interface {
    // Build builds the evaluation result file path for the specified appName and evalSetResultID.
    Build(baseDir, appName, evalSetResultID string) string
    // List lists all evaluation result IDs under the specified appName.
    List(baseDir, appName string) ([]string, error)
}
```
For example, set the evaluation result file format to custom-<EvalSetResultID>.evalresult.json.
```go
import (
    "errors"
    "os"
    "path/filepath"
    "strings"

    "trpc.group/trpc-go/trpc-agent-go/evaluation"
    "trpc.group/trpc-go/trpc-agent-go/evaluation/evalresult"
    evalresultlocal "trpc.group/trpc-go/trpc-agent-go/evaluation/evalresult/local"
)

evalResultManager := evalresultlocal.New(evalresult.WithLocator(&customLocator{}))
agentEvaluator, err := evaluation.New(appName, runner, evaluation.WithEvalResultManager(evalResultManager))

type customLocator struct{}

// Build returns the custom file path format: <BaseDir>/<AppName>/custom-<EvalSetResultID>.evalresult.json.
func (l *customLocator) Build(baseDir, appName, evalSetResultID string) string {
    return filepath.Join(baseDir, appName, "custom-"+evalSetResultID+".evalresult.json")
}

// List lists all evaluation result IDs under the specified app.
func (l *customLocator) List(baseDir, appName string) ([]string, error) {
    dir := filepath.Join(baseDir, appName)
    entries, err := os.ReadDir(dir)
    if err != nil {
        if errors.Is(err, os.ErrNotExist) {
            return []string{}, nil
        }
        return nil, err
    }
    var results []string
    for _, entry := range entries {
        if entry.IsDir() {
            continue
        }
        if strings.HasSuffix(entry.Name(), ".evalresult.json") {
            name := strings.TrimPrefix(entry.Name(), "custom-")
            name = strings.TrimSuffix(name, ".evalresult.json")
            results = append(results, name)
        }
    }
    return results, nil
}
```
Evaluation Criterion
The evaluation criterion describes the specific evaluation method and can be combined as needed.
The framework has the following built-in types of evaluation criteria:
| Criterion Type | Applicable Object |
| --- | --- |
| TextCriterion | Text string |
| JSONCriterion | JSON object, usually used to compare map[string]any |
| ToolTrajectoryCriterion | Tool invocation trajectory |
| Criterion | Aggregation of multiple criteria |
TextCriterion
TextCriterion is used for string matching and can be configured to ignore case and to use a specific matching strategy (for example, the exact strategy used in the examples below).
JSONCriterion
JSONCriterion defines the matching method for JSON objects; it can skip matching, select a matching strategy, or use a custom comparison function.
```go
// JSONCriterion defines the matching method for JSON objects.
type JSONCriterion struct {
    Ignore        bool                                                 // Whether to skip matching.
    MatchStrategy JSONMatchStrategy                                    // Matching strategy.
    Compare       func(actual, expected map[string]any) (bool, error)  // Custom comparison.
}
```
Explanation of JSONMatchStrategy values:
| JSONMatchStrategy Value | Description |
| --- | --- |
| exact | The actual JSON is exactly the same as the expected JSON (default). |
ToolTrajectoryCriterion
ToolTrajectoryCriterion is used to configure the evaluation criteria for tool invocations and responses. You can set default strategies, customize strategies by tool name, and control whether to ignore the invocation order.
```go
// ToolTrajectoryCriterion defines the evaluation criteria for tool invocations and responses.
type ToolTrajectoryCriterion struct {
    DefaultStrategy  *ToolTrajectoryStrategy                                   // Default strategy.
    ToolStrategy     map[string]*ToolTrajectoryStrategy                        // Customized strategies by tool name.
    OrderInsensitive bool                                                      // Whether to ignore invocation order.
    Compare          func(actual, expected *evalset.Invocation) (bool, error)  // Custom comparison.
}

// ToolTrajectoryStrategy defines the matching strategy for a single tool.
type ToolTrajectoryStrategy struct {
    Name      *TextCriterion // Tool name matching.
    Arguments *JSONCriterion // Invocation arguments matching.
    Response  *JSONCriterion // Tool response matching.
}
```
DefaultStrategy is used to configure the global default evaluation criterion and applies to all tools.
ToolStrategy overrides the evaluation criterion for specific tools by tool name. When ToolStrategy is not set, all tool invocations use DefaultStrategy.
If no evaluation criterion is configured, the framework uses the default evaluation criterion: tool names are compared using TextCriterion with the exact strategy, and arguments and responses are compared using JSONCriterion with the exact strategy. This ensures that tool trajectory evaluation always has a reasonable fallback behavior.
The following example illustrates a typical scenario: for most tools you want strict alignment of tool invocations and results, but for time-related tools such as current_time, the response value itself is unstable. Therefore, you only need to check whether the correct tool and arguments were invoked as expected, without requiring the time value itself to be exactly the same.
```go
import (
    "trpc.group/trpc-go/trpc-agent-go/evaluation/metric/criterion"
    "trpc.group/trpc-go/trpc-agent-go/evaluation/metric/criterion/json"
    "trpc.group/trpc-go/trpc-agent-go/evaluation/metric/criterion/text"
    "trpc.group/trpc-go/trpc-agent-go/evaluation/metric/criterion/tooltrajectory"
)

crit := criterion.New(
    criterion.WithToolTrajectory(
        tooltrajectory.New(
            // Default strategy: tool name, arguments, and response must match exactly.
            tooltrajectory.WithDefault(&tooltrajectory.ToolTrajectoryStrategy{
                Name:      &text.TextCriterion{MatchStrategy: text.TextMatchStrategyExact},
                Arguments: &json.JSONCriterion{MatchStrategy: json.JSONMatchStrategyExact},
                Response:  &json.JSONCriterion{MatchStrategy: json.JSONMatchStrategyExact},
            }),
            // Override for current_time: still match the tool name and arguments exactly,
            // but ignore this tool's response.
            tooltrajectory.WithTool(map[string]*tooltrajectory.ToolTrajectoryStrategy{
                "current_time": {
                    Name:      &text.TextCriterion{MatchStrategy: text.TextMatchStrategyExact},
                    Arguments: &json.JSONCriterion{MatchStrategy: json.JSONMatchStrategyExact},
                    Response:  &json.JSONCriterion{Ignore: true}, // Ignore matching of this tool's response.
                },
            }),
        ),
    ),
)
```
By default, tool invocations are compared one by one in the order in which they appear. The actual tool invocation sequence and the expected tool invocation sequence must match in length, order, and in the tool name, arguments, and response at each step. If the invocation order is different, the evaluation will be considered as failed.
OrderInsensitive controls whether the tool invocation order is ignored. When enabled, the evaluation logic first generates a sorting key for each tool invocation (composed of the tool name and the normalized representation of arguments and response). It then sorts the actual invocation sequence and the expected invocation sequence by this key, producing two invocation lists with stable order. Next, it compares the corresponding invocations in the sorted lists one by one and determines whether they match according to the configured evaluation criteria. Put simply, as long as the tool invocations on both sides are identical in content, the evaluation will not fail due to differences in the original invocation order, as in the following example.
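A minimal sketch of enabling order-insensitive matching; the criterion is constructed directly as a struct literal here rather than through criterion.New, and the tool names in the comment are illustrative:

```go
import (
    "trpc.group/trpc-go/trpc-agent-go/evaluation/metric/criterion"
    "trpc.group/trpc-go/trpc-agent-go/evaluation/metric/criterion/tooltrajectory"
)

// With OrderInsensitive enabled, an expected trajectory of
// [get_weather, current_time] and an actual trajectory of
// [current_time, get_weather] still match, provided each invocation's
// name, arguments, and response satisfy the configured strategies.
crit := &criterion.Criterion{
    ToolTrajectory: &tooltrajectory.ToolTrajectoryCriterion{
        OrderInsensitive: true,
    },
}
```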
The metric name corresponding to the tool trajectory evaluator is tool_trajectory_avg_score. It is used to evaluate whether the Agent’s use of tools across multiple conversations conforms to expectations.
In a single conversation, the evaluator compares the actual tool invocation trajectory with the expected trajectory using ToolTrajectoryCriterion:
If the entire tool invocation trajectory satisfies the evaluation criterion, the score of this conversation on this metric is 1.
If any step of the invocation does not satisfy the evaluation criterion, the score of this conversation on this metric is 0.
In the scenario of multiple conversations, the evaluator takes the average of the scores of all conversations on this metric as the final tool_trajectory_avg_score, and compares it with EvalMetric.Threshold to determine whether the result is pass or fail.
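For example, if an evaluation case contains three conversations that score 1, 0, and 1 on this metric, the final tool_trajectory_avg_score is (1 + 0 + 1) / 3 ≈ 0.67; with a threshold of 1.0, the metric is judged as failed.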
A typical way to combine the tool trajectory evaluator with Metric and Criterion is as follows:
```go
import (
    "trpc.group/trpc-go/trpc-agent-go/evaluation/metric"
    "trpc.group/trpc-go/trpc-agent-go/evaluation/metric/criterion"
    ctooltrajectory "trpc.group/trpc-go/trpc-agent-go/evaluation/metric/criterion/tooltrajectory"
)

evalMetric := &metric.EvalMetric{
    MetricName: "tool_trajectory_avg_score",
    Threshold:  1.0,
    Criterion: criterion.New(
        criterion.WithToolTrajectory(
            // Use the default evaluation criterion; tool name, arguments, and response must be strictly identical.
            ctooltrajectory.New(),
        ),
    ),
}
```