Document Source Configuration
Example Code: examples/knowledge/sources
The source module provides various document source types, each supporting rich configuration options.
Supported Document Source Types
| Source Type | Description | Example |
|---|---|---|
| File Source (file) | Single file processing | Example |
| Directory Source (dir) | Batch directory processing | Example |
| Repo Source (repo) | Git repository / local repo directory | AST Example |
| URL Source (url) | Fetch content from web pages | Example |
| Auto Source (auto) | Intelligent type detection | Example |
File Source
Single file processing, supports .txt, .md, .json, .doc, .csv, and other formats:
Directory Source
Batch directory processing with recursive and filtering support:
URL Source
Fetch content from web pages and APIs:
URL Source Advanced Configuration
Separate content fetching and document identification:
Note: When using WithContentFetchingURL, the identifier URL should retain the file information from the content fetching URL. For example:
- Correct: the identifier URL is https://trpc-go.com/docs/api.md and the fetching URL is https://github.com/.../docs/api.md
- Incorrect: the identifier URL is https://trpc-go.com, which loses the document path information
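The rule in the note can be checked mechanically. The sketch below is a standalone helper, not part of the framework: retainsFileName is a hypothetical name, and comparing only the final path segment is a simplifying assumption about what "file information" means.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
)

// retainsFileName reports whether identifierURL keeps the file name found in
// fetchURL. Hypothetical helper for illustration; not a framework API.
func retainsFileName(identifierURL, fetchURL string) (bool, error) {
	id, err := url.Parse(identifierURL)
	if err != nil {
		return false, err
	}
	fetch, err := url.Parse(fetchURL)
	if err != nil {
		return false, err
	}
	name := path.Base(fetch.Path)
	if name == "." || name == "/" {
		return false, nil // the fetching URL carries no file name to retain
	}
	return path.Base(id.Path) == name, nil
}

func main() {
	// example.com is a placeholder host used only for this sketch.
	fetch := "https://example.com/raw/docs/api.md"
	ok, _ := retainsFileName("https://trpc-go.com/docs/api.md", fetch)
	fmt.Println(ok) // true
	ok, _ = retainsFileName("https://trpc-go.com", fetch)
	fmt.Println(ok) // false
}
```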
Auto Source
Detects the input type and automatically selects the matching processor:
Repo Source
The repo source targets code repository scenarios, suited for:
- Loading a remote Git URL directly
- Loading a locally checked-out repository directory
- Uniformly processing Go / Python / Proto / Markdown and other content within a single repository
Current open-source status: AST-aware code parsing is currently open-sourced for Go, Python, and Proto / PB. Support for C++, JavaScript, and other languages is being progressively open-sourced. For languages not yet open-sourced, the repo source can still process text files via plain document readers, but without AST-level semantic entities.
Typical Use Cases
- Loading a remote Git repository to build a code knowledge base
- Loading a local repository restricted to a specific subdirectory
- Unified ingest of Go + Python + Markdown (and other supported types) within a single repository
Basic Usage
The Go AST reader and Python AST reader are optional modules that require blank imports for registration:
- Scanning .go files → knowledge/document/reader/golang
- Scanning .py files → knowledge/document/reader/python
The Proto reader is registered by default and needs no extra import.
Note: The Python reader uses an embedded Python script for AST parsing. It requires Python 3.9+ installed on the system (it only uses the standard library ast module, with no third-party dependencies).
Repository Struct
Repository describes a single repository input with independent version and scope configuration:
| Field | Description |
|---|---|
| URL | Remote Git repository URL |
| Dir | Local repository directory |
| Branch | Target branch |
| Tag | Target tag |
| Commit | Target commit |
| Subdir | Scan only a subdirectory within the repository |
| RepoName | Custom repository name |
| RepoURL | Custom repository URL (overrides auto-detection) |
URL and Dir are mutually exclusive. A single repo.Source processes only one repository input.
Version Selection Priority
When multiple version fields are set, the priority is:
Commit > Tag > Branch
That is, if both Commit and Branch are provided, Commit is checked out.
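The priority rule can be written as a small selection function. This is a sketch under assumptions: the Repository type here is a local stand-in that mirrors only the three version fields, resolveRef is a hypothetical name, and what the framework does when no version field is set is not modeled.

```go
package main

import "fmt"

// Repository mirrors only the version fields described above.
// It is a local stand-in for illustration, not the framework's type.
type Repository struct {
	Branch string
	Tag    string
	Commit string
}

// resolveRef returns the ref that wins under the documented priority
// (Commit, then Tag, then Branch) together with its kind.
// An empty kind means no version field was set.
func resolveRef(r Repository) (kind, ref string) {
	switch {
	case r.Commit != "":
		return "commit", r.Commit
	case r.Tag != "":
		return "tag", r.Tag
	case r.Branch != "":
		return "branch", r.Branch
	}
	return "", ""
}

func main() {
	kind, ref := resolveRef(Repository{Branch: "main", Commit: "abc1234"})
	fmt.Println(kind, ref) // commit abc1234
}
```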
Scan Scope Control
- WithFileExtensions: controls which file extensions are scanned
- WithSkipDirs: controls which directory names are skipped
- WithSkipSuffixes: controls which file suffixes are skipped
- Repository.Subdir: restricts scanning to a subdirectory within the repository
Example: scan only Go and Markdown files under server/:
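To make the scope rules concrete, the following self-contained sketch models them as a predicate over repo-relative paths. The scanScope type, the order of the checks, and the leading-dot extension format are assumptions for illustration; this is not the framework's implementation.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// scanScope is a hypothetical stand-in for the four scope controls.
type scanScope struct {
	subdir       string   // plays the role of Repository.Subdir
	extensions   []string // plays the role of WithFileExtensions
	skipDirs     []string // plays the role of WithSkipDirs (directory names)
	skipSuffixes []string // plays the role of WithSkipSuffixes
}

// includes reports whether a repo-relative, slash-separated path is in scope.
func (s scanScope) includes(relPath string) bool {
	relPath = path.Clean(relPath)
	if s.subdir != "" {
		prefix := strings.Trim(path.Clean(s.subdir), "/") + "/"
		if !strings.HasPrefix(relPath, prefix) {
			return false
		}
	}
	// Reject the file if any parent directory name is on the skip list.
	for _, dir := range strings.Split(path.Dir(relPath), "/") {
		for _, skip := range s.skipDirs {
			if dir == skip {
				return false
			}
		}
	}
	for _, suffix := range s.skipSuffixes {
		if strings.HasSuffix(relPath, suffix) {
			return false
		}
	}
	if len(s.extensions) == 0 {
		return true // no extension filter configured
	}
	ext := path.Ext(relPath)
	for _, allowed := range s.extensions {
		if ext == allowed {
			return true
		}
	}
	return false
}

func main() {
	scope := scanScope{subdir: "server", extensions: []string{".go", ".md"}}
	paths := []string{"server/handler.go", "server/README.md", "server/config.yaml", "client/main.go"}
	for _, p := range paths {
		fmt.Println(p, scope.includes(p))
	}
}
```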
Metadata
The repo source enriches documents produced by readers with repository-level metadata:
| Metadata Key | Description |
|---|---|
| trpc_agent_go_source=repo | Document originates from a repo source |
| trpc_agent_go_repo_path | Local root directory of the cloned repository |
| trpc_ast_repo_name | Repository name |
| trpc_ast_repo_url | Repository URL |
| trpc_ast_branch | Version identifier being parsed (branch/tag/commit) |
| trpc_ast_file_path | Repo-relative file path |
Notes:
- trpc_ast_file_path represents the logical path within the repository, not a remote Git URL.
- For Git URL inputs, the repo source first clones to a temporary directory, then writes the repo-relative path into trpc_ast_file_path.
Relation to AST Readers
The repo source does not parse code itself; it dispatches to the appropriate reader based on file type:
- .go → Go AST reader
- .py → Python AST reader
- .proto → Proto AST reader
- .md → Markdown reader
- Other registered extensions → corresponding reader
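Dispatch of this kind amounts to a lookup keyed by file extension. A minimal sketch, assuming a plain map and case-insensitive matching; the reader labels are descriptive strings for this example, not framework identifiers, and readerFor is a hypothetical name.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// readers maps a file extension to a reader label.
// The labels are descriptive strings, not framework identifiers.
var readers = map[string]string{
	".go":    "Go AST reader",
	".py":    "Python AST reader",
	".proto": "Proto AST reader",
	".md":    "Markdown reader",
}

// readerFor picks a reader by extension. Matching case-insensitively is an
// assumption of this sketch. ok is false when no reader is registered.
func readerFor(filePath string) (name string, ok bool) {
	name, ok = readers[strings.ToLower(path.Ext(filePath))]
	return name, ok
}

func main() {
	for _, f := range []string{"server/main.go", "api/user.proto", "assets/logo.png"} {
		name, ok := readerFor(f)
		fmt.Println(f, name, ok)
	}
}
```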
Parsed Output Example
For AST-aware files (.go / .py / .proto), the repo source chunks code by semantic entity. Each chunk contains three layers:
- content: A semantically complete code fragment (e.g., a full struct/class/function definition), not character-truncated text
- embedding text: A structured summary (name / signature / comment, etc.) optimized for vector retrieval
- metadata: trpc_ast_* fields (type / full_name / language / file_path, etc.) for precise filtering and locating
Below is an example chunk for a Go struct:
For Python files, chunking follows the same approach at Class / Function / Method granularity; for .proto files, it chunks by service / rpc / message / enum.
Code Graph (GraphRAG)
Example Code: examples/knowledge/features/graphrag
Beyond vector retrieval (embedding + vector store), the repo source also supports storing structural code relationships as a graph. AST readers extract edges between entities during parsing. Combined with a graph database (Apache AGE), this enables structural code navigation.

Edge Types
| Edge Type | Meaning | Example |
|---|---|---|
| CALLS | Function/method call | main → server.Start |
| METHOD | Method of a class/struct | Server → Server.Start |
| FIELD | Struct field | Server → services |
| PARAM | Function parameter | NewServer → opts |
| RETURNS | Function return type | NewServer → Server |
| INHERITS | Inheritance/implementation | MyRunner → BaseRunner |
| CONTAINS | Containment | package server → Server |
Usage
Agent Graph Tools
| Tool | Function |
|---|---|
| code_graph_search | Vector search for AST nodes, returns matching code entities |
| code_graph_traverse | Traverse related nodes from a given start node along edges (e.g., find all callers of a function) |
| code_graph_find_paths | Find paths between two code entities (e.g., trace a call chain) |
By locating entry nodes through vector search and then exploring structural relationships via graph traversal, the Agent can understand code architecture without reading large amounts of source code.
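The two structural operations, finding callers and tracing a path, can be illustrated on a tiny in-memory edge list. This sketch stands in for the graph database and does not use the agent tools themselves; the edge type, callersOf, and findPath are hypothetical names, and the sample edges are made up for the example.

```go
package main

import "fmt"

// edge is a directed, typed relationship between two code entities.
type edge struct {
	from, kind, to string
}

// callersOf returns every entity with a direct CALLS edge into target.
func callersOf(edges []edge, target string) []string {
	var out []string
	for _, e := range edges {
		if e.kind == "CALLS" && e.to == target {
			out = append(out, e.from)
		}
	}
	return out
}

// findPath runs a breadth-first search over all edge types and returns a
// shortest chain of entities from start to goal, or nil if none exists.
// Entity names are assumed to be non-empty.
func findPath(edges []edge, start, goal string) []string {
	next := map[string][]string{}
	for _, e := range edges {
		next[e.from] = append(next[e.from], e.to)
	}
	prev := map[string]string{start: ""} // visited set and back-pointers
	queue := []string{start}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		if cur == goal {
			var chain []string
			for n := goal; n != ""; n = prev[n] {
				chain = append([]string{n}, chain...)
			}
			return chain
		}
		for _, n := range next[cur] {
			if _, seen := prev[n]; !seen {
				prev[n] = cur
				queue = append(queue, n)
			}
		}
	}
	return nil
}

func main() {
	edges := []edge{
		{"main", "CALLS", "NewServer"},
		{"NewServer", "RETURNS", "Server"},
		{"Server", "METHOD", "Server.Start"},
		{"main", "CALLS", "Server.Start"},
	}
	fmt.Println(callersOf(edges, "Server.Start"))           // [main]
	fmt.Println(findPath(edges, "NewServer", "Server.Start")) // [NewServer Server Server.Start]
}
```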
Combined Usage
Chunking Strategy
Example Code: fixed-chunking | recursive-chunking
Chunking is the process of splitting long documents into smaller fragments, which is crucial for vector retrieval. The framework provides multiple built-in chunking strategies and supports custom strategies.
Built-in Chunking Strategies
| Strategy | Description | Use Case |
|---|---|---|
| FixedSizeChunking | Fixed-size chunking | General text, simple and fast |
| RecursiveChunking | Recursive chunking by separator hierarchy | Preserving semantic integrity |
| MarkdownChunking | Chunk by Markdown structure | Markdown documents (default) |
| JSONChunking | Chunk by JSON structure | JSON files (default) |
Default Behavior
Each file type has an associated chunking strategy:
- .md files → MarkdownChunking (recursively chunk by heading levels H1→H6→paragraph→fixed size)
- .json files → JSONChunking (chunk by JSON structure)
- .txt/.csv/.docx etc. → FixedSizeChunking
Default Parameters:
| Parameter | Default | Description |
|---|---|---|
| ChunkSize | 1024 | Maximum characters per chunk |
| Overlap | 128 | Overlapping characters between adjacent chunks |
The default chunking strategies are affected by the chunkSize parameter. The overlap parameter only applies to FixedSizeChunking, RecursiveChunking, and MarkdownChunking. JSONChunking does not support overlap.
Adjust default strategy parameters via WithChunkSize and WithChunkOverlap:
Custom Chunking Strategy
Use WithCustomChunkingStrategy to override the default chunking strategy.
Note: A custom chunking strategy completely overrides the WithChunkSize and WithChunkOverlap configurations. Chunking parameters must be set within the custom strategy.
FixedSizeChunking - Fixed Size Chunking
Splits text by fixed character count with overlap support:
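As a rough illustration of the technique (not the framework's FixedSizeChunking code), the sketch below slides a window of size characters forward by size minus overlap each step. The function name and the choice to count runes rather than bytes are assumptions of this example.

```go
package main

import "fmt"

// fixedSizeChunks splits text into chunks of at most size characters, where
// each chunk after the first starts with the last overlap characters of the
// previous one. Illustrative only; not the framework's implementation.
func fixedSizeChunks(text string, size, overlap int) ([]string, error) {
	if size <= 0 || overlap < 0 || overlap >= size {
		return nil, fmt.Errorf("need size > 0 and 0 <= overlap < size, got %d and %d", size, overlap)
	}
	runes := []rune(text) // count characters, not bytes
	step := size - overlap
	var chunks []string
	for start := 0; start < len(runes); start += step {
		end := start + size
		if end > len(runes) {
			end = len(runes)
		}
		chunks = append(chunks, string(runes[start:end]))
		if end == len(runes) {
			break // the final chunk already reaches the end of the text
		}
	}
	return chunks, nil
}

func main() {
	chunks, err := fixedSizeChunks("abcdefghij", 4, 1)
	if err != nil {
		panic(err)
	}
	fmt.Println(chunks) // [abcd defg ghij]
}
```

Under the same scheme, a chunk size of 1024 with an overlap of 128 advances 896 characters per step.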
RecursiveChunking - Recursive Chunking
Recursively splits by separator hierarchy, preferring natural boundaries:
Separator Priority Explanation:
- \n\n: first, try to split by paragraph
- \n: then split by line
- .: then split by sentence
- (space): finally, split by space
Recursive chunking tries higher-priority separators first and falls back to the next level only when a chunk still exceeds the maximum size. If no separator can bring the text within chunkSize, the text is force-split at chunkSize.
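The fallback behavior can be sketched in a few lines. This is a simplified model, not the framework's RecursiveChunking: it drops separators at chunk boundaries, ignores overlap, and greedily merges adjacent pieces that fit. The function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// runeLen counts characters rather than bytes.
func runeLen(s string) int { return len([]rune(s)) }

// recursiveChunks splits text into pieces of at most size characters. It tries
// seps in order and only moves to the next separator for pieces that are still
// too large; with no separator left it force-splits at size.
func recursiveChunks(text string, size int, seps []string) []string {
	if size <= 0 || text == "" {
		return nil
	}
	if runeLen(text) <= size {
		return []string{text}
	}
	if len(seps) == 0 {
		// No separator left: force-split at size.
		runes := []rune(text)
		var out []string
		for i := 0; i < len(runes); i += size {
			end := i + size
			if end > len(runes) {
				end = len(runes)
			}
			out = append(out, string(runes[i:end]))
		}
		return out
	}
	var out []string
	cur := "" // pieces merged so far that still fit in one chunk
	for _, part := range strings.Split(text, seps[0]) {
		candidate := part
		if cur != "" {
			candidate = cur + seps[0] + part
		}
		if runeLen(candidate) <= size {
			cur = candidate
			continue
		}
		if cur != "" {
			out = append(out, cur)
			cur = ""
		}
		if runeLen(part) <= size {
			cur = part
		} else {
			// Still too large: retry this piece with the next separator.
			out = append(out, recursiveChunks(part, size, seps[1:])...)
		}
	}
	if cur != "" {
		out = append(out, cur)
	}
	return out
}

func main() {
	seps := []string{"\n\n", "\n", ".", " "}
	fmt.Printf("%q\n", recursiveChunks("alpha beta\n\ngamma delta epsilon", 12, seps))
	// ["alpha beta" "gamma delta" "epsilon"]
}
```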
Configuring Metadata
To enable filter functionality, it's recommended to add rich metadata when creating document sources.
For a detailed filter usage guide, refer to the Filter Documentation.
Content Transformer
Example Code: examples/knowledge/features/transform
A Transformer preprocesses content before document chunking and postprocesses it afterwards. This is particularly useful for cleaning text extracted from PDFs, web pages, and other sources by removing excess whitespace, duplicate characters, and other noise.
Processing Flow
Built-in Transformers
CharFilter - Character Filter
Removes specified characters or strings:
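The effect of such a filter can be reproduced with the standard library. The function below is an illustrative stand-in with a hypothetical name; the framework's CharFilter may differ in detail.

```go
package main

import (
	"fmt"
	"strings"
)

// charFilter removes every occurrence of each target string from text.
// Illustrative stand-in; not the framework's CharFilter.
func charFilter(text string, targets ...string) string {
	for _, t := range targets {
		if t == "" {
			continue // nothing to remove
		}
		text = strings.ReplaceAll(text, t, "")
	}
	return text
}

func main() {
	fmt.Printf("%q\n", charFilter("a\tb\r\nc", "\t", "\r")) // "ab\nc"
}
```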
CharDedup - Character Deduplicator
Merges consecutive duplicate characters or strings into a single instance:
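A minimal sketch of the same idea, assuming that "merge" means collapsing any run of a target down to a single copy. The function name is hypothetical, and this is not the framework's CharDedup.

```go
package main

import (
	"fmt"
	"strings"
)

// charDedup collapses consecutive repeats of each target into one instance.
// Illustrative stand-in; not the framework's CharDedup.
func charDedup(text string, targets ...string) string {
	for _, t := range targets {
		if t == "" {
			continue // an empty target has no meaningful repeats
		}
		double := t + t
		// Each pass shortens the text, so the loop terminates.
		for strings.Contains(text, double) {
			text = strings.ReplaceAll(text, double, t)
		}
	}
	return text
}

func main() {
	fmt.Printf("%q\n", charDedup("a   b\n\n\nc", " ", "\n")) // "a b\nc"
}
```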
Usage
Transformers are passed to various document sources via the WithTransformers option:
Combining Multiple Transformers
Multiple transformers are executed in sequence:
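Because each step consumes the previous step's output, order can change the result. The sketch below models transformers as plain functions to show this; the transformer type and the helper names are assumptions for illustration, not the framework's interfaces.

```go
package main

import (
	"fmt"
	"strings"
)

// transformer is a hypothetical stand-in for one text-cleaning step.
type transformer func(string) string

// removeAll returns a step that deletes every occurrence of target.
func removeAll(target string) transformer {
	return func(s string) string { return strings.ReplaceAll(s, target, "") }
}

// collapse returns a step that merges consecutive repeats of target.
func collapse(target string) transformer {
	return func(s string) string {
		if target == "" {
			return s
		}
		for strings.Contains(s, target+target) {
			s = strings.ReplaceAll(s, target+target, target)
		}
		return s
	}
}

// apply runs the steps in the order given, feeding each output to the next.
func apply(text string, steps ...transformer) string {
	for _, step := range steps {
		text = step(text)
	}
	return text
}

func main() {
	raw := "a\r\n\r\nb"
	// Removing "\r" first exposes the doubled newline, which then collapses.
	fmt.Printf("%q\n", apply(raw, removeAll("\r"), collapse("\n"))) // "a\nb"
	// In the reverse order the newlines are not yet adjacent when collapse runs.
	fmt.Printf("%q\n", apply(raw, collapse("\n"), removeAll("\r"))) // "a\n\nb"
}
```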
Typical Use Cases
| Scenario | Recommended Configuration |
|---|---|
| PDF text cleanup | CharDedup(" ", "\n") - Merge excess spaces and newlines from PDF extraction |
| Web content processing | CharFilter("\t") + CharDedup(" ") - Remove tabs and merge spaces |
| Code documentation processing | CharDedup("\n") - Merge excess blank lines, preserve code indentation |
| General text cleanup | CharFilter("\r") + CharDedup(" ", "\n") - Remove carriage returns and merge whitespace |
PDF File Support
The PDF reader depends on third-party libraries, so it lives in a separate Go module with its own go.mod to keep those dependencies out of the main module.
To support PDF file reading, manually import the PDF reader package in your code:
Note: Readers for other formats (.txt/.md/.csv/.json, etc.) are automatically registered and don't need manual import.