Document Source Configuration
Example Code: examples/knowledge/sources
The source module provides various document source types, each supporting rich configuration options.
Supported Document Source Types
| Source Type | Description | Example |
|---|---|---|
| File Source (file) | Single file processing | Example |
| Directory Source (dir) | Batch directory processing | Example |
| Repo Source (repo) | Git repository / local repo directory | AST Example |
| URL Source (url) | Fetch content from web pages | Example |
| Auto Source (auto) | Intelligent type detection | Example |
File Source
Single file processing, supports .txt, .md, .json, .doc, .csv, and other formats:
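A minimal sketch of creating a file source, assuming the `knowledge/source/file` package path and the `New` / `WithName` / `WithMetadataValue` helpers; check `examples/knowledge/sources` for the exact API:

```go
package main

import (
	filesource "trpc.group/trpc-go/trpc-agent-go/knowledge/source/file"
)

func main() {
	// One file source can cover several individual files.
	src := filesource.New(
		[]string{"./docs/intro.md", "./data/faq.json"},
		filesource.WithName("docs"),                      // logical source name
		filesource.WithMetadataValue("category", "docs"), // metadata for filtering
	)
	_ = src
}
```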
Directory Source
Batch directory processing with recursive and filtering support:
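A sketch of a directory source, assuming a `knowledge/source/dir` package mirroring the file source's layout (the extension-filter option name is assumed by analogy with the repo source's `WithFileExtensions`):

```go
package main

import (
	dirsource "trpc.group/trpc-go/trpc-agent-go/knowledge/source/dir"
)

func main() {
	// Recursively ingest a directory; filtering options narrow the scan.
	src := dirsource.New(
		[]string{"./docs"},
		dirsource.WithName("docs-dir"),
		dirsource.WithFileExtensions(".md", ".txt"), // only these extensions
	)
	_ = src
}
```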
URL Source
Fetch content from web pages and APIs:
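A basic URL source sketch, assuming a `knowledge/source/url` package with the same constructor shape as the other sources:

```go
package main

import (
	urlsource "trpc.group/trpc-go/trpc-agent-go/knowledge/source/url"
)

func main() {
	// Each URL becomes one document in the knowledge base.
	src := urlsource.New(
		[]string{"https://en.wikipedia.org/wiki/Large_language_model"},
		urlsource.WithName("web"),
	)
	_ = src
}
```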
URL Source Advanced Configuration
Separate content fetching and document identification:
Note: When using `WithContentFetchingURL`, the identifier URL should retain the file information from the content-fetching URL. For example:
- Correct: identifier URL is `https://trpc-go.com/docs/api.md`, fetching URL is `https://github.com/.../docs/api.md`
- Incorrect: identifier URL is `https://trpc-go.com`, which loses the document path information
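The `WithContentFetchingURL` option is named in this document; the sketch below assumes it accepts a slice of fetch URLs parallel to the identifier URLs (the GitHub URL here is a hypothetical mirror, not a real address):

```go
package main

import (
	urlsource "trpc.group/trpc-go/trpc-agent-go/knowledge/source/url"
)

func main() {
	// The identifier URL (stable, user-facing) keeps the document path;
	// the fetching URL is only used to download the content.
	src := urlsource.New(
		[]string{"https://trpc-go.com/docs/api.md"}, // identifier URL
		urlsource.WithContentFetchingURL([]string{
			"https://github.com/example/site/raw/main/docs/api.md", // hypothetical fetch URL
		}),
	)
	_ = src
}
```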
Auto Source
Intelligent type detection, automatically selects processor:
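A sketch of the auto source, assuming a `knowledge/source/auto` package whose constructor accepts mixed inputs and routes each to the matching processor:

```go
package main

import (
	autosource "trpc.group/trpc-go/trpc-agent-go/knowledge/source/auto"
)

func main() {
	// Mixed inputs: the auto source detects the type of each item.
	src := autosource.New(
		[]string{
			"./docs/guide.md",          // file path → file processing
			"./docs",                   // directory → directory processing
			"https://example.com/post", // URL → web fetching
		},
		autosource.WithName("auto"),
	)
	_ = src
}
```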
Repo Source
The repo source targets code repository scenarios, suited for:
- Loading a remote Git URL directly
- Loading a locally checked-out repository directory
- Uniformly processing Go / Proto / Markdown and other content within a single repository
Current open-source status: AST-aware code parsing is currently open-sourced for Go and Proto/PB. Support for Python, C++, JavaScript, and other languages is being progressively open-sourced. For languages not yet open-sourced, the repo source can still process text files via plain document readers, but without AST-level semantic entities.
Typical Use Cases
- Loading a remote Git repository to build a code knowledge base
- Loading a local repository restricted to a specific subdirectory
- Unified ingest of Go + Markdown (and other supported types) within a single repository
Basic Usage
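A hedged sketch of loading a remote repository, assuming a `knowledge/source/repo` package whose `New` takes a `Repository` value (the constructor shape is an assumption; see the AST example linked above for the real API, and note the repository URL here is hypothetical):

```go
package main

import (
	reposource "trpc.group/trpc-go/trpc-agent-go/knowledge/source/repo"
)

func main() {
	// Load one remote repository pinned to a branch.
	src := reposource.New(
		reposource.Repository{
			URL:    "https://github.com/example/project.git", // hypothetical repo
			Branch: "main",
		},
	)
	_ = src
}
```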
Repository Struct
Repository describes a single repository input with independent version and scope configuration:
| Field | Description |
|---|---|
| URL | Remote Git repository URL |
| Dir | Local repository directory |
| Branch | Target branch |
| Tag | Target tag |
| Commit | Target commit |
| Subdir | Scan only a subdirectory within the repository |
| RepoName | Custom repository name |
| RepoURL | Custom repository URL (overrides auto-detection) |
`URL` and `Dir` are mutually exclusive. A single `repo.Source` processes only one repository input.
Version Selection Priority
When multiple version fields are set, the priority is:
Commit > Tag > Branch
That is, if both Commit and Branch are provided, Commit is checked out.
Scan Scope Control
- `WithFileExtensions` — controls which file extensions are scanned
- `WithSkipDirs` — controls which directory names are skipped
- `WithSkipSuffixes` — controls which file suffixes are skipped
- `Repository.Subdir` — restricts scanning to a subdirectory within the repository
Example: scan only Go and Markdown files under server/:
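Assuming the repo source follows the same constructor shape as the other sources (option placement may differ; consult the AST example for the real API), the scope above might be configured as:

```go
package main

import (
	reposource "trpc.group/trpc-go/trpc-agent-go/knowledge/source/repo"
)

func main() {
	src := reposource.New(
		reposource.Repository{
			Dir:    "./project", // locally checked-out repository
			Subdir: "server",    // restrict scanning to server/
		},
		reposource.WithFileExtensions(".go", ".md"), // only Go and Markdown
		reposource.WithSkipDirs("vendor", "testdata"),
	)
	_ = src
}
```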
Metadata
The repo source enriches documents produced by readers with repository-level metadata:
| Metadata Key | Description |
|---|---|
| trpc_agent_go_source=repo | Document originates from a repo source |
| trpc_agent_go_repo_path | Local root directory of the cloned repository |
| trpc_ast_repo_name | Repository name |
| trpc_ast_repo_url | Repository URL |
| trpc_ast_branch | Version identifier being parsed (branch/tag/commit) |
| trpc_ast_file_path | Repo-relative file path |
Notes:
- `trpc_ast_file_path` represents the logical path within the repository, not a remote Git URL.
- For Git URL inputs, the repo source first clones to a temporary directory, then writes the repo-relative path into `trpc_ast_file_path`.
Relation to AST Readers
The repo source does not parse code itself; it dispatches to the appropriate reader based on file type:
- `.go` → Go AST reader
- `.proto` → Proto AST reader
- `.md` → Markdown reader
- Other registered extensions → corresponding reader
Parsed Output Example
Below is a sample output from chunking a struct definition in a remote Go repository:
The output has three layers:
1. content: Raw chunk content
The content field stores the final text written to the knowledge base. For AST-aware Go / Proto readers, content is not a character-truncated fragment but a semantically complete code entity.
In the example above, the entity is the Server struct, so the content includes the struct comment, type Server struct { ... }, and all field definitions.
2. embedding text: Structured summary for vectorization
embedding text is a compact summary optimized for semantic embedding, retaining fields such as name, full_name, package, signature, comment, and file_path. This helps embeddings focus on "what this entity is, which package it belongs to, and what it does."
3. metadata: Filtering, locating, and display
metadata is primarily used for retrieval filtering, display, and source tracking — not for embedding.
trpc_agent_go_*
Framework-level metadata describing the document origin:
- `trpc_agent_go_source=repo`: the document comes from a repo source
- `trpc_agent_go_file_path`: repo-relative file path
- `trpc_agent_go_repo_path`: local root of the cloned repository
- `trpc_agent_go_uri`: actual file URI
trpc_ast_*
AST semantic metadata describing the code entity:
- `trpc_ast_type=Struct`
- `trpc_ast_full_name`
- `trpc_ast_signature`
- `trpc_ast_language=go`
- `trpc_ast_repo_name` / `trpc_ast_repo_url`
These are used for precise filtering, such as retrieving all Struct types in a given package, or all rpc / message definitions in a proto service.
Combined Usage
Chunking Strategy
Example Code: fixed-chunking | recursive-chunking
Chunking is the process of splitting long documents into smaller fragments, which is crucial for vector retrieval. The framework provides multiple built-in chunking strategies and supports custom strategies.
Built-in Chunking Strategies
| Strategy | Description | Use Case |
|---|---|---|
| FixedSizeChunking | Fixed-size chunking | General text, simple and fast |
| RecursiveChunking | Recursive chunking by separator hierarchy | Preserving semantic integrity |
| MarkdownChunking | Chunk by Markdown structure | Markdown documents (default) |
| JSONChunking | Chunk by JSON structure | JSON files (default) |
Default Behavior
Each file type has an associated chunking strategy:
- `.md` files → MarkdownChunking (recursively chunk by heading levels H1→H6 → paragraph → fixed size)
- `.json` files → JSONChunking (chunk by JSON structure)
- `.txt` / `.csv` / `.docx` etc. → FixedSizeChunking
Default Parameters:
| Parameter | Default | Description |
|---|---|---|
| ChunkSize | 1024 | Maximum characters per chunk |
| Overlap | 128 | Overlapping characters between adjacent chunks |
Default chunking strategies are affected by the `chunkSize` parameter. The `overlap` parameter only applies to FixedSizeChunking, RecursiveChunking, and MarkdownChunking; JSONChunking does not support overlap.
Adjust default strategy parameters via WithChunkSize and WithChunkOverlap:
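Both option names come from this document; the sketch assumes they hang off the file source's option set like the other `With*` helpers:

```go
package main

import (
	filesource "trpc.group/trpc-go/trpc-agent-go/knowledge/source/file"
)

func main() {
	src := filesource.New(
		[]string{"./docs/long-manual.md"},
		filesource.WithChunkSize(512),   // max characters per chunk (default 1024)
		filesource.WithChunkOverlap(64), // overlap between adjacent chunks (default 128)
	)
	_ = src
}
```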
Custom Chunking Strategy
Use WithCustomChunkingStrategy to override the default chunking strategy.
Note: A custom chunking strategy completely overrides the `WithChunkSize` and `WithChunkOverlap` configurations; chunking parameters must be set within the custom strategy itself.
FixedSizeChunking - Fixed Size Chunking
Splits text by fixed character count with overlap support:
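As an illustration of the algorithm (a self-contained sketch, not the framework's actual implementation), a fixed-size splitter with overlap can be written as:

```go
package main

import "fmt"

// fixedSizeChunks splits text into chunks of at most chunkSize runes,
// repeating `overlap` runes between adjacent chunks.
func fixedSizeChunks(text string, chunkSize, overlap int) []string {
	if chunkSize <= 0 || overlap < 0 || overlap >= chunkSize {
		return nil
	}
	runes := []rune(text)
	step := chunkSize - overlap // advance by size minus overlap each step
	var chunks []string
	for start := 0; start < len(runes); start += step {
		end := start + chunkSize
		if end >= len(runes) {
			chunks = append(chunks, string(runes[start:]))
			break
		}
		chunks = append(chunks, string(runes[start:end]))
	}
	return chunks
}

func main() {
	for i, c := range fixedSizeChunks("abcdefghij", 4, 1) {
		fmt.Printf("chunk %d: %q\n", i, c)
	}
	// chunk 0: "abcd", chunk 1: "defg", chunk 2: "ghij"
}
```

Note how the last rune of each chunk reappears as the first rune of the next, preserving context across chunk boundaries.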
RecursiveChunking - Recursive Chunking
Recursively splits by separator hierarchy, preferring natural boundaries:
Separator Priority Explanation:
- `\n\n` — first, try to split by paragraph
- `\n` — then split by line
- `.` — then split by sentence
- ` ` (space) — finally, split by word
Recursive chunking tries higher-priority separators first, falling back to the next level only when a chunk still exceeds the maximum size. If no separator can bring the text within chunkSize, it force-splits at chunkSize.
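The fallback logic can be sketched as follows (illustrative only: the real implementation also merges small neighboring pieces, which this sketch omits):

```go
package main

import (
	"fmt"
	"strings"
)

// recursiveChunks tries separators in priority order; a piece that still
// exceeds maxSize is re-split with the next separator, and force-split
// at maxSize when no separators remain.
func recursiveChunks(text string, seps []string, maxSize int) []string {
	if len([]rune(text)) <= maxSize {
		return []string{text}
	}
	if len(seps) == 0 {
		// Last resort: force split by size.
		runes := []rune(text)
		var out []string
		for start := 0; start < len(runes); start += maxSize {
			end := start + maxSize
			if end > len(runes) {
				end = len(runes)
			}
			out = append(out, string(runes[start:end]))
		}
		return out
	}
	var out []string
	// SplitAfter keeps the separator attached to the preceding piece.
	for _, part := range strings.SplitAfter(text, seps[0]) {
		if part != "" {
			out = append(out, recursiveChunks(part, seps[1:], maxSize)...)
		}
	}
	return out
}

func main() {
	seps := []string{"\n\n", "\n", ". ", " "}
	for _, c := range recursiveChunks("first paragraph\n\nsecond line\nthird", seps, 16) {
		fmt.Printf("%q\n", c)
	}
}
```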
Configuring Metadata
To enable filter functionality, it's recommended to add rich metadata when creating document sources.
For detailed filter usage guide, please refer to Filter Documentation.
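For example, metadata added at source-creation time can later drive filtered retrieval; this sketch assumes the `WithMetadataValue` helper follows the file source's option pattern:

```go
package main

import (
	filesource "trpc.group/trpc-go/trpc-agent-go/knowledge/source/file"
)

func main() {
	// Metadata keys/values become filterable attributes on every
	// chunk produced from this source.
	src := filesource.New(
		[]string{"./docs/api.md"},
		filesource.WithMetadataValue("category", "api"),
		filesource.WithMetadataValue("version", "v1"),
	)
	_ = src
}
```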
Content Transformer
Example Code: examples/knowledge/features/transform
Transformer is used to preprocess and postprocess content before and after document chunking. This is particularly useful for cleaning text extracted from PDFs, web pages, and other sources, removing excess whitespace, duplicate characters, and other noise.
Processing Flow
Built-in Transformers
CharFilter - Character Filter
Removes specified characters or strings:
CharDedup - Character Deduplicator
Merges consecutive duplicate characters or strings into a single instance:
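The semantics of both transformers can be illustrated with plain string operations (a self-contained sketch of the behavior, not the framework's code):

```go
package main

import (
	"fmt"
	"strings"
)

// charFilter removes every occurrence of the given tokens.
func charFilter(text string, tokens ...string) string {
	for _, t := range tokens {
		text = strings.ReplaceAll(text, t, "")
	}
	return text
}

// charDedup merges consecutive repetitions of each token into one.
func charDedup(text string, tokens ...string) string {
	for _, t := range tokens {
		double := t + t
		// Repeat until no doubled token remains, so runs of any
		// length collapse to a single instance.
		for strings.Contains(text, double) {
			text = strings.ReplaceAll(text, double, t)
		}
	}
	return text
}

func main() {
	fmt.Println(charFilter("a\tb\tc", "\t")) // abc
	fmt.Println(charDedup("a  b   c", " ")) // a b c
}
```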
Usage
Transformers are passed to various document sources via the WithTransformers option:
Combining Multiple Transformers
Multiple transformers are executed in sequence:
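The `WithTransformers` option is named in this document; the constructor names `transformer.NewCharFilter` / `transformer.NewCharDedup` and the package path below are hypothetical placeholders, so check the linked transform example for the real ones:

```go
package main

import (
	filesource "trpc.group/trpc-go/trpc-agent-go/knowledge/source/file"
	"trpc.group/trpc-go/trpc-agent-go/knowledge/transformer" // hypothetical path
)

func main() {
	src := filesource.New(
		[]string{"./scraped/page.txt"},
		filesource.WithTransformers(
			transformer.NewCharFilter("\t"),     // 1) remove tabs first
			transformer.NewCharDedup(" ", "\n"), // 2) then merge repeated spaces/newlines
		),
	)
	_ = src
}
```

Order matters: each transformer receives the output of the previous one.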
Typical Use Cases
| Scenario | Recommended Configuration |
|---|---|
| PDF text cleanup | CharDedup(" ", "\n") - Merge excess spaces and newlines from PDF extraction |
| Web content processing | CharFilter("\t") + CharDedup(" ") - Remove tabs and merge spaces |
| Code documentation processing | CharDedup("\n") - Merge excess blank lines, preserve code indentation |
| General text cleanup | CharFilter("\r") + CharDedup(" ", "\n") - Remove carriage returns and merge whitespace |
PDF File Support
Since the PDF reader depends on third-party libraries, to avoid introducing unnecessary dependencies in the main module, the PDF reader uses a separate go.mod.
To support PDF file reading, manually import the PDF reader package in your code:
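A blank import is enough to register the reader; the package path below is an assumption based on the framework's layout, so verify it against the repository:

```go
package main

import (
	// Blank import registers the PDF reader with the reader registry.
	// Because it lives in a separate go.mod, also run:
	//   go get trpc.group/trpc-go/trpc-agent-go/knowledge/document/reader/pdf
	_ "trpc.group/trpc-go/trpc-agent-go/knowledge/document/reader/pdf"
)

func main() {
	// After the import above, .pdf files are handled like any other
	// supported format by file/dir/auto sources.
}
```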
Note: Readers for other formats (.txt/.md/.csv/.json, etc.) are automatically registered and don't need manual import.