Skip to content

Document Source Configuration

Example Code: examples/knowledge/sources

The source module provides various document source types, each supporting rich configuration options.

Supported Document Source Types

Source Type Description Example
File Source (file) Single file processing Example
Directory Source (dir) Batch directory processing Example
Repo Source (repo) Git repository / local repo directory AST Example
URL Source (url) Fetch content from web pages Example
Auto Source (auto) Intelligent type detection Example

File Source

Single file processing, supports .txt, .md, .json, .doc, .csv, and other formats:

import (
    filesource "trpc.group/trpc-go/trpc-agent-go/knowledge/source/file"
)

fileSrc := filesource.New(
    []string{"./data/llm.md"},
    filesource.WithChunkSize(1000),      // Chunk size
    filesource.WithChunkOverlap(200),    // Chunk overlap
    filesource.WithName("LLM Doc"),
    filesource.WithMetadataValue("type", "documentation"),
)

Directory Source

Batch directory processing with recursive and filtering support:

import (
    dirsource "trpc.group/trpc-go/trpc-agent-go/knowledge/source/dir"
)

dirSrc := dirsource.New(
    []string{"./docs"},
    dirsource.WithRecursive(true),                           // Recursively process subdirectories
    dirsource.WithFileExtensions([]string{".md", ".txt"}),   // File extension filter
    dirsource.WithChunkSize(800),
    dirsource.WithName("Documentation"),
)

URL Source

Fetch content from web pages and APIs:

import (
    urlsource "trpc.group/trpc-go/trpc-agent-go/knowledge/source/url"
)

urlSrc := urlsource.New(
    []string{"https://en.wikipedia.org/wiki/Artificial_intelligence"},
    urlsource.WithChunkSize(1000),
    urlsource.WithChunkOverlap(200),
    urlsource.WithName("Web Content"),
)

URL Source Advanced Configuration

Separate content fetching and document identification:

import (
    urlsource "trpc.group/trpc-go/trpc-agent-go/knowledge/source/url"
)

urlSrcAlias := urlsource.New(
    []string{"https://trpc-go.com/docs/api.md"},     // Identifier URL (for document ID and metadata)
    urlsource.WithContentFetchingURL([]string{"https://github.com/trpc-group/trpc-go/raw/main/docs/api.md"}), // Actual content fetching URL
    urlsource.WithName("TRPC API Docs"),
    urlsource.WithMetadataValue("source", "github"),
)

Note: When using WithContentFetchingURL, the identifier URL should retain the file information from the content fetching URL, for example: - Correct: Identifier URL is https://trpc-go.com/docs/api.md, fetching URL is https://github.com/.../docs/api.md - Incorrect: Identifier URL is https://trpc-go.com, loses document path information

Auto Source

Intelligent type detection, automatically selects processor:

import (
    autosource "trpc.group/trpc-go/trpc-agent-go/knowledge/source/auto"
)

autoSrc := autosource.New(
    []string{
        "Cloud computing provides on-demand access to computing resources.",
        "https://docs.example.com/api",
        "./config.yaml",
    },
    autosource.WithName("Mixed Sources"),
    autosource.WithChunkSize(1000),
)

Repo Source

The repo source targets code repository scenarios, suited for:

  • Loading a remote Git URL directly
  • Loading a locally checked-out repository directory
  • Uniformly processing Go / Python / Proto / Markdown and other content within a single repository

Current open-source status: AST-aware code parsing is currently open-sourced for Go, Python, and Proto / PB. Support for C++, JavaScript, and other languages is being progressively open-sourced. For languages not yet open-sourced, the repo source can still process text files via plain document readers, but without AST-level semantic entities.

Typical Use Cases

  • Loading a remote Git repository to build a code knowledge base
  • Loading a local repository restricted to a specific subdirectory
  • Unified ingest of Go + Python + Markdown (and other supported types) within a single repository

Basic Usage

import (
    _ "trpc.group/trpc-go/trpc-agent-go/knowledge/document/reader/golang"
    _ "trpc.group/trpc-go/trpc-agent-go/knowledge/document/reader/python"
    reposource "trpc.group/trpc-go/trpc-agent-go/knowledge/source/repo"
)

repoSrc := reposource.New(
    reposource.WithRepository(
        reposource.Repository{
            URL:    "https://github.com/trpc-group/trpc-go",
            Branch: "main",
        },
    ),
    reposource.WithName("Code Repository"),
    reposource.WithFileExtensions([]string{".go", ".py", ".md"}),
)

The Go AST reader and Python AST reader are optional modules that require blank imports for registration:

  • Scanning .go files → knowledge/document/reader/golang
  • Scanning .py files → knowledge/document/reader/python

The Proto reader is registered by default and needs no extra import.

Note: The Python reader uses an embedded Python script for AST parsing. It requires Python 3.9+ installed on the system (only uses the standard library ast module, no third-party dependencies).

Repository Struct

Repository describes a single repository input with independent version and scope configuration:

Field Description
URL Remote Git repository URL
Dir Local repository directory
Branch Target branch
Tag Target tag
Commit Target commit
Subdir Scan only a subdirectory within the repository
RepoName Custom repository name
RepoURL Custom repository URL (overrides auto-detection)

URL and Dir are mutually exclusive. A single repo.Source processes only one repository input.

Version Selection Priority

When multiple version fields are set, the priority is:

  1. Commit
  2. Tag
  3. Branch

That is, if both Commit and Branch are provided, Commit is checked out.

Scan Scope Control

  • WithFileExtensions — controls which file extensions are scanned
  • WithSkipDirs — controls which directory names are skipped
  • WithSkipSuffixes — controls which file suffixes are skipped
  • Repository.Subdir — restricts scanning to a subdirectory within the repository

Example: scan only Go and Markdown files under server/:

repoSrc := reposource.New(
    reposource.WithRepository(
        reposource.Repository{
            URL:    "https://github.com/trpc-group/trpc-go",
            Branch: "main",
            Subdir: "server",
        },
    ),
    reposource.WithFileExtensions([]string{".go", ".md"}),
    reposource.WithSkipSuffixes([]string{".pb.go", ".trpc.go", "_mock.go"}),
)

Metadata

The repo source enriches documents produced by readers with repository-level metadata:

Metadata Key Description
trpc_agent_go_source=repo Document originates from a repo source
trpc_agent_go_repo_path Local root directory of the cloned repository
trpc_ast_repo_name Repository name
trpc_ast_repo_url Repository URL
trpc_ast_branch Version identifier being parsed (branch/tag/commit)
trpc_ast_file_path Repo-relative file path

Notes:

  • trpc_ast_file_path represents the logical path within the repository, not a remote Git URL.
  • For Git URL inputs, the repo source first clones to a temporary directory, then writes the repo-relative path into trpc_ast_file_path.

Relation to AST Readers

The repo source does not parse code itself; it dispatches to the appropriate reader based on file type:

  • .go → Go AST reader
  • .py → Python AST reader
  • .proto → Proto AST reader
  • .md → Markdown reader
  • Other registered extensions → corresponding reader

Parsed Output Example

For AST-aware files (.go / .py / .proto), the repo source chunks code by semantic entity. Each chunk contains three layers:

  • content: A semantically complete code fragment (e.g., a full struct/class/function definition), not character-truncated text
  • embedding text: A structured summary (name / signature / comment, etc.) optimized for vector retrieval
  • metadata: trpc_ast_* fields (type / full_name / language / file_path, etc.) for precise filtering and locating

Below is an example chunk for a Go struct:

content:
// Server is a tRPC server.
// One process, one server. A server may offer one or more services.
type Server struct {
    MaxCloseWaitTime time.Duration
    services         map[string]Service
    ...
}

embedding text:
{"name": "Server", "signature": "type Server struct", "type": "Struct",
 "full_name": "trpc.group/trpc-go/trpc-go/server.Server",
 "comment": "Server is a tRPC server. ..."}

metadata:
trpc_ast_type: Struct
trpc_ast_full_name: trpc.group/trpc-go/trpc-go/server.Server
trpc_ast_signature: type Server struct
trpc_ast_language: go
trpc_ast_file_path: server/server.go
trpc_ast_repo_name: trpc-go

For Python files, chunking follows the same approach at Class / Function / Method granularity; for .proto files, it chunks by service / rpc / message / enum.

Code Graph (GraphRAG)

Example Code: examples/knowledge/features/graphrag

Beyond vector retrieval (embedding + vector store), the repo source also supports storing structural code relationships as a graph. AST readers extract edges between entities during parsing. Combined with a graph database (Apache AGE), this enables structural code navigation.

AGE Graph Viewer - Code Graph Visualization

Edge Types

Edge Type Meaning Example
CALLS Function/method call mainserver.Start
METHOD Method of a class/struct ServerServer.Start
FIELD Struct field Serverservices
PARAM Function parameter NewServeropts
RETURNS Function return type NewServerServer
INHERITS Inheritance/implementation MyRunnerBaseRunner
CONTAINS Containment package serverServer

Usage

import (
    _ "trpc.group/trpc-go/trpc-agent-go/knowledge/document/reader/golang"
    _ "trpc.group/trpc-go/trpc-agent-go/knowledge/document/reader/python"
    "trpc.group/trpc-go/trpc-agent-go/knowledge"
    agegraphstore "trpc.group/trpc-go/trpc-agent-go/knowledge/graphstore/age"
    "trpc.group/trpc-go/trpc-agent-go/knowledge/source/repo"
    "trpc.group/trpc-go/trpc-agent-go/knowledge/vectorstore/pgvector"
    knowledgetool "trpc.group/trpc-go/trpc-agent-go/knowledge/tool"
)

// 1. Configure repo source
repoSrc := repo.New(
    repo.WithRepository(repo.Repository{URL: "https://github.com/example/repo"}),
    repo.WithFileExtensions([]string{".go", ".py"}),
)

// 2. Create GraphKnowledge (graph + vector hybrid retrieval)
ageStore, err := agegraphstore.New(
    agegraphstore.WithClientDSN(ageDSN),
    agegraphstore.WithGraphName("my_graph"),
)
if err != nil {
    return err
}
vectorStore, err := pgvector.New(
    pgvector.WithPGVectorClientDSN(pgvectorDSN),
    pgvector.WithTable("graph_vectors"),
    pgvector.WithIndexDimension(1536),
)
if err != nil {
    return err
}
gk := knowledge.NewGraphKnowledge(
    knowledge.WithGraphStore(ageStore),
    knowledge.WithGraphVectorStore(vectorStore),
    knowledge.WithGraphEmbedder(embedder),
)

// 3. Load graph data
gk.LoadGraphSource(ctx, repoSrc)

// 4. Inject graph tools into the Agent
toolSet := knowledgetool.NewCodeGraphSearchTool(gk)
// Exposes code_graph_search / code_graph_traverse / code_graph_find_paths

Agent Graph Tools

Tool Function
code_graph_search Vector search for AST nodes, returns matching code entities
code_graph_traverse Traverse related nodes from a given start node along edges (e.g., find all callers of a function)
code_graph_find_paths Find paths between two code entities (e.g., trace a call chain)

By locating entry nodes through vector search and then exploring structural relationships via graph traversal, the Agent can understand code architecture without reading large amounts of source code.

Combined Usage

import (
    "trpc.group/trpc-go/trpc-agent-go/knowledge"
    openaiembedder "trpc.group/trpc-go/trpc-agent-go/knowledge/embedder/openai"
    "trpc.group/trpc-go/trpc-agent-go/knowledge/source"
    filesource "trpc.group/trpc-go/trpc-agent-go/knowledge/source/file"
    dirsource "trpc.group/trpc-go/trpc-agent-go/knowledge/source/dir"
    urlsource "trpc.group/trpc-go/trpc-agent-go/knowledge/source/url"
    autosource "trpc.group/trpc-go/trpc-agent-go/knowledge/source/auto"
    vectorinmemory "trpc.group/trpc-go/trpc-agent-go/knowledge/vectorstore/inmemory"
)

// Combine multiple sources
sources := []source.Source{fileSrc, dirSrc, urlSrc, autoSrc}

embedder := openaiembedder.New(openaiembedder.WithModel("text-embedding-3-small"))
vectorStore := vectorinmemory.New()

// Pass to Knowledge
kb := knowledge.New(
    knowledge.WithEmbedder(embedder),
    knowledge.WithVectorStore(vectorStore),
    knowledge.WithSources(sources),
)

// Load all sources
if err := kb.Load(ctx); err != nil {
    log.Fatalf("Failed to load knowledge base: %v", err)
}

Chunking Strategy

Example Code: fixed-chunking | recursive-chunking

Chunking is the process of splitting long documents into smaller fragments, which is crucial for vector retrieval. The framework provides multiple built-in chunking strategies and supports custom strategies.

Built-in Chunking Strategies

Strategy Description Use Case
FixedSizeChunking Fixed-size chunking General text, simple and fast
RecursiveChunking Recursive chunking by separator hierarchy Preserving semantic integrity
MarkdownChunking Chunk by Markdown structure Markdown documents (default)
JSONChunking Chunk by JSON structure JSON files (default)

Default Behavior

Each file type has an associated chunking strategy:

  • .md files → MarkdownChunking (recursively chunk by heading levels H1→H6→paragraph→fixed size)
  • .json files → JSONChunking (chunk by JSON structure)
  • .txt/.csv/.docx etc. → FixedSizeChunking

Default Parameters:

Parameter Default Description
ChunkSize 1024 Maximum characters per chunk
Overlap 128 Overlapping characters between adjacent chunks

Default chunking strategies are affected by chunkSize parameter. The overlap parameter only applies to FixedSizeChunking, RecursiveChunking, and MarkdownChunking. JSONChunking does not support overlap.

Adjust default strategy parameters via WithChunkSize and WithChunkOverlap:

1
2
3
4
5
fileSrc := filesource.New(
    []string{"./data/document.txt"},
    filesource.WithChunkSize(512),     // Chunk size (characters)
    filesource.WithChunkOverlap(64),   // Chunk overlap (characters)
)

Custom Chunking Strategy

Use WithCustomChunkingStrategy to override the default chunking strategy.

Note: Custom chunking strategy completely overrides WithChunkSize and WithChunkOverlap configurations. Chunking parameters must be set within the custom strategy.

FixedSizeChunking - Fixed Size Chunking

Splits text by fixed character count with overlap support:

import (
    "trpc.group/trpc-go/trpc-agent-go/knowledge/chunking"
    filesource "trpc.group/trpc-go/trpc-agent-go/knowledge/source/file"
)

// Create fixed-size chunking strategy
fixedChunking := chunking.NewFixedSizeChunking(
    chunking.WithChunkSize(512),   // Max 512 characters per chunk
    chunking.WithOverlap(64),      // 64 characters overlap between chunks
)

fileSrc := filesource.New(
    []string{"./data/document.md"},
    filesource.WithCustomChunkingStrategy(fixedChunking),
)

RecursiveChunking - Recursive Chunking

Recursively splits by separator hierarchy, preferring natural boundaries:

import (
    "trpc.group/trpc-go/trpc-agent-go/knowledge/chunking"
    filesource "trpc.group/trpc-go/trpc-agent-go/knowledge/source/file"
)

// Create recursive chunking strategy
recursiveChunking := chunking.NewRecursiveChunking(
    chunking.WithRecursiveChunkSize(512),   // Max chunk size
    chunking.WithRecursiveOverlap(64),      // Overlap between chunks
    // Custom separator priority (optional)
    chunking.WithRecursiveSeparators([]string{"\n\n", "\n", ". ", " "}),
)

fileSrc := filesource.New(
    []string{"./data/article.txt"},
    filesource.WithCustomChunkingStrategy(recursiveChunking),
)

Separator Priority Explanation:

  1. \n\n - First try to split by paragraph
  2. \n - Then split by line
  3. . - Then split by sentence
  4. - Split by space

Recursive chunking attempts to use higher priority separators, only using the next level separator when chunks still exceed the maximum size. If all separators fail to split text within chunkSize, it will force split by chunkSize.

Configuring Metadata

To enable filter functionality, it's recommended to add rich metadata when creating document sources.

For detailed filter usage guide, please refer to Filter Documentation.

sources := []source.Source{
    // File source with metadata
    filesource.New(
        []string{"./docs/api.md"},
        filesource.WithName("API Documentation"),
        filesource.WithMetadataValue("category", "documentation"),
        filesource.WithMetadataValue("topic", "api"),
        filesource.WithMetadataValue("service_type", "gateway"),
        filesource.WithMetadataValue("protocol", "trpc-go"),
        filesource.WithMetadataValue("version", "v1.0"),
    ),

    // Directory source with metadata
    dirsource.New(
        []string{"./tutorials"},
        dirsource.WithName("Tutorials"),
        dirsource.WithMetadataValue("category", "tutorial"),
        dirsource.WithMetadataValue("difficulty", "beginner"),
        dirsource.WithMetadataValue("topic", "programming"),
    ),

    // URL source with metadata
    urlsource.New(
        []string{"https://example.com/wiki/rpc"},
        urlsource.WithName("RPC Wiki"),
        urlsource.WithMetadataValue("category", "encyclopedia"),
        urlsource.WithMetadataValue("source_type", "web"),
        urlsource.WithMetadataValue("topic", "rpc"),
        urlsource.WithMetadataValue("language", "zh"),
    ),
}

Content Transformer

Example Code: examples/knowledge/features/transform

Transformer is used to preprocess and postprocess content before and after document chunking. This is particularly useful for cleaning text extracted from PDFs, web pages, and other sources, removing excess whitespace, duplicate characters, and other noise.

Processing Flow

Document → Preprocess → Processed Document → Chunking → Chunks → Postprocess → Final Chunks

Built-in Transformers

CharFilter - Character Filter

Removes specified characters or strings:

1
2
3
4
import "trpc.group/trpc-go/trpc-agent-go/knowledge/transform"

// Remove newlines, tabs, and carriage returns
filter := transform.NewCharFilter("\n", "\t", "\r")

CharDedup - Character Deduplicator

Merges consecutive duplicate characters or strings into a single instance:

1
2
3
4
5
6
7
8
import "trpc.group/trpc-go/trpc-agent-go/knowledge/transform"

// Merge multiple consecutive spaces into one, merge multiple newlines into one
dedup := transform.NewCharDedup(" ", "\n")

// Example:
// Input:  "hello     world\n\n\nfoo"
// Output: "hello world\nfoo"

Usage

Transformers are passed to various document sources via the WithTransformers option:

import (
    filesource "trpc.group/trpc-go/trpc-agent-go/knowledge/source/file"
    dirsource "trpc.group/trpc-go/trpc-agent-go/knowledge/source/dir"
    urlsource "trpc.group/trpc-go/trpc-agent-go/knowledge/source/url"
    autosource "trpc.group/trpc-go/trpc-agent-go/knowledge/source/auto"
    "trpc.group/trpc-go/trpc-agent-go/knowledge/transform"
)

// Create transformers
filter := transform.NewCharFilter("\t")           // Remove tabs
dedup := transform.NewCharDedup(" ", "\n")        // Merge consecutive spaces and newlines

// File source with transformers
fileSrc := filesource.New(
    []string{"./data/document.pdf"},
    filesource.WithTransformers(filter, dedup),
)

// Directory source with transformers
dirSrc := dirsource.New(
    []string{"./docs"},
    dirsource.WithTransformers(filter, dedup),
)

// URL source with transformers
urlSrc := urlsource.New(
    []string{"https://example.com/article"},
    urlsource.WithTransformers(filter, dedup),
)

// Auto source with transformers
autoSrc := autosource.New(
    []string{"./mixed-content"},
    autosource.WithTransformers(filter, dedup),
)

Combining Multiple Transformers

Multiple transformers are executed in sequence:

1
2
3
4
5
6
7
8
// First remove tabs, then merge consecutive spaces
filter := transform.NewCharFilter("\t")
dedup := transform.NewCharDedup(" ")

src := filesource.New(
    []string{"./data/messy.txt"},
    filesource.WithTransformers(filter, dedup),  // Executed in order
)

Typical Use Cases

Scenario Recommended Configuration
PDF text cleanup CharDedup(" ", "\n") - Merge excess spaces and newlines from PDF extraction
Web content processing CharFilter("\t") + CharDedup(" ") - Remove tabs and merge spaces
Code documentation processing CharDedup("\n") - Merge excess blank lines, preserve code indentation
General text cleanup CharFilter("\r") + CharDedup(" ", "\n") - Remove carriage returns and merge whitespace

PDF File Support

Since the PDF reader depends on third-party libraries, to avoid introducing unnecessary dependencies in the main module, the PDF reader uses a separate go.mod.

To support PDF file reading, manually import the PDF reader package in your code:

1
2
3
4
import (
    // Import PDF reader to support .pdf file parsing
    _ "trpc.group/trpc-go/trpc-agent-go/knowledge/document/reader/pdf"
)

Note: Readers for other formats (.txt/.md/.csv/.json, etc.) are automatically registered and don't need manual import.