OCR Text Recognition
The Knowledge system supports OCR (Optical Character Recognition), which extracts text from images and scanned PDF documents so that image-only content can serve as a data source for knowledge bases.
The framework currently has built-in support for the Tesseract OCR engine.
Environment Setup
1. Install Tesseract
Before using OCR functionality, you must install the Tesseract engine and its language packs on your system.
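Linux (Ubuntu/Debian): `sudo apt-get install tesseract-ocr libtesseract-dev` (add language packs such as tesseract-ocr-chi-sim as needed)
macOS: `brew install tesseract` (the optional tesseract-lang formula adds more languages)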
Windows: Please download and install from UB-Mannheim/tesseract.
2. Go Build Tag
The Tesseract binding uses CGO. To keep that dependency away from users who don't need OCR, the OCR functionality is compiled only when the tesseract build tag is set.
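When running or compiling code that includes OCR functionality, you must add the -tags tesseract flag, for example `go run -tags tesseract main.go` or `go build -tags tesseract ./...`.
It's also recommended to add a build constraint at the top of each code file that uses OCR, so the file is skipped when the tag is absent: `//go:build tesseract`, followed by a blank line before the package clause.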
Quick Start
Complete Example: examples/knowledge/features/OCR
Basic Usage
Configuration Options
Tesseract Configuration
tesseract.New supports the following configuration options:
| Option | Description | Default |
|---|---|---|
| WithLanguage(lang) | Set recognition language(s); use + to combine multiple languages (e.g., eng+chi_sim) | "eng" |
| WithConfidenceThreshold(score) | Set the minimum confidence threshold (0-100); results below this threshold are rejected | 60.0 |
| WithPageSegMode(mode) | Set the page segmentation mode (PSM 0-13); corresponds to Tesseract's --psm parameter | 3 (fully automatic) |
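
The options in the table follow the functional-options pattern. The sketch below shows how such options could be applied on top of the documented defaults. It is a self-contained illustration, not the framework's implementation: the `config` type, `newConfig`, `JoinLanguages`, `accept`, and the range validation are hypothetical stand-ins, and only the option names, ranges, and default values come from the table.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// config is a hypothetical stand-in for the engine settings; the real
// tesseract package may store them differently.
type config struct {
	language            string
	confidenceThreshold float64
	pageSegMode         int
}

// Option mutates a config, following the functional-options pattern.
type Option func(*config)

// WithLanguage sets the recognition language string, e.g. "eng+chi_sim".
func WithLanguage(lang string) Option {
	return func(c *config) { c.language = lang }
}

// WithConfidenceThreshold sets the minimum accepted confidence (0-100).
func WithConfidenceThreshold(score float64) Option {
	return func(c *config) { c.confidenceThreshold = score }
}

// WithPageSegMode sets the page segmentation mode (PSM 0-13).
func WithPageSegMode(mode int) Option {
	return func(c *config) { c.pageSegMode = mode }
}

// JoinLanguages (hypothetical helper) combines language codes with "+".
func JoinLanguages(langs ...string) string {
	return strings.Join(langs, "+")
}

// newConfig starts from the documented defaults, applies the options,
// and checks the documented ranges. The validation is an assumption.
func newConfig(opts ...Option) (config, error) {
	c := config{language: "eng", confidenceThreshold: 60.0, pageSegMode: 3}
	for _, opt := range opts {
		opt(&c)
	}
	if c.language == "" {
		return config{}, errors.New("language must not be empty")
	}
	if c.confidenceThreshold < 0 || c.confidenceThreshold > 100 {
		return config{}, fmt.Errorf("confidence threshold %.1f outside 0-100", c.confidenceThreshold)
	}
	if c.pageSegMode < 0 || c.pageSegMode > 13 {
		return config{}, fmt.Errorf("page segmentation mode %d outside 0-13", c.pageSegMode)
	}
	return c, nil
}

// accept reports whether a result with the given confidence passes the
// threshold; results below the threshold are rejected.
func (c config) accept(confidence float64) bool {
	return confidence >= c.confidenceThreshold
}

func main() {
	// PSM 6 tells Tesseract to assume a single uniform block of text.
	c, err := newConfig(
		WithLanguage(JoinLanguages("eng", "chi_sim")),
		WithPageSegMode(6),
	)
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Printf("%+v\n", c)
	fmt.Println(c.accept(72.5), c.accept(41.0))
}
```

Options that are not passed keep their defaults, so the example above still uses a confidence threshold of 60.0.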
Source Integration
OCR extractors can be integrated into the following Sources via the WithOCRExtractor option:
- File Source: filesource.WithOCRExtractor(ocr)
- Directory Source: dirsource.WithOCRExtractor(ocr)
- Auto Source: autosource.WithOCRExtractor(ocr)
When an OCR extractor is configured, the Source will attempt to perform OCR processing on images or pages when handling supported file types (such as PDF).
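
The wiring described above can be pictured with a small, self-contained sketch that runs without Tesseract. Everything in it is a hypothetical stand-in: the `Extractor` interface, the `page` and `source` types, `newSource`, `load`, and `fakeOCR` are invented for illustration, and only the `WithOCRExtractor` option name comes from this page. The rule that OCR runs only on pages that have an image but no embedded text is likewise an assumption of the sketch, not confirmed framework behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// Extractor is a hypothetical stand-in for the framework's OCR
// extractor interface.
type Extractor interface {
	ExtractText(image []byte) (string, error)
}

// page is a simplified document page: embedded text plus optional
// image data.
type page struct {
	text  string
	image []byte
}

// source is a stand-in for a file source; a nil ocr field means OCR
// is disabled.
type source struct {
	pages []page
	ocr   Extractor
}

// Option configures a source.
type Option func(*source)

// WithOCRExtractor attaches an OCR extractor to the source.
func WithOCRExtractor(e Extractor) Option {
	return func(s *source) { s.ocr = e }
}

// newSource builds a source and applies the given options.
func newSource(pages []page, opts ...Option) *source {
	s := &source{pages: pages}
	for _, opt := range opts {
		opt(s)
	}
	return s
}

// load returns the text of each page. OCR runs only when an extractor
// is configured and the page has an image but no embedded text.
func (s *source) load() ([]string, error) {
	out := make([]string, 0, len(s.pages))
	for i, p := range s.pages {
		text := p.text
		if strings.TrimSpace(text) == "" && len(p.image) > 0 && s.ocr != nil {
			recognized, err := s.ocr.ExtractText(p.image)
			if err != nil {
				return nil, fmt.Errorf("ocr page %d: %w", i+1, err)
			}
			text = recognized
		}
		out = append(out, text)
	}
	return out, nil
}

// fakeOCR pretends to recognize text so the sketch needs no CGO.
type fakeOCR struct{}

func (fakeOCR) ExtractText(image []byte) (string, error) {
	return fmt.Sprintf("[ocr %d bytes]", len(image)), nil
}

func main() {
	pages := []page{
		{text: "digital text"},
		{image: []byte{1, 2, 3}}, // scanned page without embedded text
	}

	plain, _ := newSource(pages).load()
	withOCR, _ := newSource(pages, WithOCRExtractor(fakeOCR{})).load()

	fmt.Printf("%q\n", plain)
	fmt.Printf("%q\n", withOCR)
}
```

Without an extractor the scanned page yields an empty string; with one attached, the same page yields recognized text, which is why the option only needs to be set on sources that actually contain scanned content.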
Notes
- Performance Impact: OCR processing is computationally intensive and will significantly increase document loading time. It's recommended to enable this feature only for document sources that require OCR.
- CGO Dependency: Using OCR functionality makes the compiled binary depend on system libraries (libtesseract). Ensure the deployment environment has the required dependencies installed.
- PDF Support: To process PDF files, make sure to import the knowledge/document/reader/pdf package. This Reader automatically detects image content in PDFs and invokes the OCR engine.