openocr-skills
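Extract text from images, documents, and scanned PDFs with OpenOCR. Supports text detection, text recognition, VLM-based universal recognition, and document parsing with layout analysis.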
Installation / Download
TotalClaw CLI (recommended):
totalclaw install totalclaw:totalclaw~topdu-opencr-skill
cURL (direct download, no login required):
curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~topdu-opencr-skill/file -o topdu-opencr-skill.md
# OpenOCR Skill
## Overview
This skill enables intelligent text extraction, document parsing, and universal recognition using **OpenOCR**, an accurate and efficient general OCR system. It provides a unified interface for text detection, text recognition, end-to-end OCR, VLM-based universal recognition (text/formulas/tables), and document parsing with layout analysis. It supports Chinese, English, and other languages.
## How to Use
1. Provide the image, scanned document, or PDF
2. Optionally specify the task type (det/rec/ocr/unirec/doc)
3. I'll extract text, formulas, tables, or full document structure
**Example prompts:**
- "Extract all text from this image"
- "Detect text regions in this photo"
- "Recognize the formula in this screenshot"
- "Parse this PDF document with layout analysis"
- "Convert this scanned page to Markdown"
## Domain Knowledge
### OpenOCR Fundamentals
```python
from openocr import OpenOCR
# Initialize with a specific task
engine = OpenOCR(task='ocr')
# Run OCR on an image (callable interface)
results, time_dicts = engine(image_path='image.jpg')
# Results contain detected boxes with recognized text
for result in results:
    for line in result:
        box = line[0]       # Bounding box coordinates
        text = line[1][0]   # Recognized text
        conf = line[1][1]   # Confidence score
        print(f"{text} ({conf:.2f})")
```
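The nested result structure above can be flattened into plain text with a small helper. This is a minimal sketch that assumes the `[box, (text, confidence)]` line layout shown in the example. The helper `lines_to_text`, its `min_conf` parameter, and the sample data are hypothetical and not part of OpenOCR.

```python
def lines_to_text(results, min_conf=0.0):
    """Join recognized text across all images, skipping low-confidence lines."""
    kept = []
    for result in results:      # one entry per image
        for line in result:     # one entry per detected text line
            text, conf = line[1][0], line[1][1]
            if conf >= min_conf:
                kept.append(text)
    return "\n".join(kept)


# Hypothetical sample shaped like the documented output (not real OCR output).
sample_results = [
    [
        [[[0, 0], [50, 0], [50, 10], [0, 10]], ("Hello", 0.98)],
        [[[0, 20], [50, 20], [50, 30], [0, 30]], ("wor1d", 0.41)],
    ]
]

print(lines_to_text(sample_results, min_conf=0.5))
```

With real output, `results` would come from `engine(image_path=...)` instead of `sample_results`.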
### Supported Tasks
```python
# Available task types
tasks = {
    'det': 'Text Detection - detect text regions with bounding boxes',
    'rec': 'Text Recognition - recognize text from cropped images',
    'ocr': 'End-to-End OCR - detection + recognition pipeline',
    'unirec': 'Universal Recognition - VLM-based text/formula/table recognition (0.1B params)',
    'doc': 'Document Parsing - layout analysis + universal recognition (0.1B params)',
}
# Task selection via parameter
det_engine = OpenOCR(task='det')
rec_engine = OpenOCR(task='rec')
ocr_engine = OpenOCR(task='ocr')
unirec_engine = OpenOCR(task='unirec')
doc_engine = OpenOCR(task='doc')
```
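Because the task type is optional in a request, a caller may want to validate an explicit task name and fall back to a default otherwise. The sketch below is one possible way to do that. The function `resolve_task` and its fallback rule (PDFs default to `doc`, everything else to `ocr`) are assumptions made for illustration, not OpenOCR behavior.

```python
# The five task names listed above.
TASKS = {"det", "rec", "ocr", "unirec", "doc"}


def resolve_task(requested=None, path=""):
    """Return a task name: honor an explicit request, else guess from the file type."""
    if requested is not None:
        task = requested.strip().lower()
        if task not in TASKS:
            raise ValueError(f"unknown task {requested!r}; expected one of {sorted(TASKS)}")
        return task
    # Assumed default: treat PDFs as whole documents, anything else as a plain image.
    return "doc" if path.lower().endswith(".pdf") else "ocr"


print(resolve_task("OCR"))                   # explicit request, normalized
print(resolve_task(path="scan.pdf"))         # no request, PDF input
print(resolve_task(path="photo.jpg"))        # no request, image input
```

The returned string can then be passed as `OpenOCR(task=...)`.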
### Configuration Options
```python
from openocr import OpenOCR
# === Text Detection ===
detector = OpenOCR(
    task='det',
    backend='onnx',            # 'onnx' (default) or 'torch'
    onnx_det_model_path=None,  # Custom detection model (auto-downloads if None)
    use_gpu='auto',            # 'auto', 'true', or 'false'
)
# === Text Recognition ===
recognizer = OpenOCR(
    task='rec',
    mode='mobile',             # 'mobile' (fast) or 'server' (accurate)
    backend='onnx',            # 'onnx' (default) or 'torch'
    onnx_rec_model_path=None,  # Custom recognition model
    use_gpu='auto',
)
# === End-to-End OCR ===
ocr = OpenOCR(
    task='ocr',
    mode='mobile',             # 'mobile' or 'server'
    backend='onnx',            # 'onnx' or 'torch'
    onnx_det_model_path=None,  # Custom detection model
    onnx_rec_model_path=None,  # Custom recognition model
    drop_score=0.5,            # Confidence threshold for filtering
    det_box_type='quad',       # 'quad' or 'poly' (for curved text)
    use_gpu='auto',
)
# === Universal Recognition (UniRec) ===
unirec = OpenOCR(
    task='unirec',
    unirec_encoder_path=None,     # Custom encoder ONNX model
    unirec_decoder_path=None,     # Custom decoder ONNX model
    tokenizer_mapping_path=None,  # Custom tokenizer mapping JSON
    max_length=2048,              # Max generation length
    auto_download=True,           # Auto-download missing models
    use_gpu='auto',
)
# === Document Parsing (OpenDoc) ===
doc = OpenOCR(
    task='doc',
    layout_model_path=None,       # Custom layout detection model (PP-DocLayoutV2)
    unirec_encoder_path=None,     # Custom UniRec encoder
    unirec_decoder_path=None,     # Custom UniRec decoder
    tokenizer_mapping_path=None,  # Custom tokenizer mapping
    layout_threshold=0.5,         # Layout detection threshold
    use_layout_detection=True,    # Enable layout analysis
    max_parallel_blocks=4,        # Max parallel VLM blocks
    auto_download=True,           # Auto-download missing models
    use_gpu='auto',
)
```
### Task-Specific Usage
#### Text Detection
```python
from openocr import OpenOCR
detector = OpenOCR(task='det', backend='onnx')
# Detect text regions
results = detector(image_path='image.jpg')
boxes = results[0]['boxes'] # np.ndarray of bounding boxes
elapse = results[0]['elapse'] # Processing time in seconds
print(f"Found {len(boxes)} text regions in {elapse:.3f}s")
for box in boxes:
    print(f"  Box: {box.tolist()}")
```
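Detected boxes are often easier to crop or compare once reduced to axis-aligned rectangles. The following sketch assumes each box is a sequence of `(x, y)` corner points, which the source does not state explicitly. The helper `quad_to_rect` and the sample box are hypothetical.

```python
def quad_to_rect(box):
    """Return (x_min, y_min, x_max, y_max) enclosing every point of a box."""
    xs = [float(p[0]) for p in box]
    ys = [float(p[1]) for p in box]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical slightly rotated quadrilateral (not real detector output).
sample_box = [[10, 5], [110, 8], [108, 40], [9, 37]]

x0, y0, x1, y1 = quad_to_rect(sample_box)
print(f"rect=({x0}, {y0}, {x1}, {y1}) width={x1 - x0} height={y1 - y0}")
```

The same function works on a row of the `boxes` array or on `box.tolist()`, since it only iterates over points. It also covers `poly` boxes with more than four points.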
#### Text Recognition
```python
from openocr import OpenOCR
# Mobile mode (fast, ONNX)
recognizer = OpenOCR(task='rec', mode='mobile', backend='onnx')
# Server mode (accurate, requires torch)
# recognizer = OpenOCR(task='rec', mode='server', backend='torch')
results = recognizer(image_path='word.jpg', batch_num=1)
text = results[0]['text'] # Recognized text string
score = results[0]['score'] # Confidence score
elapse = results[0]['elapse'] # Processing time
print(f"Text: {text}, Score: {score:.3f}, Time: {elapse:.3f}s")
```
#### End-to-End OCR
```python
from openocr import OpenOCR
ocr = OpenOCR(task='ocr', mode='mobile', backend='onnx')
# Run OCR with visualization
results, time_dicts = ocr(
    image_path='image.jpg',
    save_dir='./output',
    is_visualize=True,
    rec_batch_num=6,
)
# Process results
for result in results:
    for line in result:
        box, (text, confidence) = line[0], line[1]
        print(f"{text} ({confidence:.2f})")
```
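When the text has to be read back as a paragraph, lines can be sorted top to bottom and then left to right. The sketch below does this by bucketing lines into rows. It assumes the `[box, (text, confidence)]` layout from the example and boxes made of `(x, y)` points. `reading_order` and `row_tolerance` are hypothetical names, and whether OpenOCR already returns lines in reading order is not stated in the source.

```python
def reading_order(lines, row_tolerance=10):
    """Sort OCR lines by row bucket (top edge) first, then by left edge."""
    def key(line):
        box = line[0]
        top = min(p[1] for p in box)
        left = min(p[0] for p in box)
        # Lines whose top edges fall in the same bucket count as one row.
        return (int(top // row_tolerance), left)

    return sorted(lines, key=key)


# Hypothetical lines, deliberately out of order (not real OCR output).
sample_lines = [
    [[[60, 2], [100, 2], [100, 12], [60, 12]], ("world", 0.95)],
    [[[0, 30], [70, 30], [70, 40], [0, 40]], ("Second", 0.90)],
    [[[0, 0], [50, 0], [50, 10], [0, 10]], ("Hello", 0.99)],
]

ordered = [line[1][0] for line in reading_order(sample_lines)]
print(" ".join(ordered))
```

Fixed-size buckets are a simplification: two lines on the same visual row can land in different buckets when their top edges straddle a bucket boundary, so `row_tolerance` should be tuned to the typical line height.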
#### Universal Recognition (UniRec)
```python
from openocr import OpenOCR
unirec = OpenOCR(task='unirec')
# Image input
result_text, generated_ids = unirec(image_path='formula.jpg', max_length=2048)
print(f"Result: {result_text}")
# PDF input (returns list of tuples, one per page)
results = unirec(image_path='document.pdf', max_length=2048)
for page_text, page_ids in results:
    print(f"Page: {page_text[:100]}...")
```
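For PDF input, the per-page tuples can be merged into a single string. This is a minimal sketch that assumes the `(page_text, page_ids)` tuple shape described above. `join_pages`, the separator, and the sample data are hypothetical.

```python
def join_pages(page_results, separator="\n\n---\n\n"):
    """Concatenate per-page text, dropping token ids and blank pages."""
    texts = [text.strip() for text, _ids in page_results]
    return separator.join(t for t in texts if t)


# Hypothetical per-page results (not real UniRec output).
sample_pages = [
    ("Page one text", [1, 2]),
    ("   ", []),                # a blank page is skipped
    ("Page two text", [3]),
]

print(join_pages(sample_pages))
```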
#### Document Parsing (OpenDoc)
```python
from openocr import OpenOCR
doc = OpenOCR(task='doc', use_layout_detection=True)
# Parse a document image
result = doc(image_path='document.jpg')
# Save outputs in multiple formats
doc.save_to_markdown(result, './output')
doc.save_to_json(result, './output')
doc.save_visualization(result, './output')
# Parse a PDF (returns list of dicts, one per page)
results = doc(image_path='document.pdf')
for page_result in results:
    doc.save_to_markdown(page_result, './output')
```
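If each page of a PDF should end up in its own directory, the output paths can be generated up front. The sketch below only builds the paths. Whether `save_to_markdown` would otherwise overwrite files when given the same directory is not stated in the source, and `page_output_dirs` is a hypothetical helper.

```python
from pathlib import Path


def page_output_dirs(base_dir, num_pages):
    """Return one zero-padded directory path per page, e.g. output/page_001."""
    base = Path(base_dir)
    width = max(3, len(str(num_pages)))  # pad to at least three digits
    return [base / f"page_{i:0{width}d}" for i in range(1, num_pages + 1)]


dirs = page_output_dirs("output", 3)
for d in dirs:
    print(d.as_posix())

# Possible use with the loop above (assumes `doc` and `results` exist):
# for page_result, out_dir in zip(results, page_output_dirs("output", len(results))):
#     out_dir.mkdir(parents=True, exist_ok=True)
#     doc.save_to_markdown(page_result, str(out_dir))
```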
### Command-Line Interface
```bash
# Text Detection
openocr --task det --input_path image.jpg --is_vis
# Text Recognition
openocr --task rec --input_path word.jpg --mode server --backend torch
# End-to-End OCR
openocr --task ocr --input_path image.jpg --is_vis --output_path ./results
# Universal Recognition
openocr --task unirec --input_path formula.jpg --max_length 2048
# Document Parsing
openocr --task doc --input_path document.pdf \
    --use_layout_detection --save_vis --save_json --save_markdown
# Launch Gradio Demos
openocr --task launch_openocr_demo --share --server_port 7860
openocr --task launch_unirec_demo --share --server_port 7861
openocr --task launch_opendoc_demo --share --server_port 7862
```
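The CLI can also be driven from a script by assembling the argument list in Python. The sketch below uses only flags that appear in the commands above (`--task`, `--input_path`, `--output_path`, `--is_vis`). The wrapper `build_openocr_command` is hypothetical, and it only builds the command without running it.

```python
import shlex


def build_openocr_command(task, input_path, output_path=None, visualize=False):
    """Build an argv list for the `openocr` CLI from the flags shown above."""
    cmd = ["openocr", "--task", task, "--input_path", input_path]
    if output_path is not None:
        cmd += ["--output_path", output_path]
    if visualize:
        # --is_vis appears above only with the det and ocr tasks.
        cmd.append("--is_vis")
    return cmd


cmd = build_openocr_command("ocr", "image.jpg", output_path="./results", visualize=True)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it. Keeping the arguments as a list avoids shell quoting problems with paths that contain spaces.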
### Processing Different Sources
#### Image Files
```python
from openocr import OpenOCR
ocr = OpenOCR(task='ocr')
# Single image
results, _ = ocr(image_path='image.jpg')
# Directory of images
results, _ = ocr(image_path='./images/', save_dir='./output', is_visualize=True)
```
#### PDF Files
```python
from openocr import OpenOCR
# UniRec handles PDFs natively
unirec = OpenOCR(tas