openocr-skills

TotalClaw, author: openocr, v0.1.4

Extract text from images, documents, and scanned PDFs using OpenOCR, a lightweight and efficient OCR system whose document parsing model requires only 0.1B parameters and can run recognition on personal PCs. Supports text detection, recognition, universal VLM recognition, and document parsing with layout analysis.

Installation / Download

TotalClaw CLI (recommended)

totalclaw install totalclaw:totalclaw~topdu-openocr-skill

cURL (direct download, no login required)

curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~topdu-openocr-skill/file -o topdu-openocr-skill.md

## Original Text

# OpenOCR Skill

## Overview

This skill enables intelligent text extraction, document parsing, and universal recognition using **OpenOCR** - an accurate and efficient general OCR system. It provides a unified interface for text detection, text recognition, end-to-end OCR, VLM-based universal recognition (text/formulas/tables), and document parsing with layout analysis. Supports Chinese, English, and more.

## How to Use

1. Provide the image, scanned document, or PDF
2. Optionally specify the task type (det/rec/ocr/unirec/doc)
3. I'll extract text, formulas, tables, or full document structure

**Example prompts:**

- "Extract all text from this image"
- "Detect text regions in this photo"
- "Recognize the formula in this screenshot"
- "Parse this PDF document with layout analysis"
- "Convert this scanned page to Markdown"

## Domain Knowledge

### OpenOCR Fundamentals

```python
from openocr import OpenOCR

# Initialize with a specific task
engine = OpenOCR(task='ocr')

# Run OCR on an image (callable interface)
results, time_dicts = engine(image_path='image.jpg')

# Results contain detected boxes with recognized text
for result in results:
    for line in result:
        box = line[0]       # Bounding box coordinates
        text = line[1][0]   # Recognized text
        conf = line[1][1]   # Confidence score
        print(f"{text} ({conf:.2f})")
```
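
The nested layout in the loop above (one list per image, each line shaped like `[box, (text, confidence)]`) can be flattened into plain records for filtering or export. The following is a minimal sketch under that assumption; `OcrLine` and `flatten_results` are hypothetical helper names, not part of OpenOCR, and the sample data is synthetic.

```python
from typing import Any, List, NamedTuple, Sequence


class OcrLine(NamedTuple):
    """One recognized line (hypothetical helper type, not an OpenOCR class)."""
    image_index: int
    box: Any
    text: str
    confidence: float


def flatten_results(results: Sequence[Sequence[Any]], min_conf: float = 0.0) -> List[OcrLine]:
    """Flatten per-image results into records, dropping lines below min_conf.

    Assumes each line looks like [box, (text, confidence)].
    """
    lines = []
    for image_index, result in enumerate(results):
        for line in result:
            box, (text, conf) = line[0], line[1]
            if conf >= min_conf:
                lines.append(OcrLine(image_index, box, text, float(conf)))
    return lines


# Synthetic data in the assumed layout (not real OpenOCR output)
sample = [[
    [[[0, 0], [50, 0], [50, 10], [0, 10]], ("Hello", 0.98)],
    [[[0, 20], [50, 20], [50, 30], [0, 30]], ("w0rld", 0.41)],
]]
kept = flatten_results(sample, min_conf=0.5)
print([line.text for line in kept])
```

With a real engine, `flatten_results(results, min_conf=0.5)` would take the first value returned by the call shown above.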

### Supported Tasks

```python
# Available task types
tasks = {
    'det':    'Text Detection - detect text regions with bounding boxes',
    'rec':    'Text Recognition - recognize text from cropped images',
    'ocr':    'End-to-End OCR - detection + recognition pipeline',
    'unirec': 'Universal Recognition - VLM-based text/formula/table recognition (0.1B params)',
    'doc':    'Document Parsing - layout analysis + universal recognition (0.1B params)',
}

# Task selection via parameter
det_engine = OpenOCR(task='det')
rec_engine = OpenOCR(task='rec')
ocr_engine = OpenOCR(task='ocr')
unirec_engine = OpenOCR(task='unirec')
doc_engine = OpenOCR(task='doc')
```

### Configuration Options

```python
from openocr import OpenOCR

# === Text Detection ===
detector = OpenOCR(
    task='det',
    backend='onnx',                          # 'onnx' (default) or 'torch'
    onnx_det_model_path=None,                # Custom detection model (auto-downloads if None)
    use_gpu='auto',                          # 'auto', 'true', or 'false'
)

# === Text Recognition ===
recognizer = OpenOCR(
    task='rec',
    mode='mobile',                           # 'mobile' (fast) or 'server' (accurate)
    backend='onnx',                          # 'onnx' (default) or 'torch'
    onnx_rec_model_path=None,                # Custom recognition model
    use_gpu='auto',
)

# === End-to-End OCR ===
ocr = OpenOCR(
    task='ocr',
    mode='mobile',                           # 'mobile' or 'server'
    backend='onnx',                          # 'onnx' or 'torch'
    onnx_det_model_path=None,                # Custom detection model
    onnx_rec_model_path=None,                # Custom recognition model
    drop_score=0.5,                          # Confidence threshold for filtering
    det_box_type='quad',                     # 'quad' or 'poly' (for curved text)
    use_gpu='auto',
)

# === Universal Recognition (UniRec) ===
unirec = OpenOCR(
    task='unirec',
    unirec_encoder_path=None,                # Custom encoder ONNX model
    unirec_decoder_path=None,                # Custom decoder ONNX model
    tokenizer_mapping_path=None,             # Custom tokenizer mapping JSON
    max_length=2048,                         # Max generation length
    auto_download=True,                      # Auto-download missing models
    use_gpu='auto',
)

# === Document Parsing (OpenDoc) ===
doc = OpenOCR(
    task='doc',
    layout_model_path=None,                  # Custom layout detection model (PP-DocLayoutV2)
    unirec_encoder_path=None,                # Custom UniRec encoder
    unirec_decoder_path=None,                # Custom UniRec decoder
    tokenizer_mapping_path=None,             # Custom tokenizer mapping
    layout_threshold=0.5,                    # Layout detection threshold
    use_layout_detection=True,               # Enable layout analysis
    max_parallel_blocks=4,                   # Max parallel VLM blocks
    auto_download=True,                      # Auto-download missing models
    use_gpu='auto',
)
```
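
Because each task accepts a different set of options, it can help to check keyword arguments before constructing an engine. The sketch below copies the option names from the listing above into a lookup table; `TASK_OPTIONS` and `build_engine_kwargs` are hypothetical helpers, and the table may be incomplete if the library accepts options that the listing does not show.

```python
from typing import Any, Dict

# Option names per task, copied from the configuration listing above.
TASK_OPTIONS = {
    'det': {'backend', 'onnx_det_model_path', 'use_gpu'},
    'rec': {'mode', 'backend', 'onnx_rec_model_path', 'use_gpu'},
    'ocr': {'mode', 'backend', 'onnx_det_model_path', 'onnx_rec_model_path',
            'drop_score', 'det_box_type', 'use_gpu'},
    'unirec': {'unirec_encoder_path', 'unirec_decoder_path',
               'tokenizer_mapping_path', 'max_length', 'auto_download', 'use_gpu'},
    'doc': {'layout_model_path', 'unirec_encoder_path', 'unirec_decoder_path',
            'tokenizer_mapping_path', 'layout_threshold', 'use_layout_detection',
            'max_parallel_blocks', 'auto_download', 'use_gpu'},
}


def build_engine_kwargs(task: str, **options: Any) -> Dict[str, Any]:
    """Return constructor kwargs, rejecting options not listed for the task."""
    if task not in TASK_OPTIONS:
        raise ValueError(f"unknown task: {task!r}")
    unknown = set(options) - TASK_OPTIONS[task]
    if unknown:
        raise ValueError(f"options not listed for task {task!r}: {sorted(unknown)}")
    return {'task': task, **options}


kwargs = build_engine_kwargs('ocr', mode='server', drop_score=0.6)
print(kwargs)
# engine = OpenOCR(**kwargs)  # requires the openocr package
```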

### Task-Specific Usage

#### Text Detection

```python
from openocr import OpenOCR

detector = OpenOCR(task='det', backend='onnx')

# Detect text regions
results = detector(image_path='image.jpg')

boxes = results[0]['boxes']      # np.ndarray of bounding boxes
elapse = results[0]['elapse']    # Processing time in seconds

print(f"Found {len(boxes)} text regions in {elapse:.3f}s")
for box in boxes:
    print(f"  Box: {box.tolist()}")
```
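
Detected boxes are often converted to axis-aligned rectangles before cropping. This sketch assumes each box is a sequence of `(x, y)` points, which is what `box.tolist()` above would give for a quadrilateral or polygon; `quad_to_rect` is a hypothetical helper and the coordinates are made up.

```python
from typing import Sequence, Tuple

Point = Sequence[float]


def quad_to_rect(box: Sequence[Point]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) enclosing all points of the box."""
    xs = [p[0] for p in box]
    ys = [p[1] for p in box]
    return min(xs), min(ys), max(xs), max(ys)


# Synthetic quadrilateral; with real output, pass box.tolist()
quad = [[12, 8], [110, 10], [109, 34], [11, 32]]
x0, y0, x1, y1 = quad_to_rect(quad)
print(f"crop region: left={x0}, top={y0}, right={x1}, bottom={y1}")
```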

#### Text Recognition

```python
from openocr import OpenOCR

# Mobile mode (fast, ONNX)
recognizer = OpenOCR(task='rec', mode='mobile', backend='onnx')

# Server mode (accurate, requires torch)
# recognizer = OpenOCR(task='rec', mode='server', backend='torch')

results = recognizer(image_path='word.jpg', batch_num=1)

text = results[0]['text']        # Recognized text string
score = results[0]['score']      # Confidence score
elapse = results[0]['elapse']    # Processing time

print(f"Text: {text}, Score: {score:.3f}, Time: {elapse:.3f}s")
```

#### End-to-End OCR

```python
from openocr import OpenOCR

ocr = OpenOCR(task='ocr', mode='mobile', backend='onnx')

# Run OCR with visualization
results, time_dicts = ocr(
    image_path='image.jpg',
    save_dir='./output',
    is_visualize=True,
    rec_batch_num=6,
)

# Process results
for result in results:
    for line in result:
        box, (text, confidence) = line[0], line[1]
        print(f"{text} ({confidence:.2f})")
```
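
Whether the lines come back in reading order is not stated here, so a caller that needs plain text may want to sort them explicitly. The sketch below groups lines into rows by their top edge and sorts each row left to right. It assumes the `[box, (text, confidence)]` layout used above with boxes as `(x, y)` points; `to_reading_order` and the `row_tolerance` value are illustrative choices, not part of OpenOCR.

```python
from typing import Any, List, Sequence, Tuple


def _top_left(line: Sequence[Any]) -> Tuple[float, float]:
    """Return (top y, left x) of a line's box, given as (x, y) points."""
    box = line[0]
    return min(p[1] for p in box), min(p[0] for p in box)


def to_reading_order(lines: Sequence[Sequence[Any]], row_tolerance: float = 10) -> List[Any]:
    """Sort lines top to bottom, then left to right within each row."""
    rows: List[List[Any]] = []
    row_top = None
    for line in sorted(lines, key=_top_left):
        top, _ = _top_left(line)
        if row_top is not None and abs(top - row_top) <= row_tolerance:
            rows[-1].append(line)      # same row as the previous line
        else:
            rows.append([line])        # start a new row
            row_top = top
    ordered: List[Any] = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda item: _top_left(item)[1]))
    return ordered


# Synthetic lines (not real OpenOCR output); "world" sits slightly higher than "Hello"
page = [
    [[[60, 0], [100, 0], [100, 12], [60, 12]], ("world", 0.95)],
    [[[0, 30], [70, 30], [70, 42], [0, 42]], ("Second", 0.90)],
    [[[0, 3], [50, 3], [50, 15], [0, 15]], ("Hello", 0.97)],
]
page_text = " ".join(line[1][0] for line in to_reading_order(page))
print(page_text)
```

A fixed pixel tolerance is the simplest grouping rule; for mixed font sizes, a tolerance derived from the box height may work better.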

#### Universal Recognition (UniRec)

```python
from openocr import OpenOCR

unirec = OpenOCR(task='unirec')

# Image input
result_text, generated_ids = unirec(image_path='formula.jpg', max_length=2048)
print(f"Result: {result_text}")

# PDF input (returns list of tuples, one per page)
results = unirec(image_path='document.pdf', max_length=2048)
for page_text, page_ids in results:
    print(f"Page: {page_text[:100]}...")
```
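
Since the call above returns a single `(text, ids)` tuple for an image and a list of such tuples for a PDF, downstream code may prefer one uniform shape. A small sketch, assuming exactly those two return shapes; `unirec_page_texts` is a hypothetical helper and the inputs below are synthetic.

```python
from typing import Any, List


def unirec_page_texts(output: Any) -> List[str]:
    """Return a list of page texts from either return shape described above."""
    if isinstance(output, tuple):
        # Single image: (result_text, generated_ids)
        text, _ids = output
        return [text]
    # PDF: list of (page_text, page_ids) tuples, one per page
    return [page_text for page_text, _page_ids in output]


image_like = ("E = mc^2", [101, 102, 103])
pdf_like = [("Page one text", [1, 2]), ("Page two text", [3, 4])]
print(unirec_page_texts(image_like))
print(unirec_page_texts(pdf_like))
```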

#### Document Parsing (OpenDoc)

```python
from openocr import OpenOCR

doc = OpenOCR(task='doc', use_layout_detection=True)

# Parse a document image
result = doc(image_path='document.jpg')

# Save outputs in multiple formats
doc.save_to_markdown(result, './output')
doc.save_to_json(result, './output')
doc.save_visualization(result, './output')

# Parse a PDF (returns list of dicts, one per page)
results = doc(image_path='document.pdf')
for page_result in results:
    doc.save_to_markdown(page_result, './output')
```
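
If one folder per page is preferred over writing every page to the same directory, the output paths can be prepared up front. This is a sketch only: `page_output_dirs` and the `page_001` naming scheme are hypothetical, and how the save methods name their files inside a directory is not covered here.

```python
import tempfile
from pathlib import Path
from typing import List


def page_output_dirs(base_dir: str, num_pages: int) -> List[Path]:
    """Create and return one sub-directory per page: page_001, page_002, ..."""
    base = Path(base_dir)
    dirs = []
    for index in range(1, num_pages + 1):
        page_dir = base / f"page_{index:03d}"
        page_dir.mkdir(parents=True, exist_ok=True)
        dirs.append(page_dir)
    return dirs


# Demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    created = page_output_dirs(tmp, 3)
    names = [d.name for d in created]
    all_exist = all(d.is_dir() for d in created)
print(names)
```

Each returned path could then be passed to `doc.save_to_markdown(page_result, str(page_dir))` while iterating over the per-page results.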

### Command-Line Interface

```bash
# Text Detection
openocr --task det --input_path image.jpg --is_vis

# Text Recognition
openocr --task rec --input_path word.jpg --mode server --backend torch

# End-to-End OCR
openocr --task ocr --input_path image.jpg --is_vis --output_path ./results

# Universal Recognition
openocr --task unirec --input_path formula.jpg --max_length 2048

# Document Parsing
openocr --task doc --input_path document.pdf \
    --use_layout_detection --save_vis --save_json --save_markdown

# Launch Gradio Demos
openocr --task launch_openocr_demo --share --server_port 7860
openocr --task launch_unirec_demo --share --server_port 7861
openocr --task launch_opendoc_demo --share --server_port 7862
```
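
When the CLI is driven from a script, building the argument list in Python avoids shell quoting issues. The sketch below only assembles flags that appear in the commands above; `build_openocr_command` is a hypothetical helper, and it does not check that a flag is valid for the chosen task.

```python
import shlex
from typing import Dict, List, Optional, Sequence


def build_openocr_command(
    task: str,
    input_path: str,
    flags: Sequence[str] = (),
    options: Optional[Dict[str, object]] = None,
) -> List[str]:
    """Build an argv list such as ['openocr', '--task', 'ocr', ...]."""
    cmd = ['openocr', '--task', task, '--input_path', input_path]
    for flag in flags:                      # boolean switches, e.g. 'is_vis'
        cmd.append(f'--{flag}')
    for name, value in (options or {}).items():
        cmd += [f'--{name}', str(value)]    # valued options, e.g. max_length
    return cmd


cmd = build_openocr_command(
    'doc', 'document.pdf',
    flags=['use_layout_detection', 'save_markdown'],
)
print(shlex.join(cmd))
# import subprocess
# subprocess.run(cmd, check=True)  # requires the openocr CLI on PATH
```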

### Processing Different Sources

#### Image Files

```python
from openocr import OpenOCR

ocr = OpenOCR(task='ocr')

# Single image
results, _ = ocr(image_path=