openocr-skills

ClawSkills 作者 openocr v0.1.4

Extract text from images, documents and scanned PDFs using OpenOCR - supports text detection, recognition, universal VLM recognition, and document parsing with layout analysis

安装 / 下载方式

TotalClaw CLI推荐
totalclaw install clawskills:clawskills~topdu-opencr-skill
cURL直接下载,无需登录
curl -fsSL https://skills.taituai.com/api/skills/clawskills%3Aclawskills~topdu-opencr-skill/file -o topdu-opencr-skill.md
# OpenOCR Skill

## Overview

This skill enables intelligent text extraction, document parsing, and universal recognition using **OpenOCR** - an accurate and efficient general OCR system. It provides a unified interface for text detection, text recognition, end-to-end OCR, VLM-based universal recognition (text/formulas/tables), and document parsing with layout analysis. Supports Chinese, English, and more.

## How to Use

1. Provide the image, scanned document, or PDF
2. Optionally specify the task type (det/rec/ocr/unirec/doc)
3. I'll extract text, formulas, tables, or full document structure

**Example prompts:**
- "Extract all text from this image"
- "Detect text regions in this photo"
- "Recognize the formula in this screenshot"
- "Parse this PDF document with layout analysis"
- "Convert this scanned page to Markdown"

## Domain Knowledge

### OpenOCR Fundamentals

```python
from openocr import OpenOCR

# Initialize with a specific task
engine = OpenOCR(task='ocr')

# Run OCR on an image (callable interface)
results, time_dicts = engine(image_path='image.jpg')

# Results contain detected boxes with recognized text
for result in results:
    for line in result:
        box = line[0]       # Bounding box coordinates
        text = line[1][0]   # Recognized text
        conf = line[1][1]   # Confidence score
        print(f"{text} ({conf:.2f})")
```

### Supported Tasks

```python
# Available task types
tasks = {
    'det':    'Text Detection - detect text regions with bounding boxes',
    'rec':    'Text Recognition - recognize text from cropped images',
    'ocr':    'End-to-End OCR - detection + recognition pipeline',
    'unirec': 'Universal Recognition - VLM-based text/formula/table recognition (0.1B params)',
    'doc':    'Document Parsing - layout analysis + universal recognition (0.1B params)',
}

# Task selection via parameter
det_engine = OpenOCR(task='det')
rec_engine = OpenOCR(task='rec')
ocr_engine = OpenOCR(task='ocr')
unirec_engine = OpenOCR(task='unirec')
doc_engine = OpenOCR(task='doc')
```

### Configuration Options

```python
from openocr import OpenOCR

# === Text Detection ===
detector = OpenOCR(
    task='det',
    backend='onnx',                          # 'onnx' (default) or 'torch'
    onnx_det_model_path=None,                # Custom detection model (auto-downloads if None)
    use_gpu='auto',                          # 'auto', 'true', or 'false'
)

# === Text Recognition ===
recognizer = OpenOCR(
    task='rec',
    mode='mobile',                           # 'mobile' (fast) or 'server' (accurate)
    backend='onnx',                          # 'onnx' (default) or 'torch'
    onnx_rec_model_path=None,                # Custom recognition model
    use_gpu='auto',
)

# === End-to-End OCR ===
ocr = OpenOCR(
    task='ocr',
    mode='mobile',                           # 'mobile' or 'server'
    backend='onnx',                          # 'onnx' or 'torch'
    onnx_det_model_path=None,                # Custom detection model
    onnx_rec_model_path=None,                # Custom recognition model
    drop_score=0.5,                          # Confidence threshold for filtering
    det_box_type='quad',                     # 'quad' or 'poly' (for curved text)
    use_gpu='auto',
)

# === Universal Recognition (UniRec) ===
unirec = OpenOCR(
    task='unirec',
    unirec_encoder_path=None,                # Custom encoder ONNX model
    unirec_decoder_path=None,                # Custom decoder ONNX model
    tokenizer_mapping_path=None,             # Custom tokenizer mapping JSON
    max_length=2048,                         # Max generation length
    auto_download=True,                      # Auto-download missing models
    use_gpu='auto',
)

# === Document Parsing (OpenDoc) ===
doc = OpenOCR(
    task='doc',
    layout_model_path=None,                  # Custom layout detection model (PP-DocLayoutV2)
    unirec_encoder_path=None,                # Custom UniRec encoder
    unirec_decoder_path=None,                # Custom UniRec decoder
    tokenizer_mapping_path=None,             # Custom tokenizer mapping
    layout_threshold=0.5,                    # Layout detection threshold
    use_layout_detection=True,               # Enable layout analysis
    max_parallel_blocks=4,                   # Max parallel VLM blocks
    auto_download=True,                      # Auto-download missing models
    use_gpu='auto',
)
```

### Task-Specific Usage

#### Text Detection
```python
from openocr import OpenOCR

detector = OpenOCR(task='det', backend='onnx')

# Detect text regions
results = detector(image_path='image.jpg')

boxes = results[0]['boxes']      # np.ndarray of bounding boxes
elapse = results[0]['elapse']    # Processing time in seconds

print(f"Found {len(boxes)} text regions in {elapse:.3f}s")
for box in boxes:
    print(f"  Box: {box.tolist()}")
```

#### Text Recognition
```python
from openocr import OpenOCR

# Mobile mode (fast, ONNX)
recognizer = OpenOCR(task='rec', mode='mobile', backend='onnx')

# Server mode (accurate, requires torch)
# recognizer = OpenOCR(task='rec', mode='server', backend='torch')

results = recognizer(image_path='word.jpg', batch_num=1)

text = results[0]['text']        # Recognized text string
score = results[0]['score']      # Confidence score
elapse = results[0]['elapse']    # Processing time

print(f"Text: {text}, Score: {score:.3f}, Time: {elapse:.3f}s")
```

#### End-to-End OCR
```python
from openocr import OpenOCR

ocr = OpenOCR(task='ocr', mode='mobile', backend='onnx')

# Run OCR with visualization
results, time_dicts = ocr(
    image_path='image.jpg',
    save_dir='./output',
    is_visualize=True,
    rec_batch_num=6,
)

# Process results
for result in results:
    for line in result:
        box, (text, confidence) = line[0], line[1]
        print(f"{text} ({confidence:.2f})")
```

#### Universal Recognition (UniRec)
```python
from openocr import OpenOCR

unirec = OpenOCR(task='unirec')

# Image input
result_text, generated_ids = unirec(image_path='formula.jpg', max_length=2048)
print(f"Result: {result_text}")

# PDF input (returns list of tuples, one per page)
results = unirec(image_path='document.pdf', max_length=2048)
for page_text, page_ids in results:
    print(f"Page: {page_text[:100]}...")
```

#### Document Parsing (OpenDoc)
```python
from openocr import OpenOCR

doc = OpenOCR(task='doc', use_layout_detection=True)

# Parse a document image
result = doc(image_path='document.jpg')

# Save outputs in multiple formats
doc.save_to_markdown(result, './output')
doc.save_to_json(result, './output')
doc.save_visualization(result, './output')

# Parse a PDF (returns list of dicts, one per page)
results = doc(image_path='document.pdf')
for page_result in results:
    doc.save_to_markdown(page_result, './output')
```

### Command-Line Interface

```bash
# Text Detection
openocr --task det --input_path image.jpg --is_vis

# Text Recognition
openocr --task rec --input_path word.jpg --mode server --backend torch

# End-to-End OCR
openocr --task ocr --input_path image.jpg --is_vis --output_path ./results

# Universal Recognition
openocr --task unirec --input_path formula.jpg --max_length 2048

# Document Parsing
openocr --task doc --input_path document.pdf \
    --use_layout_detection --save_vis --save_json --save_markdown

# Launch Gradio Demos
openocr --task launch_openocr_demo --share --server_port 7860
openocr --task launch_unirec_demo --share --server_port 7861
openocr --task launch_opendoc_demo --share --server_port 7862
```

### Processing Different Sources

#### Image Files
```python
from openocr import OpenOCR

ocr = OpenOCR(task='ocr')

# Single image
results, _ = ocr(image_path='image.jpg')

# Directory of images
results, _ = ocr(image_path='./images/', save_dir='./output', is_visualize=True)
```

#### PDF Files
```python
from openocr import OpenOCR

# UniRec handles PDFs natively
unirec = OpenOCR(task='unirec')
results = unirec(image_path='document.pdf', max_length=2048)

# OpenDoc han