azure-doc-ocr

TotalClaw, by totalclaw

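Use Azure Document Intelligence (formerly Form Recognizer) to extract text and structured data from documents. Supports PDFs, images, scanned documents, handwritten text, CJK languages, tables, forms, invoices, receipts, ID documents, business cards, and tax forms. Uses the prebuilt models of REST API v4.0 (2024-11-30) for various document types. Triggers: OCR, text extraction, Azure Document Intelligence, PDF OCR, image OCR, scanned documents, handwriting recognition, CJK text extraction, table extraction, invoice processing, receipt scanning, ID document recognition, document parsing, form extraction, Azure Form Recognizer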

Installation / Download

TotalClaw CLI (recommended)

totalclaw install totalclaw:totalclaw~li-hongmin-azure-doc-ocr

cURL (direct download, no login required)

curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~li-hongmin-azure-doc-ocr/file -o li-hongmin-azure-doc-ocr.md

# Azure Document Intelligence OCR

Extract text and structured data from documents using Azure Document Intelligence REST API.

## Quick Start

### 1. Environment Setup

Set your Azure Document Intelligence credentials:

```bash
export AZURE_DOC_INTEL_ENDPOINT="https://your-resource.cognitiveservices.azure.com"
export AZURE_DOC_INTEL_KEY="your-api-key"
```

### 2. Single File OCR

```bash
# Basic text extraction from PDF
python scripts/ocr_extract.py document.pdf

# Extract with layout (tables, structure)
python scripts/ocr_extract.py document.pdf --model prebuilt-layout --format markdown

# Process invoice
python scripts/ocr_extract.py invoice.pdf --model prebuilt-invoice --format json

# OCR from URL
python scripts/ocr_extract.py --url "https://example.com/document.pdf"

# Save output to file
python scripts/ocr_extract.py document.pdf --output result.txt

# Extract specific pages
python scripts/ocr_extract.py document.pdf --pages 1-3,5
```
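For orientation, the commands above presumably wrap the two-step Document Intelligence v4.0 REST flow: submit an analyze request, then poll the URL returned in the `Operation-Location` header until the operation finishes. The sketch below is not the code of `scripts/ocr_extract.py`; the helper names `build_analyze_request` and `analyze` are hypothetical, and it uses only the Python standard library.

```python
import json
import time
import urllib.parse
import urllib.request

API_VERSION = "2024-11-30"  # REST API v4.0, as stated in the skill description


def build_analyze_request(endpoint, key, model="prebuilt-read",
                          data=None, url_source=None, pages=None):
    """Build (without sending) the POST request that starts an analysis.

    Hypothetical helper: pass either raw file bytes in `data` or a
    document URL in `url_source`, never both.
    """
    if (data is None) == (url_source is None):
        raise ValueError("pass exactly one of data or url_source")
    query = {"api-version": API_VERSION}
    if pages:
        query["pages"] = pages  # e.g. "1-3,5"
    url = (f"{endpoint.rstrip('/')}/documentintelligence/documentModels/"
           f"{model}:analyze?{urllib.parse.urlencode(query)}")
    headers = {"Ocp-Apim-Subscription-Key": key}
    if data is not None:
        headers["Content-Type"] = "application/octet-stream"
        body = data
    else:
        headers["Content-Type"] = "application/json"
        body = json.dumps({"urlSource": url_source}).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def analyze(request, key, timeout=300, interval=2.0):
    """Send the request, then poll the operation URL until it finishes."""
    with urllib.request.urlopen(request) as resp:
        operation_url = resp.headers["Operation-Location"]
    poll = urllib.request.Request(
        operation_url, headers={"Ocp-Apim-Subscription-Key": key})
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with urllib.request.urlopen(poll) as resp:
            result = json.load(resp)
        if result["status"] == "succeeded":
            return result
        if result["status"] == "failed":
            raise RuntimeError(result.get("error"))
        time.sleep(interval)  # still "notStarted" or "running"
    raise TimeoutError("analysis did not finish in time")
```

Building the request separately from sending it keeps the URL and header logic easy to inspect without network access. A call would look like `analyze(build_analyze_request(endpoint, key, data=open("document.pdf", "rb").read()), key)`.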

### 3. Batch Processing

```bash
# Process all documents in a folder
python scripts/batch_ocr.py ./documents/

# Custom output directory and format
python scripts/batch_ocr.py ./documents/ --output-dir ./extracted/ --format markdown

# Use layout model with 8 workers
python scripts/batch_ocr.py ./documents/ --model prebuilt-layout --workers 8

# Filter specific extensions
python scripts/batch_ocr.py ./documents/ --ext .pdf,.png
```
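The batch options above (`--ext`, `--workers`) suggest a filter-then-fan-out pattern. A minimal sketch of that pattern follows, assuming a thread pool and a non-recursive folder scan; how `scripts/batch_ocr.py` actually works may differ, and `find_documents`, `run_batch`, and the `process` callback are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def find_documents(folder, extensions=(".pdf", ".png")):
    """Return files directly inside `folder` whose suffix is in `extensions`."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(p for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix.lower() in wanted)


def run_batch(paths, process, workers=4):
    """Apply `process(path)` to every path on a thread pool.

    Returns (results, errors), two dicts keyed by path, so that one
    failed document does not abort the rest of the batch.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going; report the failure later
                errors[path] = exc
    return results, errors
```

Threads suit this workload because each task spends most of its time waiting on the network. Lowering `workers` is the simplest way to stay under the service's rate limit.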

## Model Selection Guide

| Document Type | Recommended Model | Use Case |
|---------------|-------------------|----------|
| General text | `prebuilt-read` | Pure text extraction, any document |
| Structured docs | `prebuilt-layout` | Tables, forms, paragraphs, figures |
| Invoices | `prebuilt-invoice` | Vendor info, line items, totals |
| Receipts | `prebuilt-receipt` | Merchant, items, totals, dates |
| IDs/Passports | `prebuilt-idDocument` | Identity documents |
| Business cards | `prebuilt-businessCard` | Contact information (may not be available in API version 2024-11-30; verify against your resource) |
| W-2 forms | `prebuilt-tax.us.w2` | US tax documents |
| Insurance cards | `prebuilt-healthInsuranceCard.us` | Health insurance info |

See [references/models.md](references/models.md) for detailed model documentation.

## Supported Input Formats

- **PDF**: `.pdf` (including scanned PDFs)
- **Images**: `.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`
- **URLs**: Direct links to documents

## Output Formats

- **text**: Plain text concatenation of all extracted content
- **markdown**: Structured output with headers and tables (best with layout model)
- **json**: Raw API response with full extraction details
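
To illustrate how the three formats relate, the sketch below derives `text` and `markdown` from the JSON response. It assumes the v4.0 response shape, in which `analyzeResult.content` holds the concatenated text and each entry of `analyzeResult.tables` lists cells with `rowIndex`, `columnIndex`, and `content`. The functions `table_to_markdown` and `render` are hypothetical, and the Markdown produced by the actual scripts may look different.

```python
import json


def table_to_markdown(table):
    """Render one `tables[]` entry as a Markdown table.

    Row 0 is treated as the header row; merged cells (rowSpan/columnSpan)
    are not expanded in this sketch.
    """
    grid = [["" for _ in range(table["columnCount"])]
            for _ in range(table["rowCount"])]
    for cell in table["cells"]:
        text = cell.get("content", "").replace("\n", " ").replace("|", "\\|")
        grid[cell["rowIndex"]][cell["columnIndex"]] = text
    lines = ["| " + " | ".join(row) + " |" for row in grid]
    lines.insert(1, "|" + " --- |" * table["columnCount"])
    return "\n".join(lines)


def render(response, fmt="text"):
    """Convert a finished analyze response to text, markdown, or json."""
    result = response["analyzeResult"]
    if fmt == "json":
        # ensure_ascii=False keeps CJK characters readable in the output
        return json.dumps(response, ensure_ascii=False, indent=2)
    if fmt == "text":
        return result.get("content", "")
    if fmt == "markdown":
        parts = [result.get("content", "")]
        parts += [table_to_markdown(t) for t in result.get("tables", [])]
        return "\n\n".join(parts)
    raise ValueError(f"unknown format: {fmt!r}")
```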

## Features

- **Handwriting Recognition**: Extracts handwritten text alongside printed text
- **CJK Support**: Full support for Chinese, Japanese, and Korean characters
- **Table Extraction**: Preserves table structure (use layout model)
- **Multi-page Processing**: Handles documents with multiple pages
- **Concurrent Processing**: Batch script supports parallel processing
- **URL Input**: Process documents directly from URLs

## Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `AZURE_DOC_INTEL_ENDPOINT` | Yes | Azure Document Intelligence endpoint URL |
| `AZURE_DOC_INTEL_KEY` | Yes | API subscription key |

## Error Handling

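- Invalid credentials: Check that `AZURE_DOC_INTEL_ENDPOINT` and `AZURE_DOC_INTEL_KEY` are set and match your Azure resource
- Unsupported format: Ensure the file extension is one of the supported input formats listed above
- Timeout: Large documents may need longer to process; the wait is capped at 300 s
- Rate limiting: Reduce the number of concurrent workers (`--workers`) for batch processing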
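
Besides lowering the worker count, rate limiting can be absorbed by retrying with backoff. The sketch below shows one common approach and is not taken from the skill's scripts: `call_with_retry` is a hypothetical wrapper that retries HTTP 429 and 5xx responses, honours a numeric `Retry-After` header when present, and re-raises other errors (such as 401 for invalid credentials) immediately.

```python
import time
import urllib.error


def call_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `send()`; retry on HTTP 429 and 5xx with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return send()
        except urllib.error.HTTPError as err:
            retryable = err.code == 429 or 500 <= err.code < 600
            if not retryable or attempt == max_attempts - 1:
                raise  # non-retryable error, or out of attempts
            retry_after = err.headers.get("Retry-After") if err.headers else None
            if retry_after and retry_after.isdigit():
                delay = float(retry_after)          # server-suggested wait
            else:
                delay = base_delay * 2 ** attempt   # 1s, 2s, 4s, ...
            sleep(delay)
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.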

## Examples

### Extract text from scanned PDF

```bash
python scripts/ocr_extract.py scanned_contract.pdf --model prebuilt-read
```

### Process invoices with structured output

```bash
python scripts/ocr_extract.py invoice.pdf --model prebuilt-invoice --format json --output invoice_data.json
```

### Batch process with layout analysis

```bash
python scripts/batch_ocr.py ./reports/ --model prebuilt-layout --format markdown --workers 4
```

### Extract specific pages from large document

```bash
python scripts/ocr_extract.py large_doc.pdf --pages 1,3-5,10 --format text
```
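
The `--pages` value combines single pages and inclusive ranges, separated by commas. As an illustration of that syntax, the sketch below expands a spec into explicit page numbers and rejects malformed input. `parse_pages` is a hypothetical helper; the actual script may simply forward the string to the service.

```python
def parse_pages(spec):
    """Expand a page spec such as "1,3-5,10" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))  # ranges are inclusive
    return sorted(pages)
```

For example, `parse_pages("1,3-5,10")` yields `[1, 3, 4, 5, 10]`.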