RAG Accuracy Optimizer

ClawSkills · Author: eddieluong · v1.3.0

Optimize accuracy for RAG (Retrieval-Augmented Generation) systems. Covers: DB schema design, chunking strategies, retrieval optimization, accuracy testing, and anti-hallucination safeguards. Use when: (1) designing or improving a RAG pipeline, (2) choosing the right chunking strategy, (3) optimizing retrieval accuracy (hybrid search, reranking, multi-query), (4) evaluating chunk quality or testing accuracy, (5) setting up monitoring & safeguards for RAG production, (6) choosing SQL vs Vector DB, (7) designing metadata schemas for domain-specific data (insurance, finance, healthcare, e-commerce).

Install / Download

TotalClaw CLI (recommended)

totalclaw install clawskills:eddieluong~rag-accuracy-optimizer

cURL (direct download, no login required)

curl -fsSL https://skills.taituai.com/api/skills/clawskills%3Aeddieluong~rag-accuracy-optimizer/file -o rag-accuracy-optimizer.md

Git (get the source from the repository)

git clone https://github.com/openclaw/skills/commit/630b2d267027daf58bebf28b1516846d0eaf2069

# RAG Accuracy Optimizer

A skill for optimizing end-to-end accuracy in RAG systems.

## Workflow Overview

```
Data Design → Chunking → Indexing → Retrieval → Generation → Testing → Monitoring
```

Each step impacts accuracy. Optimize each step in order.

---

## 1. Structured Data Design

### SQL vs Vector DB — When to Use What?

| Criteria | SQL (PostgreSQL, MySQL) | Vector DB (Pinecone, Qdrant, Weaviate) |
|---|---|---|
| Exact facts (price, date, product code) | ✅ Optimal | ❌ Not suitable |
| Semantic search (query meaning) | ❌ Not supported natively (requires an extension such as pgvector) | ✅ Optimal |
| Aggregation (SUM, COUNT, AVG) | ✅ Native | ❌ Not supported |
| Fuzzy matching ("similar to...") | ⚠️ Limited | ✅ Optimal |
| **Hybrid (recommended)** | pgvector for both | Vector DB + SQL metadata store |

**Principle:** Clearly structured data → SQL. Unstructured data requiring semantic understanding → Vector DB. Most production systems need **both**.

### Schema Design Patterns by Domain

**Insurance:**
```
policies(policy_id, product_type, effective_date)
clauses(clause_id, policy_id, clause_number, title, content)
exclusions(exclusion_id, clause_id, description)
-- Vector: embedding for clause.content + exclusion.description
```

**Finance:**
```
securities(ticker, name, sector, exchange)
reports(report_id, ticker, period, report_type)
sections(section_id, report_id, heading, content)
-- Vector: embedding for section.content, metadata: ticker + period
```

**Healthcare:**
```
drugs(drug_id, generic_name, brand_name, category)
guidelines(guideline_id, condition, recommendation, evidence_level)
interactions(drug_a_id, drug_b_id, severity, description)
-- Vector: embedding for guidelines.recommendation
```

**E-commerce:**
```
products(product_id, name, category, brand, price)
reviews(review_id, product_id, rating, content)
specs(product_id, attribute, value)
-- Vector: embedding for review.content + product description
```

### Metadata Tagging Strategy

Each chunk/document needs at minimum:

```python
metadata = {
    "source": "policy_doc_v2.pdf",       # Origin
    "source_type": "pdf",                 # File type
    "domain": "insurance",                # Domain
    "category": "life_insurance",          # Classification
    "entity_id": "POL-2024-001",          # Related entity ID
    "section": "exclusions",              # Section in doc
    "chunk_index": 3,                      # Chunk position
    "total_chunks": 12,                    # Total chunks in doc
    "created_at": "2024-01-15",           # Creation date
    "version": "2.0",                      # Version
    "language": "en"                       # Language
}
```

**Metadata principles:**
- Always include `source` for traceability and citation
- `entity_id` enables pre-filtering before search → reduces noise
- `chunk_index` + `total_chunks` enables fetching surrounding context
- Domain-specific fields (clause_number, ticker, drug_id) vary by use case
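
The two metadata uses above (pre-filtering by `entity_id` and fetching surrounding context) can be sketched as follows. This is a minimal illustration, not part of the skill itself: the helper names `filter_by_entity` and `neighbor_indices` are hypothetical, and it assumes `chunk_index` is 0-based, which the metadata example does not specify.

```python
def filter_by_entity(chunks: list[dict], entity_id: str) -> list[dict]:
    """Keep only chunks whose metadata points at the given entity (pre-filter)."""
    return [c for c in chunks if c["metadata"].get("entity_id") == entity_id]


def neighbor_indices(chunk_index: int, total_chunks: int, window: int = 1) -> list[int]:
    """Indices of the matched chunk plus up to `window` neighbors on each side.

    Assumes chunk_index is 0-based (an assumption, adjust if yours is 1-based).
    """
    start = max(0, chunk_index - window)
    end = min(total_chunks - 1, chunk_index + window)
    return list(range(start, end + 1))


# Example: a hit on chunk 3 of 12 pulls in chunks 2 and 4 as surrounding context.
context_ids = neighbor_indices(chunk_index=3, total_chunks=12)
```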

### Normalization vs Denormalization

| | Normalized | Denormalized |
|---|---|---|
| Pros | Less duplication, easy to update | Faster queries, fewer JOINs |
| Cons | Requires JOINs, slower | Duplication, harder to sync |
| **Use when** | Source of truth (SQL) | Vector store chunks |

**Recommendation:** Keep the SQL source of truth normalized, and denormalize when creating chunks for the Vector DB. Each chunk should carry enough context that no JOINs are needed at retrieval time.
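
As a rough sketch of this recommendation, the snippet below keeps two normalized tables in SQLite and joins them once, at indexing time, to build self-contained chunks. Table and column names follow the insurance schema above; the sample row, the `build_chunks` helper, and the chunk text layout are illustrative assumptions.

```python
import sqlite3

# Normalized source of truth (in-memory database with placeholder data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE policies (policy_id TEXT PRIMARY KEY, product_type TEXT, effective_date TEXT);
CREATE TABLE clauses (
    clause_id TEXT PRIMARY KEY,
    policy_id TEXT REFERENCES policies(policy_id),
    clause_number INTEGER, title TEXT, content TEXT
);
INSERT INTO policies VALUES ('POL-2024-001', 'life', '2024-01-15');
INSERT INTO clauses VALUES ('CL-15', 'POL-2024-001', 15, 'Exclusions', 'Placeholder clause text.');
""")


def build_chunks(conn: sqlite3.Connection) -> list[dict]:
    """JOIN once at indexing time so each chunk is self-contained at retrieval time."""
    rows = conn.execute("""
        SELECT p.policy_id, p.product_type, c.clause_number, c.title, c.content
        FROM clauses AS c
        JOIN policies AS p ON p.policy_id = c.policy_id
        ORDER BY c.clause_number
    """).fetchall()
    chunks = []
    for policy_id, product_type, number, title, content in rows:
        # Denormalized text: policy context is repeated inside every chunk.
        text = f"Policy {policy_id} ({product_type}), clause {number}: {title}\n{content}"
        chunks.append({
            "text": text,
            "metadata": {"entity_id": policy_id, "product_type": product_type,
                         "clause_number": number, "section": title.lower()},
        })
    return chunks


chunks = build_chunks(conn)
```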

---

## 2. Chunking Strategies

> Detailed code examples: read `references/chunking-patterns.md`

### Choosing the Right Strategy

```
Data has clear structure (clauses, sections)?
  → Semantic chunking (by heading/section)

Long, continuous data (articles, transcripts)?
  → Fixed size + overlap (512 tokens, 10-20% overlap)

Need both overview + detail?
  → Hierarchical chunking (parent-child)

Domain-specific with its own logical units?
  → Domain-specific chunking
```

### Chunk Size Guidelines

| Size | Use case | Trade-off |
|---|---|---|
| 128-256 tokens | FAQ, short definitions | High precision, less context |
| 256-512 tokens | **Recommended default** | Good balance |
| 512-1024 tokens | Complex text, legal docs | More context, potential noise |
| >1024 tokens | Rarely used | Too much noise |

### Semantic Chunking

Split by meaning (section, topic) instead of fixed size:

```python
# Split by markdown headings
# Split by paragraph breaks (\n\n)
# Split by topic change (using NLP or LLM detection)
```
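
The first option, splitting by markdown headings, might look like the sketch below. It is a simplified illustration: `split_by_headings` is a hypothetical helper, and it does not handle headings that appear inside fenced code blocks.

```python
import re

# ATX-style markdown heading: one to six '#' characters followed by text.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> list[dict]:
    """Split a markdown document into one chunk per heading."""
    sections = []
    current = {"heading": None, "lines": []}

    def flush():
        # Skip an empty preamble (no heading and no non-blank text).
        if current["heading"] is not None or any(l.strip() for l in current["lines"]):
            sections.append({
                "heading": current["heading"],
                "content": "\n".join(current["lines"]).strip(),
            })

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            current = {"heading": match.group(2).strip(), "lines": []}
        else:
            current["lines"].append(line)
    flush()
    return sections
```

Each returned section can then be stored as one chunk, with `heading` copied into the chunk metadata.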

### Overlap Strategy

- **10-20% overlap** between adjacent chunks
- Ensures information at boundaries is not lost
- Chunk N ends with 1-2 opening sentences of chunk N+1
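
A minimal sketch of fixed-size chunking with overlap is shown below. It works on a pre-tokenized list; the `chunk_with_overlap` name is hypothetical, and splitting on whitespace in the usage line is only a rough stand-in for a real tokenizer.

```python
def chunk_with_overlap(tokens: list[str], chunk_size: int = 512,
                       overlap_ratio: float = 0.15) -> list[list[str]]:
    """Slide a window of `chunk_size` tokens, repeating `overlap_ratio` of it each step."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # always >= 1 because overlap < chunk_size
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


# Whitespace splitting is an approximation of token counts, used here for illustration.
pieces = chunk_with_overlap("one two three four five six seven".split(),
                            chunk_size=4, overlap_ratio=0.25)
```

With these parameters each chunk starts with the last token of the previous one, so a sentence that straddles a boundary appears in both neighbors.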

### Hierarchical Chunking (Parent-Child)

```
Document (summary)
  └── Section (heading + key points)
        └── Paragraph (details)
```

- Search at paragraph level (most detailed)
- When matched, pull parent section for additional context
- Keep `parent_id` in metadata
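
The parent lookup described above can be sketched with two in-memory dictionaries. The store layout, the IDs, and the `expand_with_parents` helper are assumptions made for illustration; in practice the paragraphs would live in the vector store and the sections in a document store.

```python
def expand_with_parents(matched_ids: list[str], paragraphs: dict[str, dict],
                        sections: dict[str, dict]) -> list[dict]:
    """Return the parent section of each matched paragraph, without duplicates."""
    seen = set()
    context = []
    for paragraph_id in matched_ids:
        parent_id = paragraphs[paragraph_id]["parent_id"]
        if parent_id not in seen:  # several paragraphs may share one parent
            seen.add(parent_id)
            context.append(sections[parent_id])
    return context


# Hypothetical data: two paragraphs under the same section, one under another.
paragraphs = {
    "p1": {"text": "Detail A", "parent_id": "s1"},
    "p2": {"text": "Detail B", "parent_id": "s1"},
    "p3": {"text": "Detail C", "parent_id": "s2"},
}
sections = {
    "s1": {"heading": "Exclusions", "text": "Section-level key points."},
    "s2": {"heading": "Claims", "text": "Section-level key points."},
}
parents = expand_with_parents(["p1", "p2", "p3"], paragraphs, sections)
```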

### Domain-Specific Chunking

- **Insurance:** 1 chunk = 1 clause
- **Finance:** 1 chunk = 1 report section, metadata = ticker + period
- **Healthcare:** 1 chunk = 1 guideline/recommendation
- **E-commerce:** 1 chunk = 1 review or 1 product description
- **Legal:** 1 chunk = 1 article/clause/section

### Metadata Enrichment Per Chunk

Each chunk should be enriched with:
- **Summary:** 1-2 sentence content summary (LLM-generated)
- **Keywords:** Key terms (supports BM25)
- **Questions:** 2-3 questions this chunk can answer (hypothetical questions)
- **Entities:** Named entities (product names, codes, dates)

---

## 3. Retrieval Optimization

> Detailed code examples: read `references/retrieval-patterns.md`

### Recommended Retrieval Pipeline

```
User Query
  → Query Rewriting (expand/reformulate)
  → Multi-Query Generation (3-5 variants)
  → Metadata Filtering (narrow scope)
  → Hybrid Search (Vector + BM25)
  → Merge & Deduplicate
  → Reranking (top 20 → top 5)
  → Contextual Compression
  → LLM Generation (with citations)
```

### Hybrid Search (Vector + BM25)

- **Vector search:** Find by meaning (semantic similarity)
- **BM25 (keyword):** Find by exact keywords (product names, codes)
- **Combined:** Weighted fusion or Reciprocal Rank Fusion (RRF)

```
final_score = α × vector_score + (1-α) × bm25_score
# α = 0.7 is a good starting point, tune per domain
```
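
Both fusion options can be sketched in a few lines. Two points here are assumptions rather than part of the formula above: raw vector and BM25 scores live on different scales, so this sketch min-max normalizes each list before applying the weighted formula, and the RRF constant `k = 60` is simply a commonly used default. Function names are hypothetical.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so vector and BM25 scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(vector_scores: dict[str, float], bm25_scores: dict[str, float],
                    alpha: float = 0.7) -> list[tuple[str, float]]:
    """final_score = alpha * vector_score + (1 - alpha) * bm25_score, on normalized scores."""
    v, b = min_max(vector_scores), min_max(bm25_scores)
    fused = {d: alpha * v.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0) for d in set(v) | set(b)}
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Sum 1 / (k + rank) over every ranked list a document appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

RRF only needs rank positions, which makes it a reasonable choice when the two score distributions are hard to normalize.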

### Query Rewriting

Use an LLM to reformulate the user's question for clarity:

```
User: "does insurance pay?"
→ Rewritten: "Under what circumstances does life insurance pay out benefits?"
```

### Multi-Query

From 1 question, generate 3-5 variants → search each variant → merge results:

```
Original: "Which bank has the highest savings rate?"
Query 1: "Compare savings interest rates across banks 2024"
Query 2: "Bank with highest deposit rate currently"
Query 3: "Top banks with best deposit interest rates"
```
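
The "search each variant, then merge" step could be sketched like this. The search backend is passed in as a callable, so nothing here depends on a specific vector DB; `multi_query_search`, `fake_search`, and the canned results are hypothetical stand-ins, and keeping the best score per document is one of several reasonable merge rules.

```python
from typing import Callable

SearchFn = Callable[[str], list[tuple[str, float]]]


def multi_query_search(queries: list[str], search: SearchFn,
                       top_k: int = 20) -> list[tuple[str, float]]:
    """Run every query variant, deduplicate by doc id, keep each doc's best score."""
    best: dict[str, float] = {}
    for query in queries:
        for doc_id, score in search(query):
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    merged = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return merged[:top_k]


# Stand-in for a real retriever: canned results keyed by query text.
fake_results = {
    "Compare savings interest rates across banks 2024": [("doc1", 0.82), ("doc2", 0.75)],
    "Bank with highest deposit rate currently": [("doc2", 0.88), ("doc3", 0.60)],
}


def fake_search(query: str) -> list[tuple[str, float]]:
    return fake_results.get(query, [])


merged = multi_query_search(list(fake_results), fake_search)
```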

### Reranking

After retrieval, use a reranking model to re-sort by relevance:

- **Cohere Rerank:** Simple API, highly effective
- **Cross-encoder:** More accurate than bi-encoder, but slower
- **GPT Rerank:** Use an LLM to evaluate relevance (expensive but flexible)

Retrieve top 20 → rerank → take top 3-5 for generation.
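
The retrieve-then-rerank flow can be sketched with a pluggable scoring function. The `overlap_score` scorer below is a deliberately crude stand-in, not a real reranker; in practice `score_fn` would wrap a cross-encoder or a rerank API. All names are hypothetical.

```python
from typing import Callable


def overlap_score(query: str, text: str) -> float:
    """Toy relevance score: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float] = overlap_score,
           top_n: int = 5) -> list[str]:
    """Score every retrieved candidate against the query and keep the top_n."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep retrieval order
    return [text for _, text in scored[:top_n]]
```

Because the sort is stable, candidates with equal rerank scores keep their original retrieval order.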

### Contextual Compression

After reranking, compress each chunk: keep only the part relevant to the question.

```
Original chunk (500 tokens) → Compressed (150 tokens, relevant part only)
```

Reduces noise, saves context window, improves accuracy.
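
A simple extractive version of this idea keeps only the sentences that share a term with the question. This is a keyword-based sketch under stated assumptions: the stopword list is tiny and illustrative, the sentence splitter is a naive regex, and production systems often use an LLM or an embedding filter for this step instead.

```python
import re

# Small illustrative stopword list, not a complete one.
STOPWORDS = {"the", "a", "an", "is", "are", "does", "do", "for", "of", "to", "in",
             "what", "which", "how"}


def terms(text: str) -> set[str]:
    """Lowercased alphanumeric words, minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def compress(chunk: str, query: str) -> str:
    """Keep only the sentences of `chunk` that share at least one term with `query`."""
    query_terms = terms(query)
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [s for s in sentences if terms(s) & query_terms]
    return " ".join(kept)


chunk = ("The policy covers accidental death. Premiums are due monthly. "
         "Suicide within the first two years is excluded.")
compressed = compress(chunk, "Is suicide excluded?")
```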

### Metadata Filtering

Narrow the search space BEFORE vector search:

```python
# Instead of searching all 1M chunks:
filter = {"domain": "insurance", "product_type": "life"}
# Only search within ~50K relevant chunks
results = vector_db.search(query, filter=filter, top_k=20)
```
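
The `vector_db.search` call above depends on the client in use. To show what pre-filtering does, here is a self-contained, in-memory sketch: it drops non-matching chunks first and ranks only the survivors by cosine similarity. The chunk layout and the `filtered_search` helper are assumptions for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filtered_search(query_vec: list[float], chunks: list[dict],
                    metadata_filter: dict, top_k: int = 20) -> list[dict]:
    """Apply an exact-match metadata filter first, then rank the remainder by similarity."""
    candidates = [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]
    candidates.sort(key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
    return candidates[:top_k]
```

Note that a chunk from another domain is excluded even when its embedding is closer to the query, which is the noise reduction the filter is meant to provide.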

---

## 4. Accuracy Testing & Monitoring

### Test Suite Design

Create ground truth Q&A pairs:

```json
{
    "test_cases": [
        {
            "question": "Does life insurance pay out for suicide?",
            "expected_answer": "No payout within the first 2 years",
            "expected_source": "clause_15_exclusions.pdf",
            "category": "exclusions",