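RAG Accuracy Optimizer
Optimize the accuracy of RAG (Retrieval-Augmented Generation) systems. Covers database schema design, chunking strategies, retrieval optimization, accuracy testing, and anti-hallucination guardrails. Use when: (1) designing or improving a RAG pipeline, (2) choosing the right chunking strategy, (3) optimizing retrieval precision (hybrid search, reranking, multi-query), (4) evaluating chunk quality or testing accuracy, (5) setting up monitoring and guardrails for RAG in production, (6) choosing between SQL and a vector DB, (7) designing metadata schemas for domain-specific data (insurance, finance, healthcare, e-commerce).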
Installation / Download
TotalClaw CLI (recommended):
`totalclaw install totalclaw:eddieluong~rag-accuracy-optimizer`
cURL (direct download, no login required):
`curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Aeddieluong~rag-accuracy-optimizer/file -o rag-accuracy-optimizer.md`
Git repository (source code):
`git clone https://github.com/openclaw/skills/commit/630b2d267027daf58bebf28b1516846d0eaf2069`
## Original text
# RAG Accuracy Optimizer
A skill for optimizing end-to-end accuracy in RAG systems.
## Workflow Overview
```
Data Design → Chunking → Indexing → Retrieval → Generation → Testing → Monitoring
```
Each step impacts accuracy. Optimize each step in order.
---
## 1. Structured Data Design
### SQL vs Vector DB — When to Use What?
| Criteria | SQL (PostgreSQL, MySQL) | Vector DB (Pinecone, Qdrant, Weaviate) |
|---|---|---|
| Exact facts (price, date, product code) | ✅ Optimal | ❌ Not suitable |
| Semantic search (query meaning) | ❌ Not supported | ✅ Optimal |
| Aggregation (SUM, COUNT, AVG) | ✅ Native | ❌ Not supported |
| Fuzzy matching ("similar to...") | ⚠️ Limited | ✅ Optimal |
| **Hybrid (recommended)** | pgvector for both | Vector DB + SQL metadata store |
**Principle:** Clearly structured data → SQL. Unstructured data requiring semantic understanding → Vector DB. Most production systems need **both**.
### Schema Design Patterns by Domain
**Insurance:**
```
policies(policy_id, product_type, effective_date)
clauses(clause_id, policy_id, clause_number, title, content)
exclusions(exclusion_id, clause_id, description)
-- Vector: embedding for clauses.content + exclusions.description
```
**Finance:**
```
securities(ticker, name, sector, exchange)
reports(report_id, ticker, period, report_type)
sections(section_id, report_id, heading, content)
-- Vector: embedding for sections.content, metadata: ticker + period
```
**Healthcare:**
```
drugs(drug_id, generic_name, brand_name, category)
guidelines(guideline_id, condition, recommendation, evidence_level)
interactions(drug_a_id, drug_b_id, severity, description)
-- Vector: embedding for guidelines.recommendation
```
**E-commerce:**
```
products(product_id, name, category, brand, price)
reviews(review_id, product_id, rating, content)
specs(product_id, attribute, value)
-- Vector: embedding for reviews.content + product description
```
### Metadata Tagging Strategy
Each chunk/document needs at minimum:
```python
metadata = {
    "source": "policy_doc_v2.pdf",   # Origin
    "source_type": "pdf",            # File type
    "domain": "insurance",           # Domain
    "category": "life_insurance",    # Classification
    "entity_id": "POL-2024-001",     # Related entity ID
    "section": "exclusions",         # Section in doc
    "chunk_index": 3,                # Chunk position
    "total_chunks": 12,              # Total chunks in doc
    "created_at": "2024-01-15",      # Creation date
    "version": "2.0",                # Version
    "language": "en",                # Language
}
```
**Metadata principles:**
- Always include `source` for traceability and citation
- `entity_id` enables pre-filtering before search → reduces noise
- `chunk_index` + `total_chunks` enables fetching surrounding context
- Domain-specific fields (clause_number, ticker, drug_id) vary by use case
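As a minimal sketch of how `chunk_index` and `total_chunks` can be used to fetch surrounding context, the helper below returns a matched chunk together with its neighbors. It assumes `chunk_index` is 0-based and that chunks can be looked up by `(source, chunk_index)`; the `store` dictionary and both function names are hypothetical stand-ins for a real database lookup.

```python
def neighbor_indices(chunk_index: int, total_chunks: int, window: int = 1) -> list[int]:
    """Return the indices around `chunk_index`, clipped to the document bounds."""
    start = max(0, chunk_index - window)
    end = min(total_chunks - 1, chunk_index + window)
    return list(range(start, end + 1))


def expand_with_neighbors(hit: dict, store: dict, window: int = 1) -> list[dict]:
    """Return the matched chunk plus its neighbors from the same source, in order."""
    meta = hit["metadata"]
    indices = neighbor_indices(meta["chunk_index"], meta["total_chunks"], window)
    # `store` maps (source, chunk_index) -> chunk; a real system would query its DB here.
    return [store[(meta["source"], i)] for i in indices if (meta["source"], i) in store]


# Illustrative in-memory store with 12 chunks from one document
store = {
    ("policy_doc_v2.pdf", i): {
        "text": f"chunk {i}",
        "metadata": {"source": "policy_doc_v2.pdf", "chunk_index": i, "total_chunks": 12},
    }
    for i in range(12)
}
hit = store[("policy_doc_v2.pdf", 3)]
context = expand_with_neighbors(hit, store)
```

A larger `window` trades more context for more tokens in the prompt.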
### Normalization vs Denormalization
| | Normalized | Denormalized |
|---|---|---|
| Pros | Less duplication, easy to update | Faster queries, fewer JOINs |
| Cons | Requires JOINs, slower | Duplication, harder to sync |
| **Use when** | Source of truth (SQL) | Vector store chunks |
**Recommendation:** Keep the SQL source of truth normalized, and denormalize when creating chunks for the Vector DB. Each chunk should carry enough context that no JOINs are needed at retrieval time.
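One way to picture the denormalization step is the sketch below, which flattens one clause from the insurance schema above into a self-contained chunk. The column names come from that schema; the function name, the text layout, and the sample rows are illustrative assumptions.

```python
def build_clause_chunk(policy: dict, clause: dict, exclusions: list[dict]) -> dict:
    """Flatten one clause and its related rows into a self-contained chunk."""
    lines = [
        f"Policy {policy['policy_id']} ({policy['product_type']}, "
        f"effective {policy['effective_date']})",
        f"Clause {clause['clause_number']}: {clause['title']}",
        clause["content"],
    ]
    if exclusions:
        lines.append("Exclusions:")
        lines.extend(f"- {e['description']}" for e in exclusions)
    return {
        "text": "\n".join(lines),
        "metadata": {
            "entity_id": policy["policy_id"],
            "section": clause["title"],
            "clause_number": clause["clause_number"],
        },
    }


# Made-up rows, as they might come back from the normalized SQL tables
policy = {"policy_id": "POL-2024-001", "product_type": "life", "effective_date": "2024-01-15"}
clause = {
    "clause_number": "4.2",
    "title": "Exclusions",
    "content": "The benefit is not payable in the cases listed below.",
}
exclusions = [{"description": "Death resulting from undisclosed pre-existing conditions"}]
chunk = build_clause_chunk(policy, clause, exclusions)
```

The policy header is repeated in every chunk on purpose: that duplication is what lets a retrieved chunk be read without a JOIN.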
---
## 2. Chunking Strategies
> Detailed code examples: read `references/chunking-patterns.md`
### Choosing the Right Strategy
```
Data has clear structure (clauses, sections)?
  → Semantic chunking (by heading/section)
Long, continuous data (articles, transcripts)?
  → Fixed size + overlap (512 tokens, 10-20% overlap)
Need both overview + detail?
  → Hierarchical chunking (parent-child)
Domain-specific with its own logical units?
  → Domain-specific chunking
```
### Chunk Size Guidelines
| Size | Use case | Trade-off |
|---|---|---|
| 128-256 tokens | FAQ, short definitions | High precision, less context |
| 256-512 tokens | **Recommended default** | Good balance |
| 512-1024 tokens | Complex text, legal docs | More context, potential noise |
| >1024 tokens | Rarely used | Too much noise |
### Semantic Chunking
Split by meaning (section, topic) instead of fixed size:
```python
# Split by markdown headings
# Split by paragraph breaks (\n\n)
# Split by topic change (using NLP or LLM detection)
```
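The first option, splitting by markdown headings, could be sketched as follows. This is a simplified illustration, not the skill's reference implementation: the function name is hypothetical, and it does not handle `#` lines that appear inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> list[dict]:
    """Split a markdown document into one chunk per heading."""
    chunks = []
    current = {"heading": None, "lines": []}

    def flush() -> None:
        # Keep a chunk if it has a heading or any non-blank body text
        if current["heading"] is not None or any(l.strip() for l in current["lines"]):
            chunks.append(
                {"heading": current["heading"], "text": "\n".join(current["lines"]).strip()}
            )

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            current = {"heading": match.group(2).strip(), "lines": []}
        else:
            current["lines"].append(line)
    flush()
    return chunks


doc = "Intro text\n\n# Coverage\nPays on death.\n\n## Exclusions\nNo payout for fraud.\n"
sections = split_by_headings(doc)
```

Text before the first heading is kept as a chunk with `heading` set to `None`, so no content is dropped.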
### Overlap Strategy
- **10-20% overlap** between adjacent chunks
- Ensures information at boundaries is not lost
- Chunk N ends with 1-2 opening sentences of chunk N+1
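A minimal sketch of fixed-size chunking with overlap is shown below. It works on any pre-tokenized list; the usage example uses whitespace-split words as a stand-in for real tokenizer output, which is an assumption made for brevity. The function name and defaults are illustrative.

```python
def chunk_with_overlap(
    tokens: list[str], chunk_size: int = 512, overlap_ratio: float = 0.15
) -> list[list[str]]:
    """Split `tokens` into windows of `chunk_size` that share `overlap_ratio` of their length."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)
    step = max(1, chunk_size - overlap)  # always advance by at least one token
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


tokens = "one two three four five six seven eight nine ten eleven twelve".split()
chunks = chunk_with_overlap(tokens, chunk_size=5, overlap_ratio=0.2)
```

With `chunk_size=5` and a 20% overlap, each window shares its last token with the start of the next one.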
### Hierarchical Chunking (Parent-Child)
```
Document (summary)
└── Section (heading + key points)
    └── Paragraph (details)
```
- Search at paragraph level (most detailed)
- When matched, pull parent section for additional context
- Keep `parent_id` in metadata
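The parent lookup described above might look like the sketch below. It assumes all chunks live in one dictionary keyed by ID and that every paragraph chunk stores `parent_id` in its metadata; the function name and sample data are hypothetical.

```python
def expand_to_parents(matched_ids: list[str], chunks: dict) -> list[dict]:
    """Group matched paragraphs under their parent section, fetching each parent once."""
    grouped: dict[str, dict] = {}
    for chunk_id in matched_ids:
        child = chunks[chunk_id]
        parent_id = child["metadata"]["parent_id"]
        entry = grouped.setdefault(
            parent_id, {"parent": chunks[parent_id]["text"], "matches": []}
        )
        entry["matches"].append(child["text"])
    return list(grouped.values())  # dicts keep insertion order, so match order is preserved


# Illustrative two-level hierarchy: sections and their paragraphs
chunks = {
    "sec-1": {"text": "Exclusions: key points", "metadata": {"parent_id": "doc-1"}},
    "sec-2": {"text": "Claims: key points", "metadata": {"parent_id": "doc-1"}},
    "par-1": {"text": "Fraud voids the policy.", "metadata": {"parent_id": "sec-1"}},
    "par-2": {"text": "War risks are excluded.", "metadata": {"parent_id": "sec-1"}},
    "par-3": {"text": "Claims are filed within 30 days.", "metadata": {"parent_id": "sec-2"}},
}
context = expand_to_parents(["par-2", "par-3", "par-1"], chunks)
```

Grouping by parent avoids sending the same section text to the LLM several times when multiple paragraphs from it match.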
### Domain-Specific Chunking
- **Insurance:** 1 chunk = 1 clause
- **Finance:** 1 chunk = 1 report section, metadata = ticker + period
- **Healthcare:** 1 chunk = 1 guideline/recommendation
- **E-commerce:** 1 chunk = 1 review or 1 product description
- **Legal:** 1 chunk = 1 article/clause/section
### Metadata Enrichment Per Chunk
Each chunk should be enriched with:
- **Summary:** 1-2 sentence content summary (LLM-generated)
- **Keywords:** Key terms (supports BM25)
- **Questions:** 2-3 questions this chunk can answer (hypothetical questions)
- **Entities:** Named entities (product names, codes, dates)
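The enrichment fields above could be assembled roughly as in this sketch. The `generate` callable is a hypothetical placeholder for whatever LLM client is in use, the prompts are illustrative, and the keyword and entity extraction are deliberately naive (word frequency and two regex patterns) rather than a real NLP pipeline.

```python
import re
from collections import Counter
from typing import Callable

STOPWORDS = {"the", "and", "for", "with", "that", "this", "are", "was", "has"}


def extract_keywords(text: str, top_n: int = 5) -> list[str]:
    """Naive keyword extraction: most frequent words of 3+ letters, minus stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if len(w) > 2 and w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]


def enrich_chunk(text: str, generate: Callable[[str], str]) -> dict:
    """Attach summary, questions, keywords, and entities to a chunk."""
    return {
        "text": text,
        "summary": generate(f"Summarize in 1-2 sentences:\n{text}"),
        "questions": generate(
            f"List 2-3 questions this text answers, one per line:\n{text}"
        ).splitlines(),
        "keywords": extract_keywords(text),
        # Assumed entity patterns: codes like POL-2024-001 and ISO dates
        "entities": re.findall(r"\b[A-Z]{2,}-\d{4}-\d+\b|\b\d{4}-\d{2}-\d{2}\b", text),
    }


def fake_generate(prompt: str) -> str:
    """Stand-in for an LLM call so the example runs offline."""
    return "Q1?\nQ2?" if prompt.startswith("List") else "A short summary."


text = "Policy POL-2024-001 takes effect on 2024-01-15. The policy covers accidental death."
enriched = enrich_chunk(text, fake_generate)
```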
---
## 3. Retrieval Optimization
> Detailed code examples: read `references/retrieval-patterns.md`
### Recommended Retrieval Pipeline
```
User Query
→ Query Rewriting (expand/reformulate)
→ Multi-Query Generation (3-5 variants)
→ Metadata Filtering (narrow scope)
→ Hybrid Search (Vector + BM25)
→ Merge & Deduplicate
→ Reranking (top 20 → top 5)
→ Contextual Compression
→ LLM Generation (with citations)
```
### Hybrid Search (Vector + BM25)
- **Vector search:** Find by meaning (semantic similarity)
- **BM25 (keyword):** Find by exact keywords (product names, codes)
- **Combined:** Weighted fusion or Reciprocal Rank Fusion (RRF)
```
final_score = α × vector_score + (1-α) × bm25_score
# α = 0.7 is a good starting point, tune per domain
```
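Both fusion options can be sketched in a few lines. Raw vector and BM25 scores are on different scales, so the weighted version below min-max normalizes each set first; that normalization step is an assumption added here, not something the formula above specifies. The RRF version uses ranks only, with the commonly used constant `k = 60`. Function names and sample scores are illustrative.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so different scorers become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(
    vector_scores: dict[str, float], bm25_scores: dict[str, float], alpha: float = 0.7
) -> list[tuple[str, float]]:
    """final_score = alpha * vector_score + (1 - alpha) * bm25_score, on normalized scores."""
    v = min_max_normalize(vector_scores)
    b = min_max_normalize(bm25_scores)
    fused = {
        doc_id: alpha * v.get(doc_id, 0.0) + (1 - alpha) * b.get(doc_id, 0.0)
        for doc_id in set(v) | set(b)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Sum 1 / (k + rank) over every ranked list a document appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up scores for four documents
vector_scores = {"a": 0.9, "b": 0.8, "c": 0.1}
bm25_scores = {"b": 12.0, "c": 7.0, "d": 2.0}
weighted = weighted_fusion(vector_scores, bm25_scores)
rrf = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "d"]])
```

RRF needs no score normalization and no `α` to tune, which makes it a reasonable default when the two scorers are hard to calibrate against each other.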
### Query Rewriting
Use an LLM to reformulate the user question so that it is specific and self-contained:
```
User: "does insurance pay?"
→ Rewritten: "Under what circumstances does life insurance pay out benefits?"
```
### Multi-Query
From 1 question, generate 3-5 variants → search each variant → merge results:
```
Original: "Which bank has the highest savings rate?"
Query 1: "Compare savings interest rates across banks 2024"
Query 2: "Bank with highest deposit rate currently"
Query 3: "Top banks with best deposit interest rates"
```
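The search-and-merge step can be sketched as below, assuming each hit is a dict with `id` and `score` keys and that the variants have already been generated by an LLM. The `search` callable, the function name, and the scores are hypothetical; when a document is found by several variants, this sketch keeps its best score.

```python
from typing import Callable


def multi_query_search(
    queries: list[str], search: Callable[[str], list[dict]], top_k: int = 20
) -> list[dict]:
    """Run every query variant, deduplicate by document ID, and keep the best score."""
    best: dict[str, dict] = {}
    for query in queries:
        for hit in search(query):
            current = best.get(hit["id"])
            if current is None or hit["score"] > current["score"]:
                best[hit["id"]] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)[:top_k]


# Made-up results standing in for a real retriever
index = {
    "Compare savings interest rates across banks 2024": [
        {"id": "doc1", "score": 0.82},
        {"id": "doc2", "score": 0.75},
    ],
    "Bank with highest deposit rate currently": [
        {"id": "doc2", "score": 0.91},
        {"id": "doc3", "score": 0.60},
    ],
}


def fake_search(query: str) -> list[dict]:
    return index.get(query, [])


merged = multi_query_search(list(index), fake_search, top_k=3)
```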
### Reranking
After retrieval, use a reranking model to re-sort by relevance:
- **Cohere Rerank:** Simple API, highly effective
- **Cross-encoder:** More accurate than bi-encoder, but slower
- **GPT Rerank:** Use an LLM to evaluate relevance (expensive but flexible)
Retrieve top 20 → rerank → take top 3-5 for generation.
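The retrieve-then-rerank flow can be sketched with a pluggable scoring function. In practice `score_fn` would wrap a cross-encoder or a rerank API; the `token_overlap` scorer here is only a toy stand-in so the example runs without a model, and all names are illustrative.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[dict],
    score_fn: Callable[[str, str], float],
    top_n: int = 5,
) -> list[dict]:
    """Score every (query, chunk) pair and keep the `top_n` most relevant chunks."""
    scored = [(score_fn(query, c["text"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]


def token_overlap(query: str, text: str) -> float:
    """Toy relevance score: fraction of query words that appear in the text."""
    q_words = set(query.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0


candidates = [
    {"id": 1, "text": "Premiums are due monthly"},
    {"id": 2, "text": "Life insurance pays out benefits on death"},
    {"id": 3, "text": "Insurance benefits exclude fraud"},
]
top = rerank("life insurance benefits", candidates, token_overlap, top_n=2)
```

Keeping the scorer as a parameter makes it easy to compare rerankers on the same candidate set.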
### Contextual Compression
After reranking, compress each chunk: keep only the part relevant to the question.
```
Original chunk (500 tokens) → Compressed (150 tokens, relevant part only)
```
Reduces noise, saves context window, improves accuracy.
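A very rough extractive version of this idea is sketched below: it keeps only the sentences that share a content word with the question. Production systems typically use an LLM or an embedding filter instead; the word-overlap heuristic, the stopword list, and the function name are all assumptions made for illustration.

```python
import re

STOPWORDS = {"what", "are", "the", "for", "is", "a", "an", "of", "to", "in", "does", "do"}


def compress_chunk(question: str, chunk: str, min_overlap: int = 1) -> str:
    """Keep only the sentences that share at least `min_overlap` content words with the question."""
    q_terms = set(re.findall(r"\w+", question.lower())) - STOPWORDS
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [
        s
        for s in sentences
        if len(q_terms & set(re.findall(r"\w+", s.lower()))) >= min_overlap
    ]
    return " ".join(kept)


chunk = (
    "The policy pays a death benefit. "
    "Suicide within two years is an exclusion. "
    "Premiums are due monthly."
)
compressed = compress_chunk("What are the exclusions for suicide?", chunk)
```

Exact word matching misses inflections (here "exclusions" vs "exclusion"), which is one reason a semantic filter usually works better.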
### Metadata Filtering
Narrow the search space BEFORE vector search:
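```python
# Instead of searching all 1M chunks, restrict the search by metadata first.
# `vector_db` and `query` are placeholders; filter syntax varies by vector store.
metadata_filter = {"domain": "insurance", "product_type": "life"}
# Only search within the ~50K chunks that match the filter
results = vector_db.search(query, filter=metadata_filter, top_k=20)
```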
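To make the filter-then-search order concrete, here is a small in-memory sketch. It is not how a vector database implements filtering internally; the brute-force cosine ranking, the function names, and the toy two-dimensional embeddings are assumptions for illustration only.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def matches(metadata: dict, metadata_filter: dict) -> bool:
    """True when every filter key is present in the metadata with the same value."""
    return all(metadata.get(key) == value for key, value in metadata_filter.items())


def filtered_search(
    query_vec: list[float], chunks: list[dict], metadata_filter: dict, top_k: int = 20
) -> list[dict]:
    """Apply the metadata filter first, then rank only the surviving chunks."""
    candidates = [c for c in chunks if matches(c["metadata"], metadata_filter)]
    candidates.sort(key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
    return candidates[:top_k]


chunks = [
    {"id": "A", "embedding": [1.0, 0.0], "metadata": {"domain": "insurance", "product_type": "life"}},
    {"id": "B", "embedding": [0.9, 0.1], "metadata": {"domain": "finance", "product_type": "bond"}},
    {"id": "C", "embedding": [0.0, 1.0], "metadata": {"domain": "insurance", "product_type": "life"}},
    {"id": "D", "embedding": [0.7, 0.7], "metadata": {"domain": "insurance", "product_type": "health"}},
]
hits = filtered_search([1.0, 0.0], chunks, {"domain": "insurance", "product_type": "life"})
```

Chunk B is semantically close to the query but never scored, because it fails the filter before similarity is computed.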
---
## 4. Accuracy Testing & Monitoring
### Test Suite Design
Create ground truth Q&A pairs:
```json
{
"test_cases": [