crm-data-cleaner

TotalClaw, by totalclaw

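Deduplicate, normalize, and enrich CRM contacts and companies. Use when the user needs to clean CRM data, find duplicate contacts, standardize phone numbers or email addresses, merge duplicate records, audit data quality, or enrich contacts from external sources such as Clearbit or Apollo. Works with HubSpot, Salesforce, Pipedrive, or any CRM with CSV export. Instruction-only skill: no scripts or code execution. All operations are performed through the CRM platform's API or CSV export/import workflows.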

Install / Download

TotalClaw CLI (recommended)
totalclaw install totalclaw:totalclaw~luigi08001-crm-data-cleaner
cURL (direct download, no login required)
curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~luigi08001-crm-data-cleaner/file -o luigi08001-crm-data-cleaner.md

## Original Text

# CRM Data Cleaner — Dedup, Normalize & Enrich Contacts

## Overview

Clean, accurate CRM data is the foundation of effective sales and marketing operations. Poor data quality reportedly costs businesses an average of $3.1 million annually through wasted time, missed opportunities, and ineffective campaigns. This skill provides frameworks, tool recommendations, and automation strategies for keeping contact and company data clean across the major CRM platforms.

This guide covers the three pillars of CRM data hygiene: **Deduplication** (removing duplicate records), **Normalization** (standardizing data formats), and **Enrichment** (filling missing information with reliable external sources).

## Table of Contents

1. [Understanding Data Quality Issues](#understanding-data-quality-issues)
2. [Deduplication Strategy](#deduplication-strategy)
3. [Data Normalization](#data-normalization)
4. [Data Enrichment](#data-enrichment)
5. [Platform-Specific Implementation](#platform-specific-implementation)
6. [Automation and Monitoring](#automation-and-monitoring)
7. [Maintenance and Governance](#maintenance-and-governance)
8. [Advanced Techniques](#advanced-techniques)

## Understanding Data Quality Issues

### Common CRM Data Problems

**Duplicate Records (30-40% of databases)**
- Multiple entries for same person/company
- Slight variations in name spelling
- Different email addresses for same contact
- Incomplete vs. complete records

**Inconsistent Formatting**
- Phone numbers: (555) 123-4567 vs 555-123-4567 vs +1.555.123.4567
- Company names: "IBM Corp" vs "International Business Machines Corporation"
- Addresses: "St" vs "Street", "CA" vs "California"
- Job titles: "VP Marketing" vs "Vice President of Marketing"

**Missing Information**
- 60% of B2B contacts missing phone numbers
- 40% missing company information
- 35% missing job titles
- 25% missing complete addresses

**Outdated Information**
- Contact changes jobs (20% annually)
- Email addresses become invalid (25% every 2 years)
- Phone numbers change
- Companies reorganize, merge, or rebrand

### Impact on Business Operations

**Sales Productivity Loss**
- 27% of sales time spent on data entry and management
- Missed follow-ups due to duplicate records
- Confusion over primary contact information
- Difficulty identifying decision makers

**Marketing Campaign Inefficiency**
- Increased email bounce rates
- Multiple messages to same recipient
- Poor segmentation due to incomplete data
- Inaccurate reporting and attribution

**Customer Experience Issues**
- Multiple sales reps contacting same prospect
- Conflicting information across touchpoints
- Poor personalization due to incomplete profiles
- Frustration from repeated information requests

### Data Quality Assessment Framework

**Completeness Score**
- Required fields populated: Target 95%
- Optional fields populated: Target 70%
- Critical fields (email, company): Target 98%

**Accuracy Score**
- Valid email format: Target 99%
- Valid phone format: Target 95%
- Verified addresses: Target 90%

**Consistency Score**
- Standardized formatting: Target 95%
- Consistent naming conventions: Target 90%
- Aligned data across systems: Target 95%

**Uniqueness Score**
- Duplicate contact rate: Target <2%
- Duplicate company rate: Target <1%
- Clean merge history: Target 100%
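
The completeness and uniqueness scores above can be estimated from a CSV export with a few lines of Python. This is a minimal sketch, assuming records are plain dictionaries and that `email` is the uniqueness key; the helper names `completeness` and `duplicate_rate` and the sample field names are illustrative and not part of any CRM API.

```python
from collections import Counter

def completeness(records, fields):
    """Share of (record, field) cells that hold a non-blank value."""
    total = len(records) * len(fields)
    if total == 0:
        return 0.0
    filled = sum(
        1
        for record in records
        for field in fields
        if str(record.get(field) or "").strip()
    )
    return filled / total

def duplicate_rate(records, key="email"):
    """Share of keyed records that repeat an earlier normalized key value."""
    values = [str(r.get(key) or "").strip().lower() for r in records]
    values = [v for v in values if v]
    if not values:
        return 0.0
    counts = Counter(values)
    extras = sum(n - 1 for n in counts.values())
    return extras / len(values)

# Hypothetical sample data standing in for a CSV export.
contacts = [
    {"email": "ann@example.com", "company": "Acme", "phone": ""},
    {"email": "ANN@example.com", "company": "Acme", "phone": "555-123-4567"},
    {"email": "bob@example.com", "company": "", "phone": None},
    {"email": "cy@example.com", "company": "Initech", "phone": "555-987-6543"},
]

critical = completeness(contacts, ["email", "company"])
print(f"Critical-field completeness: {critical:.1%} (target 98%)")
print(f"Duplicate contact rate: {duplicate_rate(contacts):.1%} (target <2%)")
```

Lowercasing the key before counting means `ann@example.com` and `ANN@example.com` are treated as the same contact, which is usually what a uniqueness audit wants.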

## Deduplication Strategy

### Types of Duplicates

**Exact Duplicates**
- Identical records with same key fields
- Usually caused by import errors
- Easy to identify and merge automatically

**Near Duplicates**
- Similar but not identical records
- Name variations: "Bob Smith" vs "Robert Smith"
- Email variations: personal vs business emails
- Require fuzzy matching algorithms

**Company Duplicates**
- Same company with different names
- "Apple Inc" vs "Apple Computer" vs "Apple"
- Subsidiary vs parent company confusion
- Domain-based matching challenges

**Household/Account Duplicates**
- Multiple contacts at same company
- Family members at same address
- Different roles but same organization

### Duplicate Detection Methods

#### Primary Key Matching
**Email-Based Matching** (Most Reliable)
```
Match Criteria:
- Exact email match = 100% duplicate probability
- Same email domain + similar names = 85% probability
- Multiple emails for same person = merge candidates
```

**Phone-Based Matching** (Secondary)
```
Match Criteria:
- Exact phone match + similar name = 90% probability
- Same phone, different names = investigate
- Multiple formats of same number = normalize first
```

**Name + Company Matching** (Fuzzy)
```
Match Criteria:
- Exact name + exact company = 95% probability
- Similar name + exact company = 80% probability
- Exact name + similar company = 70% probability
```
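
The three rule tables above can be combined into a single scoring function. The sketch below is one possible reading of those rules, not a reference implementation: "similar" is approximated with `difflib.SequenceMatcher` and an assumed 0.8 cutoff, and the record keys (`name`, `company`, `email`, `phone`) are hypothetical.

```python
from difflib import SequenceMatcher

def text(record, key):
    """Field value as a trimmed, lowercased string ('' when missing)."""
    return str(record.get(key) or "").strip().lower()

def similar(a, b, threshold=0.8):
    """Fuzzy equality; the 0.8 cutoff is an assumption to tune on real data."""
    return bool(a and b) and SequenceMatcher(None, a, b).ratio() >= threshold

def digits(value):
    """Keep only the digits so that formatting differences are ignored."""
    return "".join(ch for ch in value if ch.isdigit())

def match_probability(a, b):
    """Highest probability among the rules that fire, or 0.0 if none do."""
    name_a, name_b = text(a, "name"), text(b, "name")
    comp_a, comp_b = text(a, "company"), text(b, "company")
    mail_a, mail_b = text(a, "email"), text(b, "email")
    phone_a, phone_b = digits(text(a, "phone")), digits(text(b, "phone"))

    exact_name = bool(name_a) and name_a == name_b
    exact_comp = bool(comp_a) and comp_a == comp_b
    near_name = similar(name_a, name_b)
    same_domain = "@" in mail_a and mail_a.split("@")[-1] == mail_b.split("@")[-1]
    same_phone = bool(phone_a) and phone_a == phone_b

    rules = [
        (1.00, bool(mail_a) and mail_a == mail_b),        # exact email
        (0.95, exact_name and exact_comp),                # exact name + company
        (0.90, same_phone and near_name),                 # phone + similar name
        (0.85, same_domain and near_name),                # domain + similar name
        (0.80, near_name and exact_comp),                 # similar name + company
        (0.70, exact_name and similar(comp_a, comp_b)),   # name + similar company
    ]
    return max((p for p, fired in rules if fired), default=0.0)

a = {"name": "Bob Smith", "email": "bob@acme.com", "company": "Acme", "phone": "(555) 123-4567"}
b = {"name": "Bob Smyth", "email": "b.smith@acme.com", "company": "Acme", "phone": "555-123-4567"}
print(match_probability(a, b))
```

Two caveats for real data: the domain rule should skip free-mail domains such as gmail.com, and phone numbers with and without a country code need normalizing before the digit comparison is meaningful.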

#### Advanced Matching Algorithms

**Levenshtein Distance**
- Measures character differences between strings
- Useful for typos and variations
- Example: "Smith" vs "Smyth" = distance of 1
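
For reference, the distance can be computed with the standard two-row dynamic-programming approach. This is a minimal, dependency-free sketch; the function name `levenshtein` is illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]

print(levenshtein("Smith", "Smyth"))  # 1
```

Comparison is case-sensitive, so names are usually lowercased before measuring the distance.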

**Soundex Matching**
- Phonetic matching algorithm
- Groups similar-sounding names
- Example: "Smith", "Smyth", "Smithe" = same soundex
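
A compact version of American Soundex is sketched below to show why these spellings collapse to one code. It assumes ASCII names; letters outside A-Z are treated like vowels, which is a simplification.

```python
_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}

def soundex(name: str) -> str:
    """Four-character American Soundex code, or '' for a name with no letters."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != previous:
            encoded += code
        if ch not in "HW":  # H and W do not break a run of equal codes
            previous = code
    return (encoded + "000")[:4]

for name in ("Smith", "Smyth", "Smithe"):
    print(name, soundex(name))  # each prints S530
```

Because Soundex keeps the first letter verbatim, it will not group names such as "Catherine" and "Katherine"; it works best as one signal alongside edit distance.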

**Token Matching**
- Breaks names into components
- Matches individual parts
- Example: "John Michael Smith" matches "J.M. Smith"
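
One way to implement token matching is to split each name into parts and let a single letter stand in for any given name with the same initial. The rule below (surnames must agree, given names must pair up in order) is an assumption chosen for illustration; real matchers often tolerate missing middle names as well.

```python
import re

def tokens(name: str) -> list[str]:
    """Lowercase name parts; 'J.M. Smith' becomes ['j', 'm', 'smith']."""
    return re.findall(r"[^\W\d_]+", name.lower())

def token_match(a: str, b: str) -> bool:
    """True when surnames agree and each given-name token pairs up in order,
    with a single letter counting as an initial."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb or ta[-1] != tb[-1]:
        return False
    given_a, given_b = ta[:-1], tb[:-1]
    if len(given_a) != len(given_b):
        return False
    for x, y in zip(given_a, given_b):
        if len(x) == 1 or len(y) == 1:
            if x[0] != y[0]:
                return False
        elif x != y:
            return False
    return True

print(token_match("John Michael Smith", "J.M. Smith"))  # True
print(token_match("John Smith", "Jane Smith"))          # False
```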

### Deduplication Workflow

#### Phase 1: Automated Detection

**High-Confidence Matches (90%+ probability)**
- Exact email matches
- Identical phone + similar names
- Same LinkedIn profile URLs
- Automatic flagging for review

**Medium-Confidence Matches (60-89% probability)**
- Similar names + same company
- Name variations + same domain
- Fuzzy phone number matches
- Queue for manual review

**Low-Confidence Matches (40-59% probability)**
- Loose name similarities
- Possible company matches
- Require detailed investigation
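
The three tiers translate directly into a routing step. The sketch below assumes candidate pairs arrive as `(record_id_a, record_id_b, probability)` tuples, which is a hypothetical shape, and it drops anything below the 40% floor.

```python
from collections import defaultdict

TIERS = [  # (minimum probability, tier name, action), checked top-down
    (0.90, "high", "flag for review"),
    (0.60, "medium", "queue for manual review"),
    (0.40, "low", "detailed investigation"),
]

def classify(probability):
    """Return (tier, action) for a match probability between 0 and 1."""
    for minimum, tier, action in TIERS:
        if probability >= minimum:
            return tier, action
    return "none", "ignore"

def build_queues(candidates):
    """Group candidate pairs by tier, strongest matches first within each tier."""
    queues = defaultdict(list)
    for pair in sorted(candidates, key=lambda c: c[2], reverse=True):
        tier, _ = classify(pair[2])
        if tier != "none":
            queues[tier].append(pair)
    return dict(queues)

pairs = [("c1", "c2", 0.95), ("c3", "c4", 0.70), ("c5", "c6", 0.20)]
print(build_queues(pairs))
```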

#### Phase 2: Manual Review Process

**Review Queue Prioritization**
1. High-value accounts (enterprise clients)
2. Active opportunities
3. Recent activity (last 30 days)
4. Marketing qualified leads
5. Bulk import suspects
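
The ordering above can be expressed as a sort key for the review queue. Every field name in this sketch (`account_tier`, `open_opportunities`, `last_activity`, `lifecycle_stage`, `source`) is hypothetical and would need mapping to the properties of the CRM in use.

```python
from datetime import date, timedelta

def review_priority(pair, today=None):
    """Lower numbers are reviewed first, following the five-step ordering."""
    today = today or date.today()
    if pair.get("account_tier") == "enterprise":
        return 1
    if pair.get("open_opportunities", 0) > 0:
        return 2
    last = pair.get("last_activity")
    if last and today - last <= timedelta(days=30):
        return 3
    if pair.get("lifecycle_stage") == "marketingqualifiedlead":
        return 4
    if pair.get("source") == "bulk_import":
        return 5
    return 6  # everything else goes to the back of the queue

queue = [
    {"id": "p1", "source": "bulk_import"},
    {"id": "p2", "account_tier": "enterprise"},
    {"id": "p3", "open_opportunities": 2},
]
reference_day = date(2024, 6, 1)
ordered = sorted(queue, key=lambda p: review_priority(p, today=reference_day))
print([p["id"] for p in ordered])  # ['p2', 'p3', 'p1']
```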

**Review Criteria Checklist**
- [ ] Same person confirmation
- [ ] Most complete record identification
- [ ] Activity history preservation
- [ ] Integration considerations
- [ ] Sales team notifications needed

#### Phase 3: Merge Execution

**Pre-Merge Validation**
- Backup critical data
- Identify master record
- Map fields to preserve
- Note dependencies (campaigns, workflows)

**Field Merge Rules**
- Primary email: Business > Personal > Most recent
- Phone: Mobile > Direct line > Main number
- Address: Most complete > Most recent
- Job title: Most senior > Most recent
- Company: Most complete > Most recent

**Post-Merge Cleanup**
- Update related records
- Refresh reports and lists
- Notify affected team members
- Document merge decisions

### Platform-Specific Deduplication

#### HubSpot Deduplication

**Native Duplicate Management**
- Automatic duplicate detection
- Merge suggestions in contacts view
- Bulk merge capabilities
- Activity history preservation

**Custom Duplicate Rules**
```
Email + Company Domain matching
Name similarity + Phone matching
LinkedIn URL exact matching
Custom property combinations
```

**API-Based Deduplication**
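```python
# Example HubSpot duplicate detection: group contacts by name + company.
# Uses the CRM v3 contacts endpoint with a private app access token,
# because HubSpot no longer accepts `hapikey` API keys.
from collections import defaultdict
import requests

def find_hubspot_duplicates(access_token, batch_size=100):
    url = "https://api.hubapi.com/crm/v3/objects/contacts"
    headers = {"Authorization": f"Bearer {access_token}"}
    params = {"limit": batch_size, "properties": "email,firstname,lastname,company"}
    groups = defaultdict(list)
    while True:
        resp = requests.get(url, headers=headers, params=params, timeout=30)
        resp.raise_for_status()
        page = resp.json()
        for contact in page.get("results", []):
            props = contact["properties"]
            key = tuple((props.get(f) or "").strip().lower() for f in ("firstname", "lastname", "company"))
            if all(key):  # skip contacts missing any part of the key
                groups[key].append(contact["id"])
        after = page.get("paging", {}).get("next", {}).get("after")
        if not after:
            return {k: ids for k, ids in groups.items() if len(ids) > 1}
        params["after"] = after  # cursor for the next page
```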

#### Salesforce Deduplication

**Duplicate Rules Setup**
- Standard duplicate rules (Lead/Contact)
- Custom matching rules
- Automatic alerts vs blocking
- Duplicate job monitoring

**Third-Party Tools**
- Duplicate Check by Plauti
- Cloudingo duplicate management
- DemandTools by Vali