Spreadsheet & Data Wrangling Master

ClawSkills 作者 1kalin v1.0.0

Complete spreadsheet methodology — data cleanup, transformation, analysis, dashboards, automation, and reporting. Works with CSV, Excel, Google Sheets, or any tabular data. Use when the user needs to clean messy data, build reports, create dashboards, automate recurring spreadsheet tasks, or transform data between formats.

源码 ↗

安装 / 下载方式

TotalClaw CLI推荐
totalclaw install clawskills:1kalin~afrexai-spreadsheet-master
cURL直接下载,无需登录
curl -fsSL https://skills.taituai.com/api/skills/clawskills%3A1kalin~afrexai-spreadsheet-master/file -o afrexai-spreadsheet-master.md
Git 仓库获取源码
git clone https://github.com/openclaw/skills/commit/852e3916b14e69d0d75dd2cae59f2c103415e877
# Spreadsheet & Data Wrangling Master

> Turn messy data into clean insights, automated reports, and decision-ready dashboards. Platform-agnostic — works with CSV, Excel, Google Sheets, or any tabular format.

---

## Phase 1: Data Intake & Assessment

Before touching any data, assess what you have.

### Quick Health Check (score /20)

```yaml
data_intake:
  source: ""           # file path, URL, API, database, manual entry
  format: ""           # CSV, XLSX, TSV, JSON, clipboard paste
  rows: 0
  columns: 0
  file_size: ""
  encoding: ""         # UTF-8, Latin-1, Windows-1252, etc.
  delimiter: ""        # comma, tab, pipe, semicolon

  health_score:        # rate each 0-4, total /20
    completeness: 0    # 4=<1% missing, 3=<5%, 2=<15%, 1=<30%, 0=>30%
    consistency: 0     # 4=uniform types, 3=minor mixed, 2=significant mixed, 1=chaotic, 0=unusable
    accuracy: 0        # 4=verified, 3=plausible, 2=some outliers, 1=many errors, 0=untrustworthy
    freshness: 0       # 4=real-time, 3=<24h, 2=<7d, 1=<30d, 0=stale/unknown
    structure: 0       # 4=tidy (1 row=1 obs), 3=minor reshaping, 2=pivot needed, 1=multi-header, 0=freeform

  issues_found: []     # list every problem before fixing anything
```

### First 10 Questions to Ask

1. How many rows and columns?
2. What does each row represent? (one customer? one transaction? one day?)
3. Are there header rows? Multiple header rows? Merged cells?
4. What are the data types? (dates, currencies, percentages, IDs, free text)
5. How much is missing? Which columns?
6. Are there duplicates? By which key?
7. Is there a unique identifier column?
8. What date format? (MM/DD/YYYY vs DD/MM/YYYY vs YYYY-MM-DD vs mixed)
9. What currency/number format? (1,000.00 vs 1.000,00 vs 1000)
10. Where did this data come from and how often is it updated?

---

## Phase 2: Data Cleaning Decision Tree

### Step-by-Step Cleaning Protocol

```
START
  │
  ├─ Headers → Normalize (lowercase, snake_case, no spaces/special chars)
  │
  ├─ Duplicates?
  │   ├─ Exact duplicates → Remove, keep first
  │   ├─ Near-duplicates → Flag for review (fuzzy match on name + address)
  │   └─ Intentional duplicates → Leave (e.g., multiple orders same customer)
  │
  ├─ Missing Values?
  │   ├─ <5% of column → Fill (mean for numeric, mode for categorical, forward-fill for time series)
  │   ├─ 5-30% → Flag + fill with "UNKNOWN" or interpolate with justification
  │   ├─ >30% → Consider dropping column or flagging as unreliable
  │   └─ Entire row missing key fields → Remove with log
  │
  ├─ Data Types?
  │   ├─ Dates as text → Parse to date (try multiple formats, log failures)
  │   ├─ Numbers as text → Strip currency symbols, commas, whitespace, convert
  │   ├─ IDs/zips with leading zeros → Keep as text (NEVER convert to number)
  │   ├─ Phone numbers → Text, standardize format
  │   ├─ Mixed types in column → Split or coerce with error log
  │   └─ Boolean variants → Map (Yes/No/True/False/1/0/Y/N → consistent)
  │
  ├─ Outliers?
  │   ├─ Calculate IQR: Q1 - 1.5×IQR to Q3 + 1.5×IQR
  │   ├─ Business logic check (negative revenue? age 200? date in 2099?)
  │   ├─ Decide: fix (typo), cap (winsorize), remove, or keep with flag
  │   └─ ALWAYS log which outliers were modified and why
  │
  ├─ Standardization?
  │   ├─ Text case → Consistent (Title Case for names, UPPER for codes)
  │   ├─ Whitespace → Trim leading/trailing, collapse internal
  │   ├─ Categories → Map variants ("US"/"USA"/"United States" → "US")
  │   ├─ Dates → ISO 8601 (YYYY-MM-DD) internally
  │   ├─ Currency → Consistent symbol placement, decimal precision
  │   └─ Phone/email → Validate format
  │
  └─ Structural Issues?
      ├─ Multi-header rows → Flatten to single header
      ├─ Merged cells → Unmerge + fill down
      ├─ Pivot/crosstab → Unpivot to tidy format (1 row = 1 observation)
      ├─ Multiple tables in one sheet → Split to separate sheets/files
      └─ Metadata rows (totals, notes) → Separate from data rows
```

### Cleaning Log Template

```yaml
cleaning_log:
  date: "YYYY-MM-DD"
  source_file: ""
  rows_before: 0
  rows_after: 0
  actions:
    - action: "removed exact duplicates"
      rows_affected: 0
      key_columns: ["email"]
    - action: "filled missing values"
      column: "state"
      method: "mode"
      values_filled: 0
    - action: "removed outliers"
      column: "revenue"
      criteria: "negative values"
      rows_removed: 0
    - action: "standardized dates"
      column: "order_date"
      from_format: "MM/DD/YYYY"
      to_format: "YYYY-MM-DD"
      parse_failures: 0
  notes: ""
```

---

## Phase 3: Transformation Patterns

### 12 Essential Transform Operations

| # | Operation | When to Use | Example |
|---|-----------|-------------|---------|
| 1 | **Filter** | Subset rows by condition | Orders > $1000, Date after 2024-01-01 |
| 2 | **Sort** | Order by column(s) | Revenue descending, then date ascending |
| 3 | **Group + Aggregate** | Summarize by category | Total revenue by region, avg order by customer |
| 4 | **Pivot** | Rows → columns | Monthly columns from date rows |
| 5 | **Unpivot/Melt** | Columns → rows | Month columns back to date rows |
| 6 | **Join/Merge** | Combine datasets | Customer data + order data on customer_id |
| 7 | **Deduplicate** | Remove redundancy | Keep latest record per customer |
| 8 | **Derive** | Calculate new columns | profit = revenue - cost, age = today - birthdate |
| 9 | **Split** | One column → many | "John Smith" → first_name, last_name |
| 10 | **Concatenate** | Many columns → one | city + state + zip → full_address |
| 11 | **Lookup/Map** | Enrich with reference data | state_code → state_name, product_id → category |
| 12 | **Window** | Running calculations | 7-day moving average, rank within group, running total |

### Join Strategy Decision Guide

```
Which join do you need?
│
├─ Need ALL rows from left table → LEFT JOIN
│   (customers who may or may not have orders)
│
├─ Need ONLY matching rows → INNER JOIN
│   (only customers WITH orders)
│
├─ Need ALL rows from both → FULL OUTER JOIN
│   (reconciliation: find mismatches)
│
├─ Need everything NOT in other table → LEFT JOIN + WHERE right IS NULL
│   (customers who NEVER ordered)
│
└─ Need every combination → CROSS JOIN (rare, use carefully)
    (all products × all stores for pricing matrix)

⚠️ ALWAYS check join results:
- Row count: did it explode? (many-to-many join)
- Row count: did it shrink? (keys not matching)
- NULL columns: expected from outer join, unexpected = key mismatch
```

### Formula Reference (Cross-Platform)

| Task | Excel | Google Sheets | Python (pandas) |
|------|-------|---------------|-----------------|
| Lookup | `VLOOKUP`, `XLOOKUP` | `VLOOKUP`, `XLOOKUP` | `df.merge()`, `df.map()` |
| Conditional sum | `SUMIFS` | `SUMIFS` | `df.groupby().sum()` |
| Conditional count | `COUNTIFS` | `COUNTIFS` | `df.groupby().count()` |
| Text split | `TEXTSPLIT`, `LEFT/MID/RIGHT` | `SPLIT` | `df.str.split()` |
| Date diff | `DATEDIF`, math | `DATEDIF` | `(df.col2 - df.col1).dt.days` |
| Running total | `SUM($A$1:A1)` | `SUM($A$1:A1)` | `df.cumsum()` |
| Rank | `RANK.EQ` | `RANK` | `df.rank()` |
| Percent of total | `=A1/SUM($A:$A)` | `=A1/SUM($A:$A)` | `df.col / df.col.sum()` |
| Remove duplicates | Data → Remove Duplicates | Data → Remove Duplicates | `df.drop_duplicates()` |
| Pivot | Pivot Table | Pivot Table | `df.pivot_table()` |

---

## Phase 4: Analysis Frameworks

### Quick Analysis Menu

Pick the analysis that matches the question:

**Descriptive (What happened?)**
- Summary statistics: count, mean, median, min, max, std dev, percentiles
- Frequency distributions: how many of each category?
- Time trends: daily/weekly/monthly aggregates over time
- Cross-tabs: category A × category B breakdown

**Diagnostic (Why did it happen?)**
- Drill-down: which segment drove the change?
- Cohort analysis: behavior by signup month
- Correlation: which variables move together?
- Variance analysis: actual vs budget/forecast, by category