Spreadsheet & Data Wrangling Master
完整的电子表格方法——数据清理、转换、分析、仪表板、自动化和报告。适用于 CSV、Excel、Google 表格或任何表格数据。当用户需要清理混乱的数据、构建报告、创建仪表板、自动执行重复的电子表格任务或在格式之间转换数据时使用。
安装 / 下载方式
TotalClaw CLI推荐
totalclaw install totalclaw:1kalin~afrexai-spreadsheet-mastercURL直接下载,无需登录
curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3A1kalin~afrexai-spreadsheet-master/file -o afrexai-spreadsheet-master.mdGit 仓库获取源码
git clone https://github.com/openclaw/skills/commit/852e3916b14e69d0d75dd2cae59f2c103415e877## 概述(中文)
完整的电子表格方法——数据清理、转换、分析、仪表板、自动化和报告。适用于 CSV、Excel、Google 表格或任何表格数据。当用户需要清理混乱的数据、构建报告、创建仪表板、自动执行重复的电子表格任务或在格式之间转换数据时使用。
## 原文
# Spreadsheet & Data Wrangling Master
> Turn messy data into clean insights, automated reports, and decision-ready dashboards. Platform-agnostic — works with CSV, Excel, Google Sheets, or any tabular format.
---
## Phase 1: Data Intake & Assessment
Before touching any data, assess what you have.
### Quick Health Check (score /20)
```yaml
data_intake:
source: "" # file path, URL, API, database, manual entry
format: "" # CSV, XLSX, TSV, JSON, clipboard paste
rows: 0
columns: 0
file_size: ""
encoding: "" # UTF-8, Latin-1, Windows-1252, etc.
delimiter: "" # comma, tab, pipe, semicolon
health_score: # rate each 0-4, total /20
completeness: 0 # 4=<1% missing, 3=<5%, 2=<15%, 1=<30%, 0=>30%
consistency: 0 # 4=uniform types, 3=minor mixed, 2=significant mixed, 1=chaotic, 0=unusable
accuracy: 0 # 4=verified, 3=plausible, 2=some outliers, 1=many errors, 0=untrustworthy
freshness: 0 # 4=real-time, 3=<24h, 2=<7d, 1=<30d, 0=stale/unknown
structure: 0 # 4=tidy (1 row=1 obs), 3=minor reshaping, 2=pivot needed, 1=multi-header, 0=freeform
issues_found: [] # list every problem before fixing anything
```
### First 10 Questions to Ask
1. How many rows and columns?
2. What does each row represent? (one customer? one transaction? one day?)
3. Are there header rows? Multiple header rows? Merged cells?
4. What are the data types? (dates, currencies, percentages, IDs, free text)
5. How much is missing? Which columns?
6. Are there duplicates? By which key?
7. Is there a unique identifier column?
8. What date format? (MM/DD/YYYY vs DD/MM/YYYY vs YYYY-MM-DD vs mixed)
9. What currency/number format? (1,000.00 vs 1.000,00 vs 1000)
10. Where did this data come from and how often is it updated?
---
## Phase 2: Data Cleaning Decision Tree
### Step-by-Step Cleaning Protocol
```
START
│
├─ Headers → Normalize (lowercase, snake_case, no spaces/special chars)
│
├─ Duplicates?
│ ├─ Exact duplicates → Remove, keep first
│ ├─ Near-duplicates → Flag for review (fuzzy match on name + address)
│ └─ Intentional duplicates → Leave (e.g., multiple orders same customer)
│
├─ Missing Values?
│ ├─ <5% of column → Fill (mean for numeric, mode for categorical, forward-fill for time series)
│ ├─ 5-30% → Flag + fill with "UNKNOWN" or interpolate with justification
│ ├─ >30% → Consider dropping column or flagging as unreliable
│ └─ Entire row missing key fields → Remove with log
│
├─ Data Types?
│ ├─ Dates as text → Parse to date (try multiple formats, log failures)
│ ├─ Numbers as text → Strip currency symbols, commas, whitespace, convert
│ ├─ IDs/zips with leading zeros → Keep as text (NEVER convert to number)
│ ├─ Phone numbers → Text, standardize format
│ ├─ Mixed types in column → Split or coerce with error log
│ └─ Boolean variants → Map (Yes/No/True/False/1/0/Y/N → consistent)
│
├─ Outliers?
│ ├─ Calculate IQR: Q1 - 1.5×IQR to Q3 + 1.5×IQR
│ ├─ Business logic check (negative revenue? age 200? date in 2099?)
│ ├─ Decide: fix (typo), cap (winsorize), remove, or keep with flag
│ └─ ALWAYS log which outliers were modified and why
│
├─ Standardization?
│ ├─ Text case → Consistent (Title Case for names, UPPER for codes)
│ ├─ Whitespace → Trim leading/trailing, collapse internal
│ ├─ Categories → Map variants ("US"/"USA"/"United States" → "US")
│ ├─ Dates → ISO 8601 (YYYY-MM-DD) internally
│ ├─ Currency → Consistent symbol placement, decimal precision
│ └─ Phone/email → Validate format
│
└─ Structural Issues?
├─ Multi-header rows → Flatten to single header
├─ Merged cells → Unmerge + fill down
├─ Pivot/crosstab → Unpivot to tidy format (1 row = 1 observation)
├─ Multiple tables in one sheet → Split to separate sheets/files
└─ Metadata rows (totals, notes) → Separate from data rows
```
### Cleaning Log Template
```yaml
cleaning_log:
date: "YYYY-MM-DD"
source_file: ""
rows_before: 0
rows_after: 0
actions:
- action: "removed exact duplicates"
rows_affected: 0
key_columns: ["email"]
- action: "filled missing values"
column: "state"
method: "mode"
values_filled: 0
- action: "removed outliers"
column: "revenue"
criteria: "negative values"
rows_removed: 0
- action: "standardized dates"
column: "order_date"
from_format: "MM/DD/YYYY"
to_format: "YYYY-MM-DD"
parse_failures: 0
notes: ""
```
---
## Phase 3: Transformation Patterns
### 12 Essential Transform Operations
| # | Operation | When to Use | Example |
|---|-----------|-------------|---------|
| 1 | **Filter** | Subset rows by condition | Orders > $1000, Date after 2024-01-01 |
| 2 | **Sort** | Order by column(s) | Revenue descending, then date ascending |
| 3 | **Group + Aggregate** | Summarize by category | Total revenue by region, avg order by customer |
| 4 | **Pivot** | Rows → columns | Monthly columns from date rows |
| 5 | **Unpivot/Melt** | Columns → rows | Month columns back to date rows |
| 6 | **Join/Merge** | Combine datasets | Customer data + order data on customer_id |
| 7 | **Deduplicate** | Remove redundancy | Keep latest record per customer |
| 8 | **Derive** | Calculate new columns | profit = revenue - cost, age = today - birthdate |
| 9 | **Split** | One column → many | "John Smith" → first_name, last_name |
| 10 | **Concatenate** | Many columns → one | city + state + zip → full_address |
| 11 | **Lookup/Map** | Enrich with reference data | state_code → state_name, product_id → category |
| 12 | **Window** | Running calculations | 7-day moving average, rank within group, running total |
### Join Strategy Decision Guide
```
Which join do you need?
│
├─ Need ALL rows from left table → LEFT JOIN
│ (customers who may or may not have orders)
│
├─ Need ONLY matching rows → INNER JOIN
│ (only customers WITH orders)
│
├─ Need ALL rows from both → FULL OUTER JOIN
│ (reconciliation: find mismatches)
│
├─ Need everything NOT in other table → LEFT JOIN + WHERE right IS NULL
│ (customers who NEVER ordered)
│
└─ Need every combination → CROSS JOIN (rare, use carefully)
(all products × all stores for pricing matrix)
⚠️ ALWAYS check join results:
- Row count: did it explode? (many-to-many join)
- Row count: did it shrink? (keys not matching)
- NULL columns: expected from outer join, unexpected = key mismatch
```
### Formula Reference (Cross-Platform)
| Task | Excel | Google Sheets | Python (pandas) |
|------|-------|---------------|-----------------|
| Lookup | `VLOOKUP`, `XLOOKUP` | `VLOOKUP`, `XLOOKUP` | `df.merge()`, `df.map()` |
| Conditional sum | `SUMIFS` | `SUMIFS` | `df.groupby().sum()` |
| Conditional count | `COUNTIFS` | `COUNTIFS` | `df.groupby().count()` |
| Text split | `TEXTSPLIT`, `LEFT/MID/RIGHT` | `SPLIT` | `df.str.split()` |
| Date diff | `DATEDIF`, math | `DATEDIF` | `(df.col2 - df.col1).dt.days` |
| Running total | `SUM($A$1:A1)` | `SUM($A$1:A1)` | `df.cumsum()` |
| Rank | `RANK.EQ` | `RANK` | `df.rank()` |
| Percent of total | `=A1/SUM($A:$A)` | `=A1/SUM($A:$A)` | `df.col / df.col.sum()` |
| Remove duplicates | Data → Remove Duplicates | Data → Remove Duplicates | `df.drop_duplicates()` |
| Pivot | Pivot Table | Pivot Table | `df.pivot_table()` |
---
## Phase 4: Analysis Frameworks
### Quick Analysis Menu
Pick the analysis that matches the question:
**Descriptive (What happened?)**
- Summary statistics: count, mean, median, min, max, std dev, percentiles
- Frequency distributions: how many of each category?
- Time trends: daily/weekly/monthly aggregates over time
- Cross-tabs: category A × category B breakdown
**Diagnostic (Why did it happen?)**
- Drill-down: which segment drove the change?
- Cohort analys