Dataset Finder

SkillDB · Author: anisafifi · v0.1.0

Use this skill when users need to search for datasets, download data files, or explore data repositories. Triggers include: requests to "find datasets", "search for data", "download dataset from Kaggle", "get data from Hugging Face", "find ML datasets", or mentions of data repositories like Kaggle, UCI ML Repository, Data.gov, or Hugging Face. Also use for previewing dataset statistics, generating data cards, or discovering datasets for machine learning projects. Requires OpenClawCLI installation from clawhub.ai.

Installation / Download

TotalClaw CLI (recommended)
totalclaw install skilldb:anisafifi~dataset-finder
cURL (direct download, no login required)
curl -fsSL https://skills.taituai.com/api/skills/skilldb%3Aanisafifi~dataset-finder/file -o dataset-finder.md
Git (fetch the source from the repository)
git clone https://github.com/openclaw/skills
git -C skills checkout 44e9c021167052fd61483b7b0da799bd6624b5b8
# Dataset Finder

Search, download, and explore datasets from multiple repositories including Kaggle, Hugging Face, UCI ML Repository, and Data.gov. Preview statistics, generate data cards, and manage datasets for machine learning projects.

⚠️ **Prerequisite:** Install [OpenClawCLI](https://clawhub.ai/) (Windows, macOS)

**Installation:**
```bash
# Standard installation
pip install kaggle datasets pandas huggingface-hub requests beautifulsoup4

# If you encounter permission errors, use a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install kaggle datasets pandas huggingface-hub requests beautifulsoup4
```

**Never use `--break-system-packages`**: it lets pip modify packages owned by the OS package manager, which can break your system's Python installation.

---

## Quick Reference

| Task | Command |
|------|---------|
| Search Kaggle | `python scripts/dataset.py kaggle search "housing prices"` |
| Download Kaggle dataset | `python scripts/dataset.py kaggle download "username/dataset-name"` |
| Search Hugging Face | `python scripts/dataset.py huggingface search "sentiment"` |
| Download HF dataset | `python scripts/dataset.py huggingface download "dataset-name"` |
| Search UCI ML | `python scripts/dataset.py uci search "classification"` |
| Preview dataset | `python scripts/dataset.py preview dataset.csv` |
| Generate data card | `python scripts/dataset.py datacard dataset.csv --output README.md` |
| List local datasets | `python scripts/dataset.py list` |

---

## Core Features

### 1. Multi-Repository Search

Search across multiple data repositories from a single interface.

**Supported Sources:**
- **Kaggle** - ML competitions and community datasets
- **Hugging Face** - NLP, vision, and audio datasets
- **UCI ML Repository** - Classic ML datasets
- **Data.gov** - US government open data
- **Local** - Manage downloaded datasets

### 2. Dataset Download

Download datasets with automatic format detection.

**Supported formats:**
- CSV, TSV
- JSON, JSONL
- Parquet
- Excel (XLSX, XLS)
- ZIP archives
- HDF5
- Feather
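
One plausible way to implement format detection is a lookup from file extension to a pandas reader. The sketch below assumes detection is purely extension-based; `FORMATS`, `detect_format`, and `load_any` are hypothetical names, and the actual logic in `scripts/dataset.py` may differ.

```python
from pathlib import Path

# Hypothetical extension table; the real script may detect formats differently.
FORMATS = {
    ".csv": "csv",
    ".tsv": "tsv",
    ".json": "json",
    ".jsonl": "jsonl",
    ".parquet": "parquet",
    ".xlsx": "excel",
    ".xls": "excel",
    ".zip": "zip",
    ".h5": "hdf5",
    ".hdf5": "hdf5",
    ".feather": "feather",
}


def detect_format(path):
    """Return a format name based on the file extension, or None if unknown."""
    return FORMATS.get(Path(path).suffix.lower())


def load_any(path):
    """Load a tabular file into a DataFrame with the reader for its format."""
    import pandas as pd  # imported lazily so detect_format works without pandas

    fmt = detect_format(path)
    if fmt == "csv":
        return pd.read_csv(path)
    if fmt == "tsv":
        return pd.read_csv(path, sep="\t")
    if fmt == "json":
        return pd.read_json(path)
    if fmt == "jsonl":
        return pd.read_json(path, lines=True)
    if fmt == "parquet":
        return pd.read_parquet(path)  # needs pyarrow or fastparquet
    if fmt == "excel":
        return pd.read_excel(path)  # needs openpyxl (xlsx) or xlrd (xls)
    if fmt == "hdf5":
        return pd.read_hdf(path)  # needs PyTables
    if fmt == "feather":
        return pd.read_feather(path)  # needs pyarrow
    # ZIP archives and unknown extensions need separate handling.
    raise ValueError(f"Cannot load directly: {path}")
```

Matching on the lowercased suffix keeps the lookup case-insensitive, so `Train.CSV` and `train.csv` resolve to the same reader.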

### 3. Dataset Preview

Get quick statistics and insights without loading entire datasets.

**Preview features:**
- Shape (rows × columns)
- Column names and types
- Missing value counts
- Basic statistics (mean, std, min, max)
- Memory usage
- Sample rows
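
A preview like this can be computed without holding the whole file in memory by reading it in chunks. The following is a minimal sketch for CSV input only, assuming pandas is installed; `preview_csv` is a hypothetical helper and not necessarily how `scripts/dataset.py` does it.

```python
import pandas as pd


def preview_csv(source, sample=5, chunksize=100_000):
    """Summarize a CSV chunk by chunk so only one chunk is in memory at a time."""
    rows = 0
    missing = None
    head = None
    for chunk in pd.read_csv(source, chunksize=chunksize):
        if head is None:
            head = chunk.head(sample)  # sample rows come from the first chunk
        rows += len(chunk)
        counts = chunk.isna().sum()  # per-column missing values in this chunk
        missing = counts if missing is None else missing + counts
    if head is None:
        raise ValueError("no data rows found")
    return {
        "shape": (rows, head.shape[1]),
        # dtypes are inferred from the first chunk only, so they are approximate
        "dtypes": head.dtypes.astype(str).to_dict(),
        "missing": missing.astype(int).to_dict(),
        "sample": head,
    }
```

Statistics such as the mean or standard deviation would need running sums accumulated across chunks; they are left out here to keep the sketch short.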

### 4. Data Card Generation

Automatically generate dataset documentation.

**Includes:**
- Dataset description
- Schema information
- Statistics summary
- Usage examples
- License information
- Citation details
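
To make the shape of a data card concrete, here is a small sketch that renders some of the sections listed above as Markdown. The function name, parameters, and section layout are illustrative assumptions; the card produced by the `datacard` command may be organized differently.

```python
def render_data_card(name, description, columns, n_rows,
                     license_name="Unknown", citation=None):
    """Render a Markdown data card.

    `columns` is a list of (column_name, dtype, missing_count) tuples.
    """
    lines = [
        f"# {name}",
        "",
        description,
        "",
        "## Schema",
        "",
        "| Column | Type | Missing |",
        "|--------|------|---------|",
    ]
    for col, dtype, missing in columns:
        lines.append(f"| {col} | {dtype} | {missing} |")
    lines += [
        "",
        "## Statistics",
        "",
        f"- Rows: {n_rows}",
        f"- Columns: {len(columns)}",
        "",
        "## License",
        "",
        license_name,
    ]
    # The citation section is optional and only added when provided.
    if citation:
        lines += ["", "## Citation", "", citation]
    return "\n".join(lines) + "\n"
```

The returned string can be written straight to a file such as `README.md`.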

---

## Repository-Specific Commands

### Kaggle

Search and download datasets from Kaggle.

**Setup:**
1. Get Kaggle API credentials from https://www.kaggle.com/settings
2. Place `kaggle.json` in `~/.kaggle/` (Linux/Mac) or `%USERPROFILE%\.kaggle\` (Windows)
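
Before running any Kaggle command it can help to confirm the credentials are actually discoverable. This sketch checks the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables first and then falls back to `kaggle.json`, which is where the official Kaggle client looks. `find_kaggle_credentials` is a hypothetical helper and not part of the skill.

```python
import json
import os
from pathlib import Path


def find_kaggle_credentials(config_dir=None):
    """Return (username, key) from env vars or kaggle.json, or None if absent."""
    username = os.environ.get("KAGGLE_USERNAME")
    key = os.environ.get("KAGGLE_KEY")
    if username and key:
        return username, key

    # Fall back to the config directory; Path.home() resolves to
    # %USERPROFILE% on Windows and ~ on Linux/Mac.
    base = Path(config_dir or os.environ.get("KAGGLE_CONFIG_DIR")
                or Path.home() / ".kaggle")
    path = base / "kaggle.json"
    if not path.is_file():
        return None

    data = json.loads(path.read_text(encoding="utf-8"))
    if "username" in data and "key" in data:
        return data["username"], data["key"]
    return None
```

On Linux and macOS it is also worth restricting the file with `chmod 600 ~/.kaggle/kaggle.json`, since it contains an API key.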

```bash
# Search datasets
python scripts/dataset.py kaggle search "house prices"

# Search with filters
python scripts/dataset.py kaggle search "NLP" --file-type csv --sort-by hotness

# Download dataset
python scripts/dataset.py kaggle download "zillow/zecon"

# Download specific files
python scripts/dataset.py kaggle download "username/dataset" --file "train.csv"

# List dataset files
python scripts/dataset.py kaggle list "username/dataset-name"
```

**Search options:**
- `--file-type` - Filter by file type (csv, json, etc.)
- `--license` - Filter by license type
- `--sort-by` - Sort by hotness, votes, updated, or relevance
- `--max-results` - Limit number of results

**Output:**
```
1. House Prices - Advanced Regression Techniques
   Owner: zillow/zecon
   Size: 1.5 MB
   Last updated: 2023-06-15
   Downloads: 150,000+
   URL: https://www.kaggle.com/datasets/zillow/zecon

2. Housing Prices Dataset
   Owner: username/housing-data
   Size: 850 KB
   Last updated: 2023-08-20
   Downloads: 50,000+
   URL: https://www.kaggle.com/datasets/username/housing-data
```

### Hugging Face Datasets

Search and download datasets from Hugging Face Hub.

```bash
# Search datasets
python scripts/dataset.py huggingface search "sentiment analysis"

# Search with filters
python scripts/dataset.py huggingface search "NLP" --task text-classification --language en

# Download dataset
python scripts/dataset.py huggingface download "imdb"

# Download specific split
python scripts/dataset.py huggingface download "imdb" --split train

# Download specific configuration
python scripts/dataset.py huggingface download "glue" --config mrpc

# Stream large datasets
python scripts/dataset.py huggingface download "large-dataset" --streaming
```

**Search options:**
- `--task` - Filter by task (text-classification, translation, etc.)
- `--language` - Filter by language code
- `--multimodal` - Include multimodal datasets
- `--benchmark` - Only benchmark datasets
- `--max-results` - Limit results

**Output:**
```
1. IMDB Movie Reviews
   Dataset ID: imdb
   Tasks: sentiment-classification
   Languages: en
   Size: 84.1 MB
   Downloads: 1M+
   URL: https://huggingface.co/datasets/imdb

2. Stanford Sentiment Treebank
   Dataset ID: sst2
   Tasks: sentiment-classification
   Languages: en
   Size: 7.4 MB
   Downloads: 500K+
   URL: https://huggingface.co/datasets/sst2
```
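
For reference, the search and download commands above correspond roughly to calls in the `huggingface_hub` and `datasets` libraries. The sketch below assumes `--split`, `--config`, and `--streaming` map onto the `split`, `name`, and `streaming` arguments of `datasets.load_dataset`; that mapping and the helper names are assumptions about how the script is wired.

```python
def load_kwargs(split=None, config=None, streaming=False):
    """Translate CLI-style flags into keyword arguments for load_dataset."""
    kwargs = {}
    if config:
        kwargs["name"] = config  # load_dataset calls the configuration `name`
    if split:
        kwargs["split"] = split
    if streaming:
        kwargs["streaming"] = True  # iterate lazily instead of downloading everything
    return kwargs


def search_datasets(query, max_results=10):
    """Return dataset IDs matching a free-text query (requires network access)."""
    from huggingface_hub import HfApi

    results = HfApi().list_datasets(search=query, limit=max_results)
    return [info.id for info in results]


def download(dataset_id, **flags):
    """Load a dataset from the Hub (requires network access)."""
    from datasets import load_dataset

    return load_dataset(dataset_id, **load_kwargs(**flags))
```

For example, `download("glue", config="mrpc", split="train")` would request only the training split of the MRPC configuration.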

### UCI ML Repository

Search and download classic ML datasets.

```bash
# Search datasets
python scripts/dataset.py uci search "classification"

# Search by characteristics
python scripts/dataset.py uci search "regression" --min-samples 1000

# Download dataset
python scripts/dataset.py uci download "iris"

# Download with metadata
python scripts/dataset.py uci download "wine-quality" --include-metadata
```

**Search options:**
- `--task-type` - classification, regression, clustering
- `--min-samples` - Minimum number of instances
- `--min-features` - Minimum number of features
- `--data-type` - tabular, text, image, time-series

**Output:**
```
1. Iris Dataset
   ID: iris
   Task: classification
   Samples: 150
   Features: 4
   Classes: 3
   Missing values: No
   URL: https://archive.ics.uci.edu/ml/datasets/iris

2. Wine Quality
   ID: wine-quality
   Task: classification/regression
   Samples: 6497
   Features: 11
   Missing values: No
   URL: https://archive.ics.uci.edu/ml/datasets/wine+quality
```
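
The search options amount to simple predicates over dataset metadata. As an illustration only, the sketch below filters an in-memory catalog built from the two entries in the sample output; `UciEntry`, `CATALOG`, and `filter_entries` are hypothetical, and the script presumably queries the repository rather than a hard-coded list.

```python
from dataclasses import dataclass


@dataclass
class UciEntry:
    id: str
    task: str  # may list several tasks separated by "/"
    samples: int
    features: int


# Values copied from the sample output above.
CATALOG = [
    UciEntry("iris", "classification", 150, 4),
    UciEntry("wine-quality", "classification/regression", 6497, 11),
]


def filter_entries(entries, task_type=None, min_samples=0, min_features=0):
    """Keep entries that match the task type and meet the size thresholds."""
    result = []
    for entry in entries:
        if task_type and task_type not in entry.task.split("/"):
            continue
        if entry.samples < min_samples or entry.features < min_features:
            continue
        result.append(entry)
    return result
```

With this catalog, filtering for `task_type="regression"` and `min_samples=1000` keeps only `wine-quality`.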

### Data.gov

Search US government open data.

```bash
# Search datasets
python scripts/dataset.py datagov search "census"

# Search with organization filter
python scripts/dataset.py datagov search "health" --organization "cdc.gov"

# Search by topic
python scripts/dataset.py datagov search "education" --tags "schools,students"

# Download dataset
python scripts/dataset.py datagov download "dataset-id"
```

**Search options:**
- `--organization` - Filter by publishing organization
- `--tags` - Filter by tags (comma-separated)
- `--format` - Filter by format (csv, json, xml, etc.)
- `--max-results` - Limit results

**Output:**
```
1. 2020 Census Demographic Data
   Organization: census.gov
   Format: CSV
   Size: 125 MB
   Last updated: 2023-01-15
   Tags: census, demographics, population
   URL: https://catalog.data.gov/dataset/...
```
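
The Data.gov catalog is built on CKAN, so a search like the ones above can be expressed as a `package_search` request. This sketch only builds the request URL. The endpoint path, the `fq` field names (`organization`, `tags`, `res_format`), and the mapping from the CLI flags are assumptions based on the standard CKAN action API, and the organization value may need to be the catalog's slug rather than a domain name.

```python
from urllib.parse import urlencode

# Assumed CKAN action endpoint for the Data.gov catalog.
BASE_URL = "https://catalog.data.gov/api/3/action/package_search"


def build_search_url(query, organization=None, tags=None, fmt=None, max_results=10):
    """Build a CKAN package_search URL from CLI-style search options."""
    filters = []
    if organization:
        filters.append(f'organization:"{organization}"')
    # --tags is comma-separated; each tag becomes its own filter clause.
    for tag in (tags or "").split(","):
        if tag.strip():
            filters.append(f'tags:"{tag.strip()}"')
    if fmt:
        filters.append(f'res_format:"{fmt}"')

    params = {"q": query, "rows": max_results}
    if filters:
        params["fq"] = " AND ".join(filters)
    return f"{BASE_URL}?{urlencode(params)}"
```

The resulting URL can be fetched with `requests.get(url).json()`. In the standard CKAN response format the matching datasets are listed under `result.results`.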

---

## Dataset Management

### Preview Datasets

Get quick insights without loading entire datasets.

```bash
# Basic preview
python scripts/dataset.py preview data.csv

# Detailed statistics
python scripts/dataset.py preview data.csv --detailed

# Custom sample size
python scripts/dataset.py preview data.csv --sample 20

# Multiple files
python scripts/dataset.py preview train.csv test.csv
```

**Output:**
```
Dataset: train.csv
Shape: 1000 rows × 15 columns
Size: 2.5 MB
Memory usage: 120 KB

Columns:
  - id (int64): no missing values
  - name (object): 5 missing values
  - age (int64): no missing values
  - income (float64): 12 missing values
  - category (object): no missing values

Numeric columns statistics:
           age       income
count   1000.0       988.0
mean      35.2     65432.1
std       12.5     25000.0
min       18.0     20000.0
max       75.0    150000.0

Categorical columns:
  - category: 5 unique values
  - name: 995 unique values

S