Dinstein Tech News Digest

SkillDB · Author: asterisk622 · v3.15.0

Generate tech news digests with unified source model, quality scoring, and multi-format output. Six-source data collection from RSS feeds, Twitter/X KOLs, GitHub releases, GitHub Trending, Reddit, and web search. Pipeline-based scripts with retry mechanisms and deduplication. Supports Discord, email, and markdown templates.


Installation / Download

TotalClaw CLI (recommended)
totalclaw install skilldb:asterisk622~xiaoding-dinstein-tech-news-digest
cURL (direct download, no login required)
curl -fsSL https://skills.taituai.com/api/skills/skilldb%3Aasterisk622~xiaoding-dinstein-tech-news-digest/file -o xiaoding-dinstein-tech-news-digest.md
Git repository (get the source)
git clone https://github.com/openclaw/skills
git -C skills checkout ffc40e3021fd4867355af54a0e553da5004acc74

# Tech News Digest

Automated tech news digest system with unified data source model, quality scoring pipeline, and template-based output generation.

## Quick Start

1. **Configuration Setup**: Default configs are in `config/defaults/`. Copy to workspace for customization:
   ```bash
   mkdir -p workspace/config
   cp config/defaults/sources.json workspace/config/tech-news-digest-sources.json
   cp config/defaults/topics.json workspace/config/tech-news-digest-topics.json
   ```

2. **Environment Variables**: 
   - `TWITTERAPI_IO_KEY` - twitterapi.io API key (optional, preferred)
   - `X_BEARER_TOKEN` - Twitter/X official API bearer token (optional, fallback)
   - `TAVILY_API_KEY` - Tavily Search API key, alternative to Brave (optional)
   - `WEB_SEARCH_BACKEND` - Web search backend: auto|brave|tavily (optional, default: auto)
   - `BRAVE_API_KEYS` - Brave Search API keys, comma-separated for rotation (optional)
   - `BRAVE_API_KEY` - Single Brave key fallback (optional)
   - `GITHUB_TOKEN` - GitHub personal access token (optional, improves rate limits)

3. **Generate Digest**:
   ```bash
   # Unified pipeline (recommended) — runs all 6 sources in parallel + merge
   python3 scripts/run-pipeline.py \
     --defaults config/defaults \
     --config workspace/config \
     --hours 48 --freshness pd \
     --archive-dir workspace/archive/tech-news-digest/ \
     --output /tmp/td-merged.json --verbose --force
   ```

4. **Use Templates**: Apply Discord, email, or PDF templates to merged output
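
The Brave key variables in step 2 can be read as follows. This is a minimal sketch, not the skill's actual code: the function names `load_brave_keys` and `key_rotation` are hypothetical, and round-robin is only one plausible reading of "rotation".

```python
import itertools
from typing import Iterator, List, Mapping


def load_brave_keys(env: Mapping[str, str]) -> List[str]:
    """Prefer the comma-separated BRAVE_API_KEYS, else fall back to BRAVE_API_KEY."""
    keys = [k.strip() for k in env.get("BRAVE_API_KEYS", "").split(",") if k.strip()]
    if not keys:
        single = env.get("BRAVE_API_KEY", "").strip()
        if single:
            keys = [single]
    return keys


def key_rotation(keys: List[str]) -> Iterator[str]:
    """Assumed strategy: cycle through the keys round-robin, one per request."""
    if not keys:
        raise ValueError("no Brave API key configured")
    return itertools.cycle(keys)
```

Passing `os.environ` as `env` reads the real environment; a plain dict works for testing.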

## Configuration Files

### `sources.json` - Unified Data Sources
```json
{
  "sources": [
    {
      "id": "openai-rss",
      "type": "rss",
      "name": "OpenAI Blog",
      "url": "https://openai.com/blog/rss.xml",
      "enabled": true,
      "priority": true,
      "topics": ["llm", "ai-agent"],
      "note": "Official OpenAI updates"
    },
    {
      "id": "sama-twitter",
      "type": "twitter",
      "name": "Sam Altman",
      "handle": "sama",
      "enabled": true,
      "priority": true,
      "topics": ["llm", "frontier-tech"],
      "note": "OpenAI CEO"
    }
  ]
}
```

### `topics.json` - Enhanced Topic Definitions
```json
{
  "topics": [
    {
      "id": "llm",
      "emoji": "🧠",
      "label": "LLM / Large Models",
      "description": "Large Language Models, foundation models, breakthroughs",
      "search": {
        "queries": ["LLM latest news", "large language model breakthroughs"],
        "must_include": ["LLM", "large language model", "foundation model"],
        "exclude": ["tutorial", "beginner guide"]
      },
      "display": {
        "max_items": 8,
        "style": "detailed"
      }
    }
  ]
}
```

## Scripts Pipeline

### `run-pipeline.py` - Unified Pipeline (Recommended)
```bash
python3 scripts/run-pipeline.py \
  --defaults config/defaults [--config CONFIG_DIR] \
  --hours 48 --freshness pd \
  --archive-dir workspace/archive/tech-news-digest/ \
  --output /tmp/td-merged.json --verbose --force
```
- **Features**: Runs all 6 fetch steps in parallel, then merges + deduplicates + scores
- **Output**: Final merged JSON ready for report generation (~30s total)
- **Metadata**: Saves per-step timing and counts to `*.meta.json`
- **GitHub Auth**: Auto-generates GitHub App token if `$GITHUB_TOKEN` not set
- **Fallback**: If this fails, run individual scripts below

### Individual Scripts (Fallback)

#### `fetch-rss.py` - RSS Feed Fetcher
```bash
python3 scripts/fetch-rss.py [--defaults DIR] [--config DIR] [--hours 48] [--output FILE] [--verbose]
```
- Parallel fetching (10 workers), retry with backoff, feedparser + regex fallback
- Timeout: 30s per feed, ETag/Last-Modified caching
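
"Retry with backoff" can take several forms. The sketch below assumes exponential backoff (the delay doubles after each failure), which the source does not confirm. The helper name `with_retry` and its parameters are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter is injectable so the delay schedule can be checked without actually waiting.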

#### `fetch-twitter.py` - Twitter/X KOL Monitor
```bash
python3 scripts/fetch-twitter.py [--defaults DIR] [--config DIR] [--hours 48] [--output FILE] [--backend auto|official|twitterapiio]
```
- Backend auto-detection: uses twitterapi.io if `TWITTERAPI_IO_KEY` set, else official X API v2 if `X_BEARER_TOKEN` set
- Rate limit handling, engagement metrics, retry with backoff
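
The auto-detection order described above can be sketched as a small pure function. The name `pick_twitter_backend` is hypothetical. Returning `None` when neither key is set is an assumption about how the script behaves without credentials.

```python
from typing import Mapping, Optional


def pick_twitter_backend(env: Mapping[str, str], requested: str = "auto") -> Optional[str]:
    """Resolve the --backend value to a concrete backend name."""
    if requested != "auto":
        return requested  # an explicit choice wins over auto-detection
    if env.get("TWITTERAPI_IO_KEY"):
        return "twitterapiio"  # preferred backend
    if env.get("X_BEARER_TOKEN"):
        return "official"  # fallback: official X API v2
    return None  # assumed: no credentials, so the step is skipped
```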

#### `fetch-web.py` - Web Search Engine
```bash
python3 scripts/fetch-web.py [--defaults DIR] [--config DIR] [--freshness pd] [--output FILE]
```
- Auto-detects Brave API rate limit: paid plans → parallel queries, free → sequential
- Without an API key: generates a search interface for agents

#### `fetch-github.py` - GitHub Releases Monitor
```bash
python3 scripts/fetch-github.py [--defaults DIR] [--config DIR] [--hours 168] [--output FILE]
```
- Parallel fetching (10 workers), 30s timeout
- Auth priority: `$GITHUB_TOKEN` → GitHub App auto-generate → `gh` CLI → unauthenticated (60 req/hr)
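
The auth priority chain can be modelled as an ordered list of token providers, tried in turn. This is a sketch under assumptions: `resolve_token` and `github_app_token` are hypothetical names, and the GitHub App step is left as a stub because the source does not describe how that token is generated. `gh auth token` is the real `gh` subcommand that prints the stored token.

```python
import shutil
import subprocess
from typing import Callable, Mapping, Optional, Sequence, Tuple

Provider = Tuple[str, Callable[[], Optional[str]]]


def token_from_gh_cli() -> Optional[str]:
    """Ask the gh CLI for its stored token; None if gh is missing or logged out."""
    if shutil.which("gh") is None:
        return None
    try:
        result = subprocess.run(
            ["gh", "auth", "token"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError):
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


def github_app_token() -> Optional[str]:
    """Hypothetical stub for the GitHub App auto-generation step."""
    return None


def resolve_token(
    env: Mapping[str, str], providers: Sequence[Provider] = ()
) -> Tuple[Optional[str], str]:
    """Return (token, origin); $GITHUB_TOKEN is always tried first."""
    chain = [("env", lambda: env.get("GITHUB_TOKEN") or None), *providers]
    for name, provider in chain:
        token = provider()
        if token:
            return token, name
    return None, "unauthenticated"  # limited to 60 requests per hour
```

A call mirroring the documented order would be `resolve_token(os.environ, [("github-app", github_app_token), ("gh-cli", token_from_gh_cli)])`.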


#### `fetch-github.py --trending` - GitHub Trending Repos
```bash
python3 scripts/fetch-github.py --trending [--hours 48] [--output FILE] [--verbose]
```
- Searches GitHub API for trending repos across 4 topics (LLM, AI Agent, Crypto, Frontier Tech)
- Quality scoring: base 5 + daily_stars_est / 10, max 15
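
The scoring rule above is small enough to write out. The sketch reads "max 15" as a cap on the total score, which is an interpretation rather than something the source spells out. The function name `trending_score` is hypothetical.

```python
def trending_score(daily_stars_est: float) -> float:
    """base 5 + daily_stars_est / 10, capped at 15 (cap on the total is assumed)."""
    return min(5 + max(daily_stars_est, 0) / 10, 15)
```

Under this reading, a repo gaining an estimated 40 stars per day scores 9.0, and anything at or above 100 stars per day saturates at 15.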

#### `fetch-reddit.py` - Reddit Posts Fetcher
```bash
python3 scripts/fetch-reddit.py [--defaults DIR] [--config DIR] [--hours 48] [--output FILE]
```
- Parallel fetching (4 workers), public JSON API (no auth required)
- 13 subreddits with score filtering


#### `enrich-articles.py` - Article Full-Text Enrichment
```bash
python3 scripts/enrich-articles.py --input merged.json --output enriched.json [--min-score 10] [--max-articles 15] [--verbose]
```
- Fetches full article text for high-scoring articles
- Cloudflare Markdown for Agents (preferred) → HTML extraction (fallback) → Skip (paywalled/social)
- Blog domain whitelist with lower score threshold (≥3)
- Parallel fetching (5 workers, 10s timeout)
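
The selection rule (general threshold, lower threshold for whitelisted blogs, overall cap) might look like the sketch below. The article fields `url` and `score`, the whitelist contents, and the function name are all assumptions made for illustration. The numeric defaults come from the CLI flags and the bullet above.

```python
from typing import Dict, List
from urllib.parse import urlparse

# Hypothetical whitelist; the skill's real list is not shown in this document.
BLOG_WHITELIST = {"blog.example.com"}


def select_for_enrichment(
    articles: List[Dict],
    min_score: float = 10,
    max_articles: int = 15,
    blog_min_score: float = 3,
) -> List[Dict]:
    """Pick the highest-scoring articles that clear their applicable threshold."""
    picked = []
    for art in sorted(articles, key=lambda a: a["score"], reverse=True):
        host = (urlparse(art["url"]).hostname or "").lower()
        threshold = blog_min_score if host in BLOG_WHITELIST else min_score
        if art["score"] >= threshold:
            picked.append(art)
    return picked[:max_articles]
```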

#### `merge-sources.py` - Quality Scoring & Deduplication
```bash
python3 scripts/merge-sources.py --rss FILE --twitter FILE --web FILE --github FILE --reddit FILE
```
- Quality scoring, title similarity dedup (85%), previous digest penalty
- Output: topic-grouped articles sorted by score
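
Title-similarity deduplication at an 85% threshold can be sketched with the standard library. The source does not say which similarity metric `merge-sources.py` uses. `difflib.SequenceMatcher` is swapped in here as a stand-in, and the `title` and `score` fields are assumed.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def _normalize(title: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(title.lower().split())


def dedupe_by_title(articles: List[Dict], threshold: float = 0.85) -> List[Dict]:
    """Keep the highest-scoring article from each group of near-identical titles."""
    kept: List[Dict] = []
    for art in sorted(articles, key=lambda a: a["score"], reverse=True):
        title = _normalize(art["title"])
        is_dup = any(
            SequenceMatcher(None, title, _normalize(k["title"])).ratio() >= threshold
            for k in kept
        )
        if not is_dup:
            kept.append(art)
    return kept
```

Sorting by score first means the copy that survives is always the best-scored one. The pairwise comparison is quadratic, which is acceptable for a few hundred items per run.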

#### `validate-config.py` - Configuration Validator
```bash
python3 scripts/validate-config.py [--defaults DIR] [--config DIR] [--verbose]
```
- JSON schema validation, topic reference checks, duplicate ID detection
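
Two of the checks listed above (duplicate IDs and topic references) are easy to illustrate. This is a sketch of the idea, not the validator's code: `check_config` is a hypothetical name, and JSON schema validation is left out.

```python
from collections import Counter
from typing import Dict, List


def check_config(sources: List[Dict], topics: List[Dict]) -> List[str]:
    """Return human-readable errors; an empty list means both checks passed."""
    errors = []
    for sid, count in Counter(s["id"] for s in sources).items():
        if count > 1:
            errors.append(f"duplicate source id: {sid}")
    known_topics = {t["id"] for t in topics}
    for src in sources:
        for ref in src.get("topics", []):
            if ref not in known_topics:
                errors.append(f"source {src['id']} references unknown topic: {ref}")
    return errors
```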

#### `generate-pdf.py` - PDF Report Generator
```bash
python3 scripts/generate-pdf.py --input report.md --output digest.pdf [--verbose]
```
- Converts markdown digest to styled A4 PDF with Chinese typography (Noto Sans CJK SC)
- Emoji icons, page headers/footers, blue accent theme. Requires `weasyprint`.

#### `sanitize-html.py` - Safe HTML Email Converter
```bash
python3 scripts/sanitize-html.py --input report.md --output email.html [--verbose]
```
- Converts markdown to XSS-safe HTML email with inline CSS
- URL whitelist (http/https only), HTML-escaped text content
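
The two safety rules (scheme whitelist and escaped text) can be shown on a single link. This is a minimal sketch with a hypothetical helper name, `safe_link`. Rendering a rejected link as plain text is an assumption, since the source does not say what the real script does with it.

```python
import html
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def safe_link(text: str, url: str) -> str:
    """Render an anchor tag, or plain escaped text if the URL scheme is not allowed."""
    label = html.escape(text)
    url = url.strip()
    try:
        scheme = urlsplit(url).scheme.lower()
    except ValueError:  # malformed URL, e.g. a broken IPv6 literal
        return label
    if scheme not in ALLOWED_SCHEMES:
        return label  # drops javascript:, data:, and relative URLs
    return f'<a href="{html.escape(url, quote=True)}">{label}</a>'
```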

#### `source-health.py` - Source Health Monitor
```bash
python3 scripts/source-health.py --rss FILE --twitter FILE --github FILE --reddit FILE --web FILE [--verbose]
```
- Tracks per-source success/failure history over 7 days
- Reports unhealthy sources (>50% failure rate)
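
The health rule (failure rate above 50% within a 7-day window) can be sketched as follows. The record shape (`source_id`, `ts`, `ok`) and the function name are hypothetical, since the script's actual history format is not shown here.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple


def unhealthy_sources(
    history: Iterable[Dict],
    now: datetime,
    window_days: int = 7,
    max_failure_rate: float = 0.5,
) -> List[str]:
    """Return IDs of sources whose recent failure rate exceeds the limit."""
    cutoff = now - timedelta(days=window_days)
    counts: Dict[str, Tuple[int, int]] = {}  # source_id -> (total, failed)
    for rec in history:
        if rec["ts"] < cutoff:
            continue  # outside the window
        total, failed = counts.get(rec["source_id"], (0, 0))
        counts[rec["source_id"]] = (total + 1, failed + (0 if rec["ok"] else 1))
    return sorted(
        sid for sid, (total, failed) in counts.items()
        if failed / total > max_failure_rate
    )
```

The comparison is strict, so a source that fails exactly half the time is still reported as healthy, matching the ">50%" wording.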

#### `summarize-merged.py` - Merged Data Summary
```bash
python3 scripts/summarize-merged.py --input merged.json [--top N] [--topic TOPIC]
```
- Human-readable summary of merged data for LLM consumption
- Shows top articles per topic with scores and metrics

## User Customization

### Workspace Configuration Override
Place custom configs in `workspace/config/` to override defaults:

- **Sources**: Append new sources, disable defaults with `"enabled": false`
- **Topics**: Override topic definitions, search queries, display settings
- **Merge Logic**: 
  - Sources with same `id` → user version takes precedence
  - Sources with new `id` → appended to defaults
  - Topics with same `id` → user version completely replaces default

### Example Workspace Override
`workspace/config/tech-news-digest-sources.json`:
```json
{
  "sources": [
    {
      "id": "simonwillison-rss",
      "enabled": false,
      "note": "Disabled: too noisy for my use case"
    },
    {
      "id": "my-c