video-news-downloader

TotalClaw, author: totalclaw

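Automated daily news video downloader with AI subtitle proofreading. It downloads CBS Evening News and BBC News at Ten from YouTube, extracts and proofreads subtitles with DeepSeek, and serves the videos over HTTP with an embedded player. Use it to: (1) set up automated daily news video downloads, (2) download CBS/BBC news with subtitles, (3) proofread subtitle files with AI, (4) create a local video streaming server with a web player, (5) manage cron jobs for scheduled video updates.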

Installation / Download

TotalClaw CLI (recommended)
totalclaw install totalclaw:totalclaw~zlc000190-using-superpowers
cURL (direct download, no login required)
curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~zlc000190-using-superpowers/file -o zlc000190-using-superpowers.md
# Video News Downloader with AI Subtitle Proofreading

Complete workflow for downloading daily news videos, processing subtitles, and serving them via HTTP with web players.

## Overview

This skill automates:
1. **Video Download**: CBS Evening News + BBC News at Ten from YouTube
2. **Subtitle Processing**: Extract auto-captions and convert to VTT format
3. **AI Proofreading**: Use DeepSeek to fix speech recognition errors
4. **HTTP Streaming**: Serve videos with embedded web players
5. **Scheduled Updates**: Daily cron jobs at configurable times

## Quick Start

### 1. Download Latest News

```bash
python3 scripts/video_download.py --cbs --bbc
```

### 2. Proofread Subtitles

```bash
python3 scripts/subtitle_proofreader.py /path/to/subtitle.vtt
```

Or ask DeepSeek directly with a prompt such as:
> "Proofread the subtitle file /path/to/subtitle.vtt"

### 3. Start HTTP Servers

```bash
bash scripts/setup_server.sh
```

### 4. Setup Daily Cron Jobs

```bash
bash scripts/setup_cron.sh
```

## Commands

### Video Download Script

**Download CBS only:**
```bash
python3 scripts/video_download.py --cbs
```

**Download BBC only:**
```bash
python3 scripts/video_download.py --bbc
```

**Download both:**
```bash
python3 scripts/video_download.py --cbs --bbc
```

**With subtitle proofreading:**
```bash
python3 scripts/video_download.py --cbs --bbc --proofread
```

### Subtitle Proofreading

**Proofread single file:**
```bash
python3 scripts/subtitle_proofreader.py <vtt_file_path>
```

**Auto-proofread all news subtitles:**
```bash
python3 scripts/subtitle_proofreader.py --all
```

### Server Management

**Start servers:**
```bash
bash scripts/setup_server.sh start
```

**Check status:**
```bash
bash scripts/setup_server.sh status
```

**Stop servers:**
```bash
bash scripts/setup_server.sh stop
```

## File Structure

```
/workspace/
├── cbs-live-local/
│   ├── cbs_latest.mp4
│   ├── cbs_latest.en.vtt          # Original subtitle
│   ├── cbs_latest.en.vtt-backup   # Backup
│   ├── cbs_latest-corrected.txt   # DeepSeek corrected text
│   └── cbs_latest-corrections.md  # Error list
│
├── bbc-news-live/
│   ├── bbc_news_latest.mp4
│   ├── bbc_news_latest.en.vtt
│   ├── bbc_news_latest.en.vtt-backup
│   ├── bbc_news_latest-corrected.txt
│   └── bbc_news_latest-corrections.md
│
└── temp/                           # Temporary download files
```
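The layout above can be turned into a small lookup, which is handy for checking that a run produced everything it should. This is a minimal sketch, not part of the skill's scripts: the names `SOURCES`, `expected_files`, and `missing_files` are hypothetical, and the `/workspace` root is taken from the tree above.

```python
from pathlib import Path

# Directory and base-name pairs copied from the tree above.
SOURCES = {
    "cbs": ("cbs-live-local", "cbs_latest"),
    "bbc": ("bbc-news-live", "bbc_news_latest"),
}


def expected_files(source: str, workspace: str = "/workspace") -> dict:
    """Return the paths the tree above lists for one news source."""
    directory, base = SOURCES[source]
    root = Path(workspace) / directory
    return {
        "video": root / f"{base}.mp4",
        "subtitle": root / f"{base}.en.vtt",
        "backup": root / f"{base}.en.vtt-backup",
        "corrected": root / f"{base}-corrected.txt",
        "corrections": root / f"{base}-corrections.md",
    }


def missing_files(source: str, workspace: str = "/workspace") -> list:
    """Return the sorted names of expected files that do not exist yet."""
    files = expected_files(source, workspace)
    return sorted(name for name, path in files.items() if not path.exists())
```

For example, `missing_files("cbs")` returning `["corrected", "corrections"]` would suggest the download finished but proofreading has not run yet.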

## HTTP Endpoints

| Endpoint | Description |
|----------|-------------|
| http://IP:8093/ | CBS Evening News player page |
| http://IP:8093/cbs_latest.mp4 | Direct link to the CBS video file |
| http://IP:8095/ | BBC News at Ten player page |
| http://IP:8095/bbc_news_latest.mp4 | Direct link to the BBC video file |
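To illustrate how endpoints like these can be served, here is a minimal sketch built on Python's standard `http.server`. It is not the contents of `scripts/setup_server.sh`; the function names are hypothetical, and it assumes each player page is an `index.html` stored next to the video, which is what `SimpleHTTPRequestHandler` returns for `/`.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_server(directory: str, port: int, host: str = "0.0.0.0") -> ThreadingHTTPServer:
    """Create a static file server rooted at `directory` (not yet running)."""
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    return ThreadingHTTPServer((host, port), handler)


def serve_in_background(directory: str, port: int) -> ThreadingHTTPServer:
    """Start serving on a daemon thread and return the server object."""
    server = make_server(directory, port)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Under these assumptions, `serve_in_background("/workspace/cbs-live-local", 8093)` followed by `make_server("/workspace/bbc-news-live", 8095).serve_forever()` would expose both directories on the ports in the table. Call `shutdown()` on a server object to stop it.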

## Cron Jobs

### Default Schedule (Beijing Time)

| Time | Task |
|------|------|
| 20:00 | Download the latest CBS and BBC videos |
| 20:30 | Proofread subtitles with DeepSeek |
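Because the schedule is given in Beijing time (UTC+8, no daylight saving), the crontab hour depends on the server's clock. The sketch below generates crontab lines for both cases. It is an illustration only: what `scripts/setup_cron.sh` actually installs is not shown here, the function names are hypothetical, and a real crontab entry would normally need absolute paths or a leading `cd` into the skill directory.

```python
from datetime import datetime, timedelta, timezone

BEIJING = timezone(timedelta(hours=8))  # China Standard Time, no DST

# Commands taken from the Commands section above.
DOWNLOAD = "python3 scripts/video_download.py --cbs --bbc"
PROOFREAD = "python3 scripts/subtitle_proofreader.py --all"


def beijing_to_utc(hour: int, minute: int) -> tuple:
    """Convert a Beijing wall-clock time to the matching UTC (hour, minute)."""
    local = datetime(2024, 1, 1, hour, minute, tzinfo=BEIJING)
    utc = local.astimezone(timezone.utc)
    return utc.hour, utc.minute


def cron_line(hour: int, minute: int, command: str, server_is_utc: bool = False) -> str:
    """Build a daily crontab line for a time given in Beijing time."""
    if server_is_utc:
        hour, minute = beijing_to_utc(hour, minute)
    return f"{minute} {hour} * * * {command}"


def default_schedule(server_is_utc: bool = False) -> list:
    """Crontab lines for the default 20:00 download and 20:30 proofread."""
    return [
        cron_line(20, 0, DOWNLOAD, server_is_utc),
        cron_line(20, 30, PROOFREAD, server_is_utc),
    ]
```

On a server whose clock is set to UTC, the 20:00 Beijing download becomes a 12:00 entry.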

### Manual Cron Setup

See [references/cron-setup.md](references/cron-setup.md) for detailed cron configuration.

## DeepSeek Proofreading

### What Gets Fixed

- Speech recognition errors (e.g., "noraster" → "nor'easter")
- Name errors (e.g., "trunk" → "Trump")
- Location name errors
- Professional terminology errors
- Obvious spelling mistakes
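One way to see which fixes of this kind were applied is to compare the text before and after proofreading word by word. The following is a rough sketch using the standard `difflib` module; `list_corrections` is a hypothetical helper, not the logic the skill uses to build its corrections list.

```python
import difflib


def list_corrections(original: str, corrected: str) -> list:
    """Return (before, after) pairs for each run of words that was replaced."""
    before = original.split()
    after = corrected.split()
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            pairs.append((" ".join(before[i1:i2]), " ".join(after[j1:j2])))
    return pairs
```

A word-level diff only reports what changed. Deciding that "trunk" should be "Trump" still requires context, which is why the proofreading itself is delegated to a language model.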

### Output Files

For each subtitle file, the proofreader generates:
1. `.vtt-backup` - Copy of the original subtitle (never modified)
2. `-corrected.txt` - AI-corrected plain text
3. `-corrections.md` - List of corrections made
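Since the corrected output is plain text, the cue text has to be separated from the WebVTT header, timing lines, and inline tags at some point. The sketch below shows one way this could be done; `vtt_to_text` is a hypothetical helper and not the proofreader's actual code. It assumes YouTube-style captions without cue identifiers, where consecutive cues often repeat the same line.

```python
import re

TAG = re.compile(r"<[^>]+>")  # inline tags such as <c> or <00:00:01.000>


def vtt_to_text(vtt: str) -> str:
    """Extract de-duplicated cue text from a WebVTT document."""
    normalized = vtt.replace("\r\n", "\n")
    # The header block (WEBVTT plus optional metadata) ends at the first blank line.
    body = normalized.split("\n\n", 1)[1] if "\n\n" in normalized else ""
    lines = []
    for raw in body.split("\n"):
        line = TAG.sub("", raw).strip()
        if not line or "-->" in line:  # skip blank lines and cue timing lines
            continue
        if lines and lines[-1] == line:  # auto-captions repeat rolling lines
            continue
        lines.append(line)
    return "\n".join(lines)
```

Files that do use numeric cue identifiers would need an extra filter, because this sketch would keep those lines as text.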

## Troubleshooting

### Video Download Fails

- Check that yt-dlp is installed: `yt-dlp --version`
- Check that the YouTube URL is accessible
- Try a manual download with yt-dlp before rerunning the script
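The first check above can also be scripted, for instance as a preflight step before a scheduled run. This is a small sketch with a hypothetical helper name, `tool_version`; it only relies on the `--version` flag shown above.

```python
import shutil
import subprocess


def tool_version(name: str = "yt-dlp"):
    """Return the tool's reported version, or None if it is not on PATH."""
    path = shutil.which(name)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    return result.stdout.strip() or None
```

A `None` result means the executable could not be found, or printed nothing for `--version`.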

### Subtitle Extraction Fails

- Some videos don't have auto-captions
- Run `yt-dlp --list-subs <video-url>` to see which subtitle languages are available

### Server Won't Start

- Check that ports 8093 and 8095 are free: `lsof -i :8093` (repeat for `:8095`)
- Check that Python's `http.server` module is available: `python3 -m http.server --help`
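Where `lsof` is not installed, the port check can be approximated from Python by trying to bind the port. A minimal sketch, with `port_is_free` as a hypothetical helper:

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True
```

For example, `[p for p in (8093, 8095) if not port_is_free(p)]` lists the ports that are already taken.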

### Proofreading Issues

- Ensure the DeepSeek model is available
- Check that the subtitle file exists and is valid VTT (its first line should be `WEBVTT`)
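A quick sanity check for the subtitle file is to confirm that it exists and opens with the `WEBVTT` signature that the WebVTT format requires. The helper below is a hypothetical sketch and only inspects the first line, so it does not validate individual cues.

```python
def looks_like_vtt(path: str) -> bool:
    """Return True if the file exists and begins with a WEBVTT signature line."""
    try:
        # utf-8-sig transparently drops an optional byte order mark.
        with open(path, encoding="utf-8-sig") as handle:
            first = handle.readline().rstrip("\r\n")
    except (OSError, UnicodeDecodeError):
        return False
    return first == "WEBVTT" or first.startswith(("WEBVTT ", "WEBVTT\t"))
```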

## See Also

- [references/workflow.md](references/workflow.md) - Detailed workflow documentation
- [references/cron-setup.md](references/cron-setup.md) - Cron job configuration guide