voiceai-creator-voiceover-pipeline

TotalClaw 作者 totalclaw v0.2.1

使用 Voice.ai TTS 将脚本转换为可发布的配音，包括片段、章节、字幕和视频混合。

安装 / 下载方式

TotalClaw CLI推荐

totalclaw install totalclaw:totalclaw~gizmogremlin-voiceai-voiceover-creator

cURL直接下载，无需登录

curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~gizmogremlin-voiceai-voiceover-creator/file -o gizmogremlin-voiceai-voiceover-creator.md

## 概述（中文）

使用 Voice.ai TTS 将脚本转换为可发布的配音，包括片段、章节、字幕和视频混合。

## 原文

# Voice.ai Creator Voiceover Pipeline

> This skill follows the [Agent Skills specification](https://agentskills.io/specification).

Turn any script into a **publish-ready voiceover** — complete with numbered segments, a stitched master, YouTube chapters, SRT captions, and a beautiful review page. Optionally, replace the audio track on an existing video.

Built for creators who want studio-quality voiceovers without the studio. Powered by [Voice.ai](https://voice.ai).

---

## When to use this skill

| Scenario | Why it fits |
|---|---|
| **YouTube long-form** | Full narration with chapter markers and captions |
| **YouTube Shorts** | Quick hooks with the `shortform` template |
| **Podcasts** | Consistent host voice, intro/outro templates |
| **Course content** | Professional narration for educational videos |
| **Quick iteration** | Smart caching — edit one section, only that segment re-renders |
| **Video audio replacement** | Drop AI voiceover onto screen recordings or B-roll |

---

## The one-command workflow

Have a script and a video? Turn them into a finished video with AI voiceover in one shot:

```bash
node voiceai-vo.cjs build \
  --input my-script.md \
  --voice oliver \
  --title "My Video" \
  --video ./my-recording.mp4 \
  --mux
```

This renders the voiceover, stitches the master audio, and drops it onto your video — all in one command. Output:

- `out/my-video/muxed.mp4` — your video with the new voiceover
- `out/my-video/master.wav` — the standalone audio
- `out/my-video/review.html` — listen and review each segment
- `out/my-video/chapters.txt` — YouTube-ready chapter timestamps
- `out/my-video/captions.srt` — SRT captions

Use `--sync pad` if the audio is shorter than the video, or `--sync trim` to cut it to match.

---

## Requirements

- **Node.js 20+** — runtime (no npm install needed — the CLI is a single bundled file)
- **VOICE_AI_API_KEY** — set as environment variable or in a `.env` file in the skill root. Get a key at [voice.ai/dashboard](https://voice.ai/dashboard).
- **ffmpeg** (optional) — needed for master stitching, MP3 encoding, loudness normalization, and video muxing. The pipeline still produces individual segments, the review page, chapters, and captions without it.

---

## Configuration

The skill reads `VOICE_AI_API_KEY` from (in order):

1. Environment variable `VOICE_AI_API_KEY`
2. Environment variable `VOICEAI_API_KEY` (alternate)
3. `.env` file in the skill root

```bash
echo 'VOICE_AI_API_KEY=your-key-here' > .env
```

Use `--mock` on any command to run the full pipeline without an API key (produces placeholder audio).

---

## Commands

### `build` — Generate a voiceover from a script

```bash
node voiceai-vo.cjs build \
  --input <script.md or script.txt> \
  --voice <voice-alias-or-uuid> \
  --title "My Project" \
  [--template youtube|podcast|shortform] \
  [--language en] \
  [--video input.mp4 --mux --sync shortest] \
  [--force] [--mock]
```

**What it does:**

1. Reads the script and splits it into segments (by `##` headings for `.md`, or by sentence boundaries for `.txt`)
2. Optionally prepends/appends template intro/outro segments
3. Renders each segment via Voice.ai TTS as a numbered WAV file
4. Stitches a master audio file (if ffmpeg is available)
5. Generates chapters, captions, a review page, and metadata files
6. Optionally muxes the voiceover into an existing video

**Full options:**

| Option | Description |
|---|---|
| `-i, --input <path>` | Script file (.txt or .md) — **required** |
| `-v, --voice <id>` | Voice alias or UUID — **required** |
| `-t, --title <title>` | Project title (defaults to filename) |
| `--template <name>` | `youtube`, `podcast`, or `shortform` |
| `--mode <mode>` | `headings` or `auto` (default: headings for .md) |
| `--max-chars <n>` | Max characters per auto-chunk (default: 1500) |
| `--language <code>` | Language code (default: en) |
| `--video <path>` | Input video for muxing |
| `--mux` | Enable video muxing (requires --video) |
| `--sync <policy>` | `shortest`, `pad`, or `trim` (default: shortest) |
| `--force` | Re-render all segments (ignore cache) |
| `--mock` | Mock mode — no API calls, placeholder audio |
| `-o, --out <dir>` | Custom output directory |

### `replace-audio` — Swap the audio track on a video

```bash
node voiceai-vo.cjs replace-audio \
  --video ./input.mp4 \
  --audio ./out/my-project/master.wav \
  [--out ./out/my-project/muxed.mp4] \
  [--sync shortest|pad|trim]
```

Requires ffmpeg. If not installed, generates helper shell/PowerShell scripts instead.

| Sync policy | Behavior |
|---|---|
| `shortest` (default) | Output ends when the shorter track ends |
| `pad` | Pad audio with silence to match video duration |
| `trim` | Trim audio to match video duration |

Video stream is copied without re-encoding (`-c:v copy`). Audio is encoded as AAC. A mux report is saved alongside the output.

**Privacy:** Video processing is entirely local. Only script text is sent to Voice.ai for TTS.

### `voices` — List available voices

```bash
node voiceai-vo.cjs voices [--limit 20] [--query "deep"] [--mock]
```

---

## Available voices

Use short aliases or full UUIDs with `--voice`:

| Alias    | Voice                | Gender | Style                    |
|----------|----------------------|--------|--------------------------|
| `ellie`  | Ellie                | F      | Youthful, vibrant vlogger|
| `oliver` | Oliver               | M      | Friendly British         |
| `lilith` | Lilith               | F      | Soft, feminine           |
| `smooth` | Smooth Calm Voice    | M      | Deep, smooth narrator    |
| `corpse` | Corpse Husband       | M      | Deep, distinctive        |
| `skadi`  | Skadi                | F      | Anime character          |
| `zhongli`| Zhongli              | M      | Deep, authoritative      |
| `flora`  | Flora                | F      | Cheerful, high pitch     |
| `chief`  | Master Chief         | M      | Heroic, commanding       |

The `voices` command also returns any additional voices available on the API. Voice list is cached for 10 minutes.

---

## Build outputs

After a build, the output directory contains:

```
out/<title-slug>/
  segments/           # Numbered WAV files (001-intro.wav, 002-section.wav, …)
  master.wav          # Stitched audio (requires ffmpeg)
  master.mp3          # MP3 encode (requires ffmpeg)
  manifest.json       # Build metadata: voice, template, segment list, hashes
  timeline.json       # Segment durations and start times
  review.html         # Interactive review page with audio players
  chapters.txt        # YouTube-friendly chapter timestamps
  captions.srt        # SRT captions using segment boundaries
  description.txt     # YouTube description with chapters + Voice.ai credit
```

### review.html

A standalone HTML page with:
- Master audio player (if stitched)
- Individual segment players with titles and durations
- Collapsible script text for each segment
- Regeneration command hints

---

## Templates

Templates auto-inject intro/outro segments around the script content:

| Template | Prepends | Appends |
|---|---|---|
| `youtube` | `templates/youtube_intro.txt` | `templates/youtube_outro.txt` |
| `podcast` | `templates/podcast_intro.txt` | — |
| `shortform` | `templates/shortform_hook.txt` | — |

Edit the files in `templates/` to customize the intro/outro text.

---

## Caching

Segments are cached by a hash of: `text content + voice ID + language`.

- Unchanged segments are **skipped** on rebuild — fast iteration
- Modified segments are **re-rendered** automatically
- Use `--force` to re-render everything
- Cache manifest is stored in `segments/.cache.json`

---

## Multilingual support

Voice.ai supports 11 languages. Use `--language <code>` to switch:

`en`, `es`, `fr`, `de`, `it`, `pt`, `pl`, `ru`, `nl`, `sv`, `ca`

The pipeline auto-selects the multilingual TTS model for non-English languages.

---

## Troubles