ugc-manual

TotalClaw (author: totalclaw)

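Generates a lip-sync video from an image and the user's own audio recording.

✅ Use when:
- The user provides their own audio file (a recording)
- An image needs to be synced to a specific audio or voice track
- The user records the script themselves
- The exact audio timing must be preserved

❌ Do not use when:
- The user provides a text script rather than audio → use veed-ugc
- An AI-generated voice is needed → use veed-ugc
- No audio file exists yet → use veed-ugc with a script

Input: image + audio file (the user's recording)

Output: MP4 video lip-synced to the provided audio

Key difference:
- veed-ugc = script → AI voice → video
- ugc-manual = user audio → video (no voice generation)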

Installation / Download

TotalClaw CLI (recommended)

totalclaw install totalclaw:totalclaw~pauldelavallaz-ugc-manual

cURL (direct download, no login required)

curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~pauldelavallaz-ugc-manual/file -o pauldelavallaz-ugc-manual.md

# UGC-Manual

Generate lip-sync videos by combining an image with a custom audio file using ComfyDeploy's UGC-MANUAL workflow.

## Overview

UGC-Manual takes:
1. An image (person/character with visible face)
2. An audio file (user's voice recording)

It produces a video in which the person in the image lip-syncs to the audio.

## API Details

**Endpoint:** `https://api.comfydeploy.com/api/run/deployment/queue`
**Deployment ID:** `075ce7d3-81a6-4e3e-ab0e-7a25edf601b5`

## Required Inputs

| Input | Description | Formats |
|-------|-------------|---------|
| `image` | Image with a visible face | JPG, PNG |
| `input_audio` | Audio file to lip-sync | MP3, WAV, OGG |
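
As a rough illustration of how the endpoint, deployment ID, and input names above might fit together, the sketch below builds a queue request without sending it. The JSON body shape (`deployment_id` plus an `inputs` object), the bearer-token header, and the helper name `build_queue_request` are assumptions, not confirmed by this document; the bundled `generate.py` may do this differently.

```python
import json
import urllib.request

ENDPOINT = "https://api.comfydeploy.com/api/run/deployment/queue"
DEPLOYMENT_ID = "075ce7d3-81a6-4e3e-ab0e-7a25edf601b5"


def build_queue_request(image_url: str, audio_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that queues one run."""
    # Assumed body shape: the deployment ID plus an "inputs" object keyed
    # by the input names from the Required Inputs table.
    payload = {
        "deployment_id": DEPLOYMENT_ID,
        "inputs": {"image": image_url, "input_audio": audio_url},
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme: a bearer token in the Authorization header.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would submit the run. Keeping construction separate from sending makes the payload easy to inspect first.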

## Usage

```bash
uv run ~/.clawdbot/skills/ugc-manual/scripts/generate.py \
  --image "path/to/image.jpg" \
  --audio "path/to/audio.mp3" \
  --output "output-video.mp4"
```

### With URLs:
```bash
uv run ~/.clawdbot/skills/ugc-manual/scripts/generate.py \
  --image "https://example.com/image.jpg" \
  --audio "https://example.com/audio.mp3" \
  --output "result.mp4"
```
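
When the script is driven from another program rather than a shell, the same invocation can be assembled as an argument list. This is a minimal sketch that uses only the flags shown above; the helper names `build_command` and `generate` are hypothetical and not part of the skill.

```python
import subprocess
from pathlib import Path

# Script location as given in the usage examples above.
SCRIPT = Path("~/.clawdbot/skills/ugc-manual/scripts/generate.py").expanduser()


def build_command(image: str, audio: str, output: str) -> list[str]:
    """Return the argv list for one generate.py run (paths or URLs)."""
    return [
        "uv", "run", str(SCRIPT),
        "--image", image,
        "--audio", audio,
        "--output", output,
    ]


def generate(image: str, audio: str, output: str) -> None:
    """Run the script and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_command(image, audio, output), check=True)
```

Passing an argument list instead of a shell string avoids quoting problems when file names contain spaces.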

## Workflow Integration

### Typical Use Cases

1. **Custom voice recordings** - User records their own audio via Telegram/WhatsApp
2. **Pre-generated TTS** - Audio generated externally (ElevenLabs, etc.)
3. **Music/sound sync** - Sync mouth movements to any audio

### Example Pipeline

```bash
# 1. Convert Telegram voice message to MP3 (if needed)
ffmpeg -i voice.ogg -acodec libmp3lame -q:a 2 voice.mp3

# 2. Generate lip-sync video
uv run ugc-manual... --image face.jpg --audio voice.mp3 --output video.mp4
```

## Difference from VEED-UGC

| Feature | UGC-Manual | VEED-UGC |
|---------|------------|----------|
| Audio source | User provides | Generated from brief |
| Script | N/A | Auto-generated |
| Voice | User's recording | ElevenLabs TTS |
| Use case | Custom audio | Automated content |

## Notes

- Image should have a clearly visible face (frontal or 3/4 view)
- Audio quality affects output quality
- Processing time: ~2-5 minutes depending on audio length
- **Audio auto-conversion**: The script automatically converts any audio format (MP3, OGG, M4A, etc.) to WAV PCM 16-bit mono 48kHz before sending to FabricLipsync
- Requires `ffmpeg` installed on the system
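
The auto-conversion described above can be approximated with a single `ffmpeg` call. The sketch below shows one plausible set of flags for WAV PCM 16-bit mono 48 kHz; it is an assumption about what the script does, not a copy of it, and the function names are hypothetical.

```python
import subprocess


def wav_conversion_command(src: str, dst: str) -> list[str]:
    """Return an ffmpeg argv list that converts src to 16-bit mono 48 kHz WAV."""
    return [
        "ffmpeg", "-y",       # overwrite dst if it already exists
        "-i", src,            # input in any format ffmpeg can decode
        "-vn",                # ignore any video or cover-art stream
        "-ac", "1",           # downmix to mono
        "-ar", "48000",       # resample to 48 kHz
        "-c:a", "pcm_s16le",  # signed 16-bit little-endian PCM
        dst,
    ]


def convert_to_wav(src: str, dst: str) -> None:
    """Run the conversion; requires ffmpeg on PATH."""
    subprocess.run(wav_conversion_command(src, dst), check=True)
```

For example, `convert_to_wav("voice.ogg", "voice.wav")` would normalize a Telegram voice message before upload.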
