mmVoiceMaker

ClawSkills, author: blue-coconut, v1.0.1
Enables voice synthesis, voice cloning, voice design, and audio post-processing using MiniMax Voice API and FFmpeg. Use when converting text to speech, creating custom voices, or processing/merging audio.
Installation / Download

TotalClaw CLI (recommended)
totalclaw install clawskills:blue-coconut~mm-voice-maker
cURL (direct download, no login required)
curl -fsSL https://skills.taituai.com/api/skills/clawskills%3Ablue-coconut~mm-voice-maker/file -o mm-voice-maker.md
Git repository (source code)
git clone https://github.com/openclaw/skills/commit/4cc72246318fe2461b88c8e8a0c2bb749a7d3a73
# MiniMax Voice Maker

Professional text-to-speech skill with emotion detection, voice cloning, and audio processing capabilities powered by MiniMax Voice API and FFmpeg.


## Capabilities

| Area | Features |
|------|----------|
| **TTS** | Sync (HTTP/WebSocket), async (long text), streaming |
| **Segment-based** | Multi-voice, multi-emotion synthesis from segments.json, auto merge |
| **Voice** | Cloning (10s–5min), design (text prompt), management |
| **Audio** | Format conversion, merge, normalize, trim, remove silence (FFmpeg) |

## File structure
```
mmVoice_Maker/
├── SKILL.md                       # This overview
├── mmvoice.py                     # CLI tool (recommended for Agents)
├── check_environment.py           # Environment verification
├── requirements.txt
├── scripts/                       # Entry: scripts/__init__.py
│   ├── utils.py                   # Config, data classes
│   ├── sync_tts.py                # HTTP/WebSocket TTS
│   ├── async_tts.py               # Long text TTS
│   ├── segment_tts.py             # Segment-based TTS (multi-voice, multi-emotion)
│   ├── voice_clone.py             # Voice cloning
│   ├── voice_design.py            # Voice design
│   ├── voice_management.py        # List/delete voices
│   └── audio_processing.py        # FFmpeg audio tools
└── reference/                     # Load as needed
    ├── cli-guide.md               # CLI usage guide
    ├── getting-started.md         # Setup and quick test
    ├── tts-guide.md               # Sync/async TTS workflows
    ├── voice-guide.md             # Clone/design/manage
    ├── audio-guide.md             # Audio processing
    ├── script-examples.md         # Runnable code snippets
    ├── troubleshooting.md         # Common issues
    ├── api_documentation.md       # Complete API reference
    └── voice_catalog.md           # Voice selection guide
```


## Main Workflow Guideline (Text to Speech)

**6-step workflow:**
[step1]. Verify environment

[step2-preparation]⚠️NOTE: Before processing the text, you must read [voice_catalog.md](reference/voice_catalog.md) to select voices.

[step2]. Process the text into a script → `<cwd>/audio/segments.json`. Note: Step 2.4 is critical; check it twice before sending the script to the user.

[step2.5]. ⚠️ Generate preview for user confirmation (highly recommended for multi-voice content)

[step3]. Present plan to user for confirmation

[step4]. Validate segments.json

[step5]. Generate and merge audio → intermediate files in `<cwd>/audio/tmp/`, final output in `<cwd>/audio/output.mp3`

[step6]. ⚠️ **CRITICAL**: User confirms audio quality FIRST → THEN cleanup temp files (only after user is satisfied)

> `<cwd>` is Claude's current working directory (not the skill directory). Audio files are saved relative to where Claude is running commands.
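
The file locations named in the workflow can be sketched as a small helper. This is a minimal illustration, not part of the skill: `audio_paths` is a hypothetical name, and only the three paths themselves come from the steps above.

```python
from pathlib import Path


def audio_paths(cwd=None):
    """Return the workflow's file locations relative to the working directory.

    Hypothetical helper for illustration; only the paths are taken from
    the workflow description.
    """
    base = Path(cwd) if cwd is not None else Path.cwd()
    audio = base / "audio"
    return {
        "segments": audio / "segments.json",  # script produced in step 2
        "tmp": audio / "tmp",                 # intermediate files from step 5
        "output": audio / "output.mp3",       # final merged audio from step 5
    }
```

Resolving everything from one base directory keeps the step 6 cleanup confined to `audio/tmp/`.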

### Step 1: Verify environment

```bash
python check_environment.py
```

Checks:
- Python 3.8+
- Required packages (requests, websockets)
- FFmpeg installation
- MINIMAX_VOICE_API_KEY environment variable

If the API key is not set, ask the user for it and set it:
```bash
export MINIMAX_VOICE_API_KEY="your-api-key-here"
```
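
For orientation, the four checks listed above could be approximated with the standard library alone. This is a sketch under assumptions, not the contents of `check_environment.py`; the function name and the result keys are hypothetical.

```python
import importlib.util
import os
import shutil
import sys


def check_environment(env=None):
    """Return a dict mapping each check name to True (passed) or False.

    Hypothetical approximation of the listed checks; the real script
    may test more or report differently.
    """
    env = os.environ if env is None else env
    return {
        "python>=3.8": sys.version_info >= (3, 8),
        # find_spec returns None when a top-level package is not installed.
        "requests": importlib.util.find_spec("requests") is not None,
        "websockets": importlib.util.find_spec("websockets") is not None,
        # shutil.which returns None when ffmpeg is not on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "MINIMAX_VOICE_API_KEY": bool(env.get("MINIMAX_VOICE_API_KEY")),
    }
```

Calling `check_environment()` with no argument reads the real process environment; passing a dict makes the API-key check easy to exercise in isolation.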

### Step 2: Decision and Pre-processing

**⚠️ MOST IMPORTANT PRINCIPLE: Gender Matching First**

Before selecting voices, you MUST always match gender first. This is non-negotiable.

**Golden Rule:**
> **If a character is male → use male voice**
> **If a character is female → use female voice**
> **If a character is neutral/other → choose appropriate neutral voice**

**Why this matters:**
- Violating gender matching (e.g., male character with female voice) breaks immersion
- Even if personality traits match, gender comes first
- This is especially critical for classic literature, historical content, and professional narration

**Examples:**
| Character | Wrong Voice | Correct Voice |
|-----------|-------------|---------------|
| 唐三藏 (male monk) | `female-yujie` ❌ | `Chinese (Mandarin)_Gentleman` ✅ |
| 林黛玉 (female) | `male-qn-badao` ❌ | `female-shaonv` ✅ |
| 曹操 (male warlord) | `female-chengshu` ❌ | `Chinese (Mandarin)_Unrestrained_Young_Man` ✅ |

**Decision guide:**
Evaluate based on:
- Does the user specify a model? → If so, use that model; otherwise use the default, `speech-2.8`
- Is multi-voice needed? → Different voice_id per speaker/character
- For speech-2.8: emotion is auto-matched (leave `emotion` empty)
- For older models: manually specify emotion tags

**Use case scenarios:**

| Scenario | Description | Segments | Voice Selection |
|----------|-------------|----------|-----------------|
| **Single Voice** | User needs one voice for the entire content. Segment only by length (≤1,000,000 chars per segment). | Split by length only | One voice_id for all segments |
| **Multi-Voice** | Multiple characters/speakers, each with different voice. Segment by speaker/role changes. | Split by logical unit (speaker, dialogue, etc.) | Different voice_id per role |
| **Podcast/Interview** | Host and guest speakers with distinct voices. | Split by speaker | Voice per host/guest |
| **Audiobook/Fiction** | Narrator and character voices. | Split by narration vs. dialogue | Voice per narrator/character |
| **Documentary** | Mostly narration with occasional quotes. | Keep as one segment | Single narrator voice |
| **Report/Announcement** | Formal content with consistent tone. | Keep as one segment | Professional voice |

**Processing Workflow (4 sub-steps):**

**Step 2.1: Text Segmentation and Role Analysis**
First, segment your text into logical units and identify the role/character for each segment.

**Key principle (Important!): Split by logical unit, NOT simply by sentence**

**When to split (Important!):**
- Different speakers clearly marked
- Narrator vs. character dialogue (in fiction, audiobooks, interviews, etc.)
- In scenarios where the speaker's identity matters (audiobooks, multi-voice fiction, etc.), split when narration and dialogue mix in the same sentence.

**When NOT to split (Important!):**
- Third-person narration like "John said..." or "The reporter noted..."
- Quoted speech in narration (in documentaries, podcasts, reports, etc.) should stay in the narrator's voice
- Keep text in the narrator's voice unless specific characterization is needed

**Decision depends on use case:**

| Use case | Example | Split strategy |
|----------|---------|----------------|
| **Single Voice** | Long article, news piece, announcement | Split by length (≤1,000,000 chars), same voice for all |
| **Podcast/Interview** | "Host: Welcome to the show. Guest: Thank you for having me." | Split by speaker |
| **Documentary narration** | "The scientist explained, 'The results are promising.'" | Keep as one segment (narrator voice) |
| **Audiobook/Fiction** | "'Who's there?' she whispered." | Split: "'Who's there?'" should be in character voice, while "she whispered." should be in narrator's voice |
| **Report** | "According to the report, the economy is growing." | Keep as one segment |
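
For the Audiobook/Fiction row, separating quoted speech from narration can be sketched with a regular expression. This is an illustration only: `split_dialogue` is a hypothetical helper, it recognizes only double-quoted speech (straight or curly), and it leaves `voice_id` empty. Single-quoted dialogue would need extra handling because of apostrophes.

```python
import re

# One capturing group, so re.split keeps the quoted parts in its result.
QUOTE = re.compile(r'("[^"]*"|“[^”]*”)')


def split_dialogue(sentence, speaker_role, narrator_role="narrator"):
    """Split a sentence into quoted speech (speaker) and narration."""
    segments = []
    for part in QUOTE.split(sentence):
        text = part.strip()
        if not text:
            continue
        # A part that is entirely one quotation belongs to the speaker.
        role = speaker_role if QUOTE.fullmatch(text) else narrator_role
        segments.append(
            {"text": text, "role": role, "voice_id": "", "emotion": ""}
        )
    return segments
```

For documentary or report content, skip this step and keep the whole sentence as one narrator segment.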

**Example 1: Single Voice (speech-2.8)**
For single-voice content (e.g., news, announcements, articles), segment only by length while maintaining the same voice:
```json
[
  {"text": "First part of the article (under 1,000,000 chars)...", "role": "narrator", "voice_id": "female-shaonv", "emotion": ""},
  {"text": "Second part of the article (under 1,000,000 chars)...", "role": "narrator", "voice_id": "female-shaonv", "emotion": ""},
  {"text": "Third part of the article (under 1,000,000 chars)...", "role": "narrator", "voice_id": "female-shaonv", "emotion": ""}
]
```
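
Segments like these could be produced by packing whole sentences into chunks up to the length limit. The sketch below is an assumption about how to do that, not the skill's own code: both function names are hypothetical, and the sentence splitting is naive (it also breaks after the period in a decimal such as "3.14").

```python
import re

MAX_CHARS = 1_000_000  # per-segment limit stated in the tables above

# A run of ordinary text, then its sentence-ending punctuation, then spaces.
SENTENCE = re.compile(r"[^.!?。！？]*[.!?。！？]*\s*")


def split_by_length(text, max_chars=MAX_CHARS):
    """Greedily pack whole sentences into chunks of at most max_chars."""
    sentences = [s for s in SENTENCE.findall(text) if s]
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-cut.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks


def single_voice_segments(text, voice_id, max_chars=MAX_CHARS):
    """Build segments that share one voice, with emotion left empty."""
    return [
        {"text": c.strip(), "role": "narrator", "voice_id": voice_id, "emotion": ""}
        for c in split_by_length(text, max_chars)
        if c.strip()
    ]
```

Cutting at sentence boundaries keeps each chunk's prosody intact where possible; the hard cut is only a fallback for text with no punctuation at all.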

**Example 2: Audiobook with characters (speech-2.8)**
In audiobooks (multi-voice fiction), split when narration and dialogue mix in the same sentence:
```json
[
  {"text": "The detective entered the room.", "role": "narrator", "voice_id": "", "emotion": ""},
  {"text": "\"Who's there?\"", "role": "female_character", "voice_id": "", "emotion": ""},
  {"text": "she whispered.", "role": "narrator", "voice_id": "", "emotion": ""},
  {"text": "\"It'