MiniMaxStudio
Create voice, music, and video with MiniMax AI models. Unified skill for TTS voice synthesis (text-to-speech, voice cloning, voice design, multi-segment generation), music generation (songs, instrumentals), video creation (text-to-video, image-to-video, start-end frame, subject reference, templates, long-form multi-scene), and media processing (audio/video format conversion, concatenation, trimming, extraction). Use when user wants to generate speech audio, create songs or instrumental tracks, produce AI videos, clone or design voices, convert media formats, merge/split audio or video files, or work with MiniMax APIs.
Installation / Download
TotalClaw CLI (recommended): `totalclaw install clawskills:minimax-ai-dev~minimaxstudio`
cURL (direct download, no login required): `curl -fsSL https://skills.taituai.com/api/skills/clawskills%3Aminimax-ai-dev~minimaxstudio/file -o minimaxstudio.md`
Git (get the source from the repository): `git clone https://github.com/openclaw/skills/commit/cfbb405aaa49ad0403db4f9192cc2e608c31ff93`
# MiniMax Studio
Generate voice, music, and video content via MiniMax APIs. Includes voice cloning & voice design for custom voices, and FFmpeg-based media tools for audio/video format conversion, concatenation, trimming, and extraction.
## Output Directory
**All generated files MUST be saved to `minimax-output/` under the AGENT'S current working directory (NOT the skill directory).** Every script call MUST include an explicit `--output` / `-o` argument pointing to this location. Never omit the output argument or rely on script defaults.
**Rules:**
1. Before running any script, ensure `minimax-output/` exists in the agent's working directory (create if needed: `mkdir -p minimax-output`)
2. Always use absolute or relative paths from the agent's working directory: `--output minimax-output/video.mp4`
3. **Never** `cd` into the skill directory to run scripts — run from the agent's working directory using the full script path
4. Intermediate/temp files (segment audio, video segments, extracted frames) are automatically placed in `minimax-output/tmp/`. They can be cleaned up when no longer needed: `rm -rf minimax-output/tmp`
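The rules above can be sketched as a small path helper. This is only an illustration: `output_path` is a hypothetical name, not part of the skill's scripts, and it assumes the process is started from the agent's working directory.

```python
from pathlib import Path

OUTPUT_DIR = "minimax-output"  # directory name taken from the rules above


def output_path(filename: str, temp: bool = False) -> Path:
    """Return a path under minimax-output/ in the current working directory.

    Hypothetical helper: creates the directory if needed (rule 1) and
    places intermediate files under minimax-output/tmp/ (rule 4).
    """
    base = Path.cwd() / OUTPUT_DIR
    if temp:
        base = base / "tmp"
    base.mkdir(parents=True, exist_ok=True)
    return base / filename
```

The returned path can be passed straight to a script's `--output` / `-o` argument, for example `str(output_path("video.mp4"))`.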
## Prerequisites
```bash
pip install -r requirements.txt # requests, websockets, ffmpeg-python
brew install ffmpeg # macOS
python scripts/check_environment.py
```
### API Key Configuration
The `MINIMAX_API_KEY` can be provided in two ways (either works):
1. **`.env` file** (recommended — persists across sessions):
Create a `.env` file in the MiniMaxStudio project root (alongside SKILL.md):
```
MINIMAX_API_KEY=sk-api-xxxxxxxxxxxxxxxxxxxxxxxx
```
All scripts automatically load `.env` on startup. Environment variables take precedence over `.env` values.
2. **Environment variable**:
```bash
export MINIMAX_API_KEY="sk-api-xxxxxxxxxxxxxxxxxxxxxxxx"
```
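The precedence rule (an exported variable wins over `.env`) can be illustrated with a minimal loader. This is a sketch under assumptions, not the scripts' actual loading code: `load_env_file` is a hypothetical name, and the parsing handles only simple `KEY=VALUE` lines.

```python
import os
from pathlib import Path


def load_env_file(env_path: Path) -> None:
    """Copy KEY=VALUE pairs from env_path into os.environ without overriding.

    Hypothetical helper; the real scripts may parse .env differently.
    """
    if not env_path.is_file():
        return
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip().strip('"').strip("'")
        # setdefault keeps an already-exported variable, so the environment wins.
        os.environ.setdefault(key.strip(), value)
```

With this approach, a key set via `export MINIMAX_API_KEY=...` is left untouched, and the `.env` value is used only when the variable is absent.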
**IMPORTANT — When API Key is missing:**
Before running any script, check if `MINIMAX_API_KEY` is available (via env var or `.env` file). If it is NOT configured:
1. Ask the user to provide their MiniMax API key
2. Write the key to the `.env` file in the **MiniMaxStudio skill directory** (i.e., the directory containing this SKILL.md): `echo 'MINIMAX_API_KEY=sk-api-xxxxx' > <skill_directory>/.env`
3. Do NOT store the key in the agent's working directory — always write to the skill directory so it persists for all future sessions
4. The key starts with `sk-api-`, obtainable from https://platform.minimaxi.com

## Key Capabilities
| Capability | Description | Entry point |
|------------|-------------|-------------|
| TTS | Text-to-speech synthesis with multiple voices and emotions | `scripts/tts/generate_voice.py` |
| Voice Cloning | Clone a voice from an audio sample (10s–5min) | `scripts/tts/generate_voice.py clone` |
| Voice Design | Create a custom voice from a text description | `scripts/tts/generate_voice.py design` |
| Music Generation | Generate songs with lyrics or instrumental tracks | `scripts/music/generate_music.py` |
| Video Generation | Text-to-video, image-to-video, subject reference, templates | `scripts/video/generate_video.py` |
| Long Video | Multi-scene chained video with crossfade transitions | `scripts/video/generate_long_video.py` |
| Media Tools | Audio/video format conversion, concatenation, trimming, extraction | `scripts/media_tools.py` |
## TTS (Text-to-Speech)
Entry point: `scripts/tts/generate_voice.py`
### IMPORTANT: Single voice vs Multi-segment — Choose the right approach
| User intent | Approach |
|-------------|----------|
| Single voice / no multi-character need | `tts` command — generate the entire text in one call |
| Multiple characters / narrator + dialogue | `generate` command with segments.json |
**Default behavior:** When the user simply asks to generate speech/voice and does NOT mention multiple voices or characters, use the `tts` command directly with a single appropriate voice. Do NOT split into segments or use the multi-segment pipeline — just pass the full text to `tts` in one call.
Only use multi-segment `generate` when:
- The user explicitly needs multiple voices/characters
- The text requires narrator + character dialogue separation
- The text exceeds **10,000 characters** (API limit per request) — in this case, split into segments with the same voice
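For the over-length case, one way to produce same-voice segments is sketched below. The function names are hypothetical and the sentence-boundary heuristic is an assumption; only the 10,000-character limit and the segment fields come from this document.

```python
MAX_CHARS = 10_000  # per-request limit stated above


def split_text(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters.

    Prefers to cut after a sentence-ending mark; falls back to a hard cut.
    """
    chunks = []
    rest = text.strip()
    while len(rest) > limit:
        window = rest[:limit]
        # Position just after the last sentence-ending mark in the window.
        cut = max(window.rfind(mark) for mark in ".!?。!?") + 1
        if cut <= 0:
            cut = limit  # no sentence boundary found: hard cut at the limit
        chunks.append(rest[:cut].strip())
        rest = rest[cut:].strip()
    if rest:
        chunks.append(rest)
    return chunks


def to_segments(text: str, voice_id: str) -> list[dict]:
    """Build segments.json-style entries that all use the same voice."""
    return [{"text": c, "voice_id": voice_id, "emotion": ""} for c in split_text(text)]
```

Text that already fits within the limit yields a single segment, in which case the plain `tts` command remains the simpler choice.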
### Single-voice generation (DEFAULT)
```bash
python scripts/tts/generate_voice.py tts "Hello world" -o minimax-output/hello.mp3
python scripts/tts/generate_voice.py tts "你好世界" -v female-shaonv -o minimax-output/hello_cn.mp3
```
### Multi-segment generation (multi-voice / audiobook / podcast)
**Complete workflow — follow ALL steps in order:**
1. **Write segments.json** — split text into segments with voice assignments (see format and rules below)
2. **Run `generate` command** — this reads segments.json, generates audio for EACH segment via TTS API, then merges them into a single output file with crossfade
```bash
# Step 1: Write segments.json to minimax-output/
# (use the Write tool to create minimax-output/segments.json)
# Step 2: Generate audio from segments.json — this is the CRITICAL step
# It generates each segment individually and merges them into one file
python scripts/tts/generate_voice.py generate minimax-output/segments.json \
-o minimax-output/output.mp3 --crossfade 200
```
**Do NOT skip Step 2.** Writing segments.json alone does nothing — you MUST run the `generate` command to actually produce audio.
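When the two steps are driven from Python rather than a shell, the Step 2 invocation might be assembled as follows. This is a sketch: `build_generate_command` is a hypothetical helper, and it uses only the subcommand and flags shown above (`generate`, `-o`, `--crossfade`).

```python
import sys

SCRIPT = "scripts/tts/generate_voice.py"  # entry point named in this document


def build_generate_command(segments_path: str, output_file: str,
                           crossfade_ms: int = 200) -> list[str]:
    """Return the argument list for the multi-segment `generate` step."""
    return [
        sys.executable, SCRIPT, "generate", segments_path,
        "-o", output_file,
        "--crossfade", str(crossfade_ms),
    ]
```

The list can be handed to `subprocess.run(cmd, check=True)` from the agent's working directory; prefix `SCRIPT` with the skill directory if the scripts live elsewhere.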
### Voice management
```bash
# List all available voices
python scripts/tts/generate_voice.py list-voices
# Voice cloning (from audio sample, 10s–5min)
python scripts/tts/generate_voice.py clone sample.mp3 --voice-id my-voice
# Voice design (from text description)
python scripts/tts/generate_voice.py design "A warm female narrator voice" --voice-id narrator
```
### Audio processing
```bash
python scripts/tts/generate_voice.py merge part1.mp3 part2.mp3 -o minimax-output/combined.mp3
python scripts/tts/generate_voice.py convert input.wav -o minimax-output/output.mp3
```
### TTS Models
| Model | Notes |
|-------|-------|
| speech-2.8-hd | Recommended, auto emotion matching |
| speech-2.8-turbo | Faster variant |
| speech-2.6-hd | Previous gen, manual emotion |
| speech-2.6-turbo | Previous gen, faster |
### segments.json Format
Default crossfade between segments: **200ms** (`--crossfade 200`).
```json
[
{ "text": "Hello!", "voice_id": "female-shaonv", "emotion": "" },
{ "text": "Welcome.", "voice_id": "male-qn-qingse", "emotion": "happy" }
]
```
Leave `emotion` empty for speech-2.8 models (auto-matched from text).
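A quick structural check before running `generate` can catch malformed files early. The sketch below is an assumption about what counts as valid (a non-empty array whose items have non-empty `text` and `voice_id` strings and an optional string `emotion`); `validate_segments` is a hypothetical name, and the script's own validation may differ.

```python
import json

REQUIRED_KEYS = ("text", "voice_id")  # assumed required, based on the format above


def validate_segments(segments) -> list[str]:
    """Return a list of problems; an empty list means the structure looks usable."""
    if not isinstance(segments, list) or not segments:
        return ["top level must be a non-empty JSON array"]
    problems = []
    for i, seg in enumerate(segments):
        if not isinstance(seg, dict):
            problems.append(f"segment {i}: not an object")
            continue
        for key in REQUIRED_KEYS:
            if not isinstance(seg.get(key), str) or not seg[key].strip():
                problems.append(f"segment {i}: missing or empty '{key}'")
        if not isinstance(seg.get("emotion", ""), str):
            problems.append(f"segment {i}: 'emotion' must be a string")
    return problems


def validate_file(path: str) -> list[str]:
    """Load a segments.json file and validate its structure."""
    with open(path, encoding="utf-8") as fh:
        return validate_segments(json.load(fh))
```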
### IMPORTANT: Multi-Segment Script Generation Rules (Audiobooks, Podcasts, etc.)
When generating segments.json for audiobooks, podcasts, or any multi-character narration, you MUST split narration text from character dialogue into separate segments with distinct voices.
**Rule: Narration and dialogue are ALWAYS separate segments.**
A sentence like `"Tom said: The weather is great today!"` must be split into two segments:
- Segment 1 (narrator voice): `"Tom said:"`
- Segment 2 (character voice): `"The weather is great today!"`
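The split described above can be sketched for the simple `speaker said: dialogue` pattern. This is a deliberately naive illustration: `split_attribution` is a hypothetical helper that cuts at the first colon only, and real prose (quotation marks, mid-sentence attributions) needs manual or model-driven segmentation.

```python
def split_attribution(sentence: str, narrator: str, character: str) -> list[dict]:
    """Split 'Narration: dialogue' at the first colon into two segments."""
    head, sep, tail = sentence.partition(":")
    if not sep or not tail.strip():
        # No colon, or nothing after it: treat the whole sentence as narration.
        return [{"text": sentence.strip(), "voice_id": narrator, "emotion": ""}]
    return [
        {"text": head.strip() + ":", "voice_id": narrator, "emotion": ""},
        {"text": tail.strip(), "voice_id": character, "emotion": ""},
    ]
```

Calling it on `"Tom said: The weather is great today!"` with placeholder voice IDs yields one narrator segment containing `Tom said:` and one character segment containing the spoken line.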
**Example — Audiobook with narrator + 2 characters:**
```json
[
{ "text": "Morning sunlight streamed into the classroom as students filed in one by one.", "voice_id": "narrator-voice", "emotion": "" },
{ "text": "Tom smiled and turned to Lisa:", "voice_id": "narrator-voice", "emotion": "" },
{ "text": "The weather is amazing today! Let's go to the park after school!", "voice_id": "tom-voice", "emotion": "happy" },
{ "text": "Lisa thought for a moment, then replied:", "voice_id": "narrator-voice", "emotion": "" },
{ "text": "Sure, but I need to drop off my backpack at home first.", "voice_id": "lisa-voice", "emotion": "" },
{ "text": "They exchanged a smile and went back to listening to the lecture.", "voice_id": "narrator-voice", "emotion": "" }
]
```
**Key principles:**
1. **Narrator** uses a consistent neutral narrator voice throughout
2. **Each character*