qwen3-tts-local-inference
Generate speech from text using Qwen3-TTS via direct Python inference — no server required. Use when: (1) converting text to speech / synthesising audio, (2) creating voiceovers or spoken content, (3) cloning a voice from reference audio, (4) generating TTS with built-in speakers or custom voice descriptions. Supports custom-voice (9 speakers), voice-design (natural language), and voice-clone (~3 s reference). Outputs .wav files. Both 0.6B (small, default) and 1.7B (large) models available. Runs entirely offline after model download.
Install / download

TotalClaw CLI (recommended):
`totalclaw install clawskills:clawskills~jithinm-qwen3-tts-local-inference`

cURL (direct download, no login required):
`curl -fsSL https://skills.taituai.com/api/skills/clawskills%3Aclawskills~jithinm-qwen3-tts-local-inference/file -o jithinm-qwen3-tts-local-inference.md`

# Qwen3-TTS: Local Inference (No Server)
Run Qwen3-TTS directly in Python — no HTTP server, no REST API. Call a script
or import the engine in your own code.
## Quick reference
| Mode | What it does | Key args |
|------|-------------|----------|
| **custom-voice** | 9 built-in speakers, optional emotion/style | `--speaker`, `--instruct` |
| **voice-design** | Describe the voice in natural language | `--instruct` (required) |
| **voice-clone** | Clone from ~3 s reference audio | `--ref-audio`, `--ref-text` |
**Available Speakers**
The CustomVoice model includes 9 premium voices:
| Speaker | Language | Description |
|---------|----------|-------------|
| Vivian | Chinese | Bright, slightly edgy young female |
| Serena | Chinese | Warm, gentle young female |
| Uncle_Fu | Chinese | Seasoned male, low mellow timbre |
| Dylan | Chinese (Beijing) | Youthful Beijing male, clear |
| Eric | Chinese (Sichuan) | Lively Chengdu male, husky |
| Ryan | English | Dynamic male, rhythmic |
| Aiden | English | Sunny American male |
| Ono_Anna | Japanese | Playful female, light nimble |
| Sohee | Korean | Warm female, rich emotion |
**Languages:** Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian, Auto
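
When choosing a speaker programmatically, it can help to keep the table above as data. The following is a minimal sketch: the names and languages are copied from the table, while `SPEAKERS`, `speakers_for`, and `check_speaker` are hypothetical helpers, not part of the skill's scripts. Whether `tts.py` validates speaker names itself is not stated here.

```python
# Speaker names and native languages copied from the table above.
SPEAKERS = {
    "Vivian": "Chinese",
    "Serena": "Chinese",
    "Uncle_Fu": "Chinese",
    "Dylan": "Chinese",   # Beijing dialect
    "Eric": "Chinese",    # Sichuan dialect
    "Ryan": "English",
    "Aiden": "English",
    "Ono_Anna": "Japanese",
    "Sohee": "Korean",
}


def speakers_for(language: str) -> list[str]:
    """Return the built-in speakers whose native language matches (case-insensitive)."""
    wanted = language.strip().lower()
    return [name for name, lang in SPEAKERS.items() if lang.lower() == wanted]


def check_speaker(name: str) -> str:
    """Return the name unchanged if it is a built-in speaker, else raise ValueError."""
    if name not in SPEAKERS:
        raise ValueError(f"unknown speaker {name!r}; choose one of {sorted(SPEAKERS)}")
    return name
```

For example, `speakers_for("English")` returns the two English voices, which can then be passed to `--speaker`.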
---
## 1 — Setup
**First-time setup** (run once, from the skill directory) installs the dependencies:
```bash
bash scripts/setup.sh
```
Custom download location:
```bash
python scripts/download_models.py --model-dir /path/to/models
```
Models are stored under `{baseDir}/models/` by default. Override with
`QWEN_TTS_MODEL_DIR` env var or `--model-dir` flag.
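
The override rule above can be sketched as a small lookup. This is an illustration only: `resolve_model_dir` is a hypothetical helper, and the ordering (flag first, then environment variable, then the default) is an assumption, since the text does not say which override wins when both are set.

```python
import os
from pathlib import Path


def resolve_model_dir(cli_value=None, env=None, base_dir="."):
    """Pick the model directory: flag, then env var, then <base_dir>/models.

    The flag-over-env ordering is an assumption, not documented behaviour.
    """
    env = os.environ if env is None else env
    if cli_value:
        return Path(cli_value)
    if env.get("QWEN_TTS_MODEL_DIR"):
        return Path(env["QWEN_TTS_MODEL_DIR"])
    return Path(base_dir) / "models"
```

Passing `env` explicitly keeps the function easy to test without touching the real environment.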
---
## 2 — Generate speech (CLI)
### Custom Voice (default)
```bash
cd {baseDir}
python scripts/tts.py "Hello, how are you today?" --speaker Ryan --language English
```
With emotion/style instruction:
```bash
python scripts/tts.py "Great news everyone!" --speaker Aiden --instruct "cheerful and energetic"
```
### Voice Design
Describe the voice in natural language:
```bash
python scripts/tts.py "Welcome to our show!" \
  --mode voice-design \
  --language English \
  --instruct "Warm, confident female voice in her 30s with a slight British accent"
```
### Voice Clone
Clone a voice from a short (~3 s) reference audio clip:
```bash
python scripts/tts.py "This is spoken in the cloned voice." \
  --mode voice-clone \
  --language English \
  --ref-audio path/to/reference.wav \
  --ref-text "Transcript of the reference audio."
```
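
Before cloning, it may be worth checking how long the reference clip actually is. The sketch below measures a PCM WAV file with the standard-library `wave` module. `wav_duration_seconds` and `check_reference_clip` are hypothetical helpers, and the 2 to 15 second bounds are illustrative assumptions; the only figure given here is "about 3 s".

```python
import wave


def wav_duration_seconds(path):
    """Return the length of a PCM WAV file in seconds, using only the stdlib."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_clip(path, min_s=2.0, max_s=15.0):
    """Raise ValueError if the clip falls outside the (assumed) accepted range."""
    duration = wav_duration_seconds(path)
    if not min_s <= duration <= max_s:
        raise ValueError(
            f"reference clip is {duration:.2f} s; expected {min_s}-{max_s} s"
        )
    return duration
```

The returned duration can be logged alongside the `--ref-audio` path so that poor clones are easier to trace back to a clip that was too short.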
### Common options
| Flag | Purpose |
|------|---------|
| `-o output.wav` | Save to exact file path instead of auto-named file |
| `--output-dir DIR` | Override output directory (default: `tts_output/`) |
| `--model-dir DIR` | Override model directory |
| `--json` | Print result as JSON |
| `-v` | Verbose logging |
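
The `--json` flag makes the CLI convenient to drive from another program. The sketch below builds the argument list from the flags documented above and parses the output. `build_tts_command` and `run_tts` are hypothetical wrappers; the assumptions are that the script prints a single JSON document to stdout and that the default script path is `scripts/tts.py` relative to the working directory.

```python
import json
import subprocess
import sys


def build_tts_command(text, *, mode=None, speaker=None, language=None,
                      instruct=None, output=None, script="scripts/tts.py"):
    """Assemble the argv list for tts.py using the flags documented above."""
    cmd = [sys.executable, script, text, "--json"]
    if mode:
        cmd += ["--mode", mode]
    if speaker:
        cmd += ["--speaker", speaker]
    if language:
        cmd += ["--language", language]
    if instruct:
        cmd += ["--instruct", instruct]
    if output:
        cmd += ["-o", output]
    return cmd


def run_tts(text, **options):
    """Run tts.py and parse stdout as JSON (the output shape is an assumption)."""
    completed = subprocess.run(
        build_tts_command(text, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)
```

A call such as `run_tts("Hello", speaker="Ryan", language="English")` would then return the parsed result. If the script also writes log lines to stdout, `json.loads` will fail and the output would need filtering first.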
---
## 3 — Python API
Use the engine directly in code:
```python
import sys

# Make the skill's scripts directory importable.
sys.path.insert(0, "{baseDir}/scripts")

from inference import TTSInferenceEngine

engine = TTSInferenceEngine(
    model_dir="{baseDir}/models",  # optional, uses default if omitted
    output_dir="./tts_output",     # optional
)

result = engine.generate_custom_voice(
    text="Hello world!",
    language="English",
    speaker="Ryan",
    instruct="calm and professional",
)
print(result)
# {"file": "tts_output/custom_voice_20260218_...wav", "duration_s": 1.23, "inference_s": 4.56}
```
Available methods:
- `engine.generate_custom_voice(text, language, speaker, instruct)`
- `engine.generate_voice_design(text, language, instruct)`
- `engine.generate_voice_clone(text, language, ref_audio, ref_text)`
- `engine.status()` — returns loaded variant, device, paths
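
As an illustration of using these methods in a loop, the sketch below synthesises several lines with one speaker and adds up the audio length. `synthesize_lines` is a hypothetical helper. It assumes that `generate_custom_voice` accepts keyword arguments as in the example above, that `instruct` may be omitted, and that each result is a dict with a `duration_s` key.

```python
def synthesize_lines(engine, lines, *, speaker="Ryan", language="English",
                     instruct=None):
    """Generate one clip per non-blank line; return (results, total_seconds)."""
    results = []
    for text in lines:
        text = text.strip()
        if not text:
            continue  # skip blank lines rather than synthesising silence
        kwargs = {"text": text, "language": language, "speaker": speaker}
        if instruct:
            # Assumption: instruct is optional and can be left out entirely.
            kwargs["instruct"] = instruct
        results.append(engine.generate_custom_voice(**kwargs))
    total = sum(r.get("duration_s", 0.0) for r in results)
    return results, total
```

Keeping every call in the same mode means the engine should not need to swap model variants between lines.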
---
## 4 — Configuration
All settings are controlled via environment variables. Set them before running.
| Variable | Default | Description |
|----------|---------|-------------|
| `QWEN_TTS_MODEL_SIZE` | `small` | `small` (0.6B) or `large` (1.7B) |
| `QWEN_TTS_MODEL_DIR` | `{baseDir}/models` | Where model weights are stored |
| `QWEN_TTS_DEVICE` | auto (`cuda:0` or `cpu`) | Inference device |
| `QWEN_TTS_DTYPE` | auto (`bfloat16` / `float32`) | Model precision |
| `QWEN_TTS_OUTPUT_DIR` | `./tts_output` | Where generated .wav files are saved |
Switch to the 1.7B model:
```bash
# bash/zsh syntax; in Windows cmd use: set QWEN_TTS_MODEL_SIZE=large
export QWEN_TTS_MODEL_SIZE=large
python scripts/tts.py "Hello world"
```
Use a custom model directory:
```bash
# bash/zsh syntax; in Windows cmd use: set QWEN_TTS_MODEL_DIR=D:\my-models\qwen-tts
export QWEN_TTS_MODEL_DIR=/path/to/my-models/qwen-tts
python scripts/tts.py "Hello world"
```
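
For readers who want to see how such settings might be consumed, here is a minimal sketch that reads the variables from the table with their documented defaults. `load_tts_config` and `MODEL_SIZES` are hypothetical; the skill's own loader may differ, and the `"auto"` placeholders simply stand for the automatic device and precision selection described in the table.

```python
import os

# Size names and parameter counts as listed in the configuration table.
MODEL_SIZES = {"small": "0.6B", "large": "1.7B"}


def load_tts_config(env=None):
    """Read the QWEN_TTS_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    size = env.get("QWEN_TTS_MODEL_SIZE", "small").lower()
    if size not in MODEL_SIZES:
        raise ValueError(f"QWEN_TTS_MODEL_SIZE must be one of {sorted(MODEL_SIZES)}")
    return {
        "model_size": size,
        "parameters": MODEL_SIZES[size],
        "model_dir": env.get("QWEN_TTS_MODEL_DIR"),  # None means {baseDir}/models
        "device": env.get("QWEN_TTS_DEVICE", "auto"),
        "dtype": env.get("QWEN_TTS_DTYPE", "auto"),
        "output_dir": env.get("QWEN_TTS_OUTPUT_DIR", "./tts_output"),
    }
```

From Python, the same variables can be set with `os.environ["QWEN_TTS_MODEL_SIZE"] = "large"` before the engine is created, assuming the engine reads them at construction time.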
---
## Important notes
- **Small model (0.6B) is the default.** It uses less RAM and is faster.
Switch to `large` (1.7B) for higher quality.
- **CPU inference is slow.** Expect 30-120 s per sentence for the 1.7B model.
The 0.6B model is roughly 2x faster.
- Only **one model variant** is loaded at a time. Switching modes (e.g.
custom-voice to voice-clone) triggers a model swap.
- Output `.wav` files land in `tts_output/` by default.
- Models are downloaded to `{baseDir}/models/` by default. Run
`download_models.py --size all` to pre-download both sizes for offline use.
- Voice Design mode has **no 0.6B variant** — it always uses the 1.7B model
regardless of `QWEN_TTS_MODEL_SIZE`.
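
Two of the notes above lend themselves to a short sketch: voice design always resolves to the 1.7B model, and grouping requests by mode reduces model swaps. The helpers below are hypothetical and operate on plain dicts with a `mode` key; they do not call the engine.

```python
def effective_model_size(mode, requested="small"):
    """Voice design has no 0.6B variant, so it always resolves to 'large'."""
    return "large" if mode == "voice-design" else requested


def order_jobs_by_mode(jobs):
    """Stable-sort jobs so that each mode's requests run back to back."""
    first_seen = {}
    for job in jobs:
        # Rank each mode by the position at which it first appears.
        first_seen.setdefault(job["mode"], len(first_seen))
    return sorted(jobs, key=lambda job: first_seen[job["mode"]])


def count_mode_switches(jobs):
    """Count how often consecutive jobs use different modes."""
    modes = [job["mode"] for job in jobs]
    return sum(1 for a, b in zip(modes, modes[1:]) if a != b)
```

For a queue that alternates custom-voice, voice-clone, custom-voice, reordering cuts the number of mode switches from two to one, which matters most on CPU where loading a model is slow.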