qwenspeak

ClawSkills, author: psyb0t

Text-to-speech generation via Qwen3-TTS over SSH. Preset voices, voice cloning, voice design. Use when the user wants to generate speech audio, clone voices, or work with TTS.

Installation / Download

TotalClaw CLI (recommended): `totalclaw install clawskills:clawskills~psyb0t-qwenspeak`

cURL (direct download, no login required): `curl -fsSL https://skills.taituai.com/api/skills/clawskills%3Aclawskills~psyb0t-qwenspeak/file -o psyb0t-qwenspeak.md`

# qwenspeak

YAML-driven text-to-speech over SSH using Qwen3-TTS models.

For installation and deployment, see [references/setup.md](references/setup.md).

## SSH Wrapper

Use `scripts/qwenspeak.sh` for all commands. It handles host, port, and host key acceptance via `QWENSPEAK_HOST` and `QWENSPEAK_PORT` env vars.

```bash
scripts/qwenspeak.sh <command> [args]
scripts/qwenspeak.sh <command> < input_file
scripts/qwenspeak.sh <command> > output_file
```
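As a minimal sketch, the wrapper can be pointed at a server by exporting the two variables before calling it. The host name and port below are placeholders chosen for illustration, not values from this project.

```bash
# Placeholder connection settings (hypothetical values; substitute your own server).
export QWENSPEAK_HOST="tts.example.internal"
export QWENSPEAK_PORT="2222"

# Run a read-only command to confirm the connection works.
if [ -x scripts/qwenspeak.sh ]; then
  scripts/qwenspeak.sh "list-files"
else
  echo "scripts/qwenspeak.sh not found; run this from the skill directory" >&2
fi
```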

## TTS Generation

Submit a YAML job file and you get a job UUID back immediately; poll that job for progress. Jobs run one at a time, and the rest wait in the queue.

```bash
# Get the YAML template
scripts/qwenspeak.sh "tts print-yaml" > job.yaml

# Submit job
scripts/qwenspeak.sh "tts" < job.yaml
# {"id": "550e8400-...", "status": "queued", "total_steps": 3, "total_generations": 7}

# Check progress
scripts/qwenspeak.sh "tts get-job 550e8400"

# Follow job log
scripts/qwenspeak.sh "tts get-job-log 550e8400 -f"

# Download result
scripts/qwenspeak.sh "get hello.wav" > hello.wav
```
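The submit-then-poll flow above can be scripted. The following is a sketch under one loud assumption: that `tts get-job` prints JSON containing a `"status"` field, like the submit response does. The helper names `json_field` and `wait_for_job`, and the `QWENSPEAK` variable, are hypothetical and not part of the skill.

```bash
# Path to the wrapper; override QWENSPEAK to point elsewhere.
QWENSPEAK="${QWENSPEAK:-scripts/qwenspeak.sh}"

# Print the value of a string field from flat JSON on stdin,
# e.g. {"id": "550e8400-...", "status": "queued"}.
json_field() {
  sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Poll a job until it reaches a final status.
# ASSUMPTION: `tts get-job` output includes "status": "<value>".
# Returns 0 on completed, 1 on failed/cancelled, 2 on unrecognised output.
wait_for_job() {
  local id="$1" interval="${2:-5}" status
  while true; do
    status="$("$QWENSPEAK" "tts get-job $id" | json_field status)"
    case "$status" in
      completed) return 0 ;;
      failed|cancelled) return 1 ;;
      queued|running) sleep "$interval" ;;
      *) echo "unexpected status: '$status'" >&2; return 2 ;;
    esac
  done
}

# Usage:
#   id="$("$QWENSPEAK" "tts" < job.yaml | json_field id)"
#   wait_for_job "$id" && "$QWENSPEAK" "get hello.wav" > hello.wav
```

If `get-job` turns out to print plain text rather than JSON, only `json_field` needs to change; the loop returns 2 instead of spinning forever on output it does not recognise.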

## YAML Structure

A job file consists of global settings plus a list of steps. Each step loads a model, runs all of its generations, then unloads the model. Settings cascade from global to step to generation, with the more specific level taking precedence.

```yaml
steps:
  - mode: custom-voice
    model_size: 1.7b
    speaker: Ryan
    language: English
    generate:
      - text: "Hello world"
        output: hello.wav
      - text: "I cannot believe this!"
        speaker: Vivian
        instruct: "Speak angrily"
        output: angry.wav

  - mode: voice-design
    generate:
      - text: "Welcome to our store."
        instruct: "A warm, friendly young female voice with a cheerful tone"
        output: welcome.wav

  - mode: voice-clone
    model_size: 1.7b
    ref_audio: ref.wav
    ref_text: "Transcript of reference"
    generate:
      - text: "First line in cloned voice"
        output: clone1.wav
      - text: "Second line"
        output: clone2.wav
```
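To make the cascade concrete, here is a small sketch that writes a job file setting the same options at different levels. It assumes that global settings are plain top-level keys next to `steps`; the file name `cascade-demo.yaml` is arbitrary.

```bash
# Write a job file that sets options at all three levels.
# ASSUMPTION: global settings are top-level keys alongside `steps`.
cat > cascade-demo.yaml <<'EOF'
temperature: 0.7        # global: applies to every generation below
language: English       # global

steps:
  - mode: custom-voice
    speaker: Ryan       # step: default speaker for this step
    generate:
      - text: "Spoken by Ryan at temperature 0.7"
        output: ryan.wav
      - text: "Spoken by Aiden at temperature 0.5"
        speaker: Aiden      # generation: overrides the step's speaker
        temperature: 0.5    # generation: overrides the global temperature
        output: aiden.wav
EOF

# Submit it the usual way:
#   scripts/qwenspeak.sh "tts" < cascade-demo.yaml
```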

## Modes

**custom-voice**: Pick one of the 9 preset speakers. The 1.7B model supports emotion/style control via `instruct`.

**voice-design**: Describe the voice in natural language via `instruct`. Available with the 1.7B model only.

**voice-clone**: Clone a voice from reference audio. Set `ref_audio` and `ref_text` at step level to reuse them across generations. Setting `x_vector_only: true` uses the speaker embedding only, so no transcript is needed.

### Emotion trick for cloned voices

Upload one reference recording per emotion, then use a separate step for each so that every step clones from the matching reference:

```bash
scripts/qwenspeak.sh "create-dir refs"
scripts/qwenspeak.sh "put refs/happy.wav" < me_happy.wav
scripts/qwenspeak.sh "put refs/angry.wav" < me_angry.wav
```

```yaml
steps:
  - mode: voice-clone
    ref_audio: refs/happy.wav
    ref_text: "transcript of happy ref"
    generate:
      - text: "Great news everyone!"
        output: happy1.wav

  - mode: voice-clone
    ref_audio: refs/angry.wav
    ref_text: "transcript of angry ref"
    generate:
      - text: "This is unacceptable"
        output: angry1.wav
```
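With more than a couple of references, the uploads can be looped. This is a sketch only: `upload_refs` and the `QWENSPEAK` variable are hypothetical names, and it assumes that a failing `create-dir` on an existing directory can safely be ignored.

```bash
# Path to the wrapper; override QWENSPEAK to point elsewhere.
QWENSPEAK="${QWENSPEAK:-scripts/qwenspeak.sh}"

# Upload every .wav file in a local directory to refs/ on the server.
# File names must not contain spaces, because each command is passed as one string.
upload_refs() {
  local dir="$1" f name
  # ASSUMPTION: an error from create-dir (e.g. refs/ already exists) is harmless.
  "$QWENSPEAK" "create-dir refs" >/dev/null 2>&1 || true
  for f in "$dir"/*.wav; do
    [ -e "$f" ] || continue          # the glob matched nothing
    name="$(basename "$f")"
    "$QWENSPEAK" "put refs/$name" < "$f" || return 1
    echo "uploaded refs/$name"
  done
}

# Usage:
#   upload_refs ./my_recordings
```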

## Job Management

```bash
scripts/qwenspeak.sh "tts list-jobs"              # list all
scripts/qwenspeak.sh "tts list-jobs --json"        # JSON output
scripts/qwenspeak.sh "tts get-job <id>"            # job details
scripts/qwenspeak.sh "tts get-job-log <id>"        # view log
scripts/qwenspeak.sh "tts get-job-log <id> -f"     # follow log
scripts/qwenspeak.sh "tts cancel-job <id>"         # cancel
```

Statuses: `queued` → `running` → `completed` | `failed` | `cancelled`

Completed jobs are cleaned up automatically after 1 day, and all jobs after 1 week. UUID prefixes work in place of full IDs (e.g. the first 8 characters).
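Two small helpers capture the rules above for use in scripts: one shortens a UUID to its first 8 characters, the other tells whether a status is final. Both names are hypothetical; they are local conveniences and do not call the server.

```bash
# First 8 characters of a UUID, usable wherever a job ID is expected.
short_id() { printf '%.8s\n' "$1"; }

# Succeeds when the status is final, i.e. the job will not change any more.
is_terminal() {
  case "$1" in
    completed|failed|cancelled) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   scripts/qwenspeak.sh "tts get-job $(short_id "$full_uuid")"
#   is_terminal "$status" || echo "job still pending"
```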

## File Operations

All paths are relative to the work directory. Path traversal outside it is blocked.

| Command                | Description                        |
| ---------------------- | ---------------------------------- |
| `put <path>`           | Upload file from stdin             |
| `get <path>`           | Download file to stdout            |
| `list-files [--json]`  | List directory                     |
| `remove-file <path>`   | Delete a file                      |
| `create-dir <path>`    | Create directory                   |
| `remove-dir <path>`    | Remove empty directory             |
| `move-file <src> <dst>`| Move or rename                     |
| `copy-file <src> <dst>`| Copy a file                        |
| `file-exists <path>`   | Check if file exists (true/false)  |
| `search-files <glob>`  | Glob search (`**` recursive)       |
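These commands compose in scripts. The sketch below downloads a file only if the server reports that it exists, relying on `file-exists` printing `true` or `false` as listed in the table. The function name `fetch_if_exists` and the `QWENSPEAK` variable are hypothetical.

```bash
# Path to the wrapper; override QWENSPEAK to point elsewhere.
QWENSPEAK="${QWENSPEAK:-scripts/qwenspeak.sh}"

# Download a remote file to a local path, but only if it exists on the server.
# The local path defaults to the remote file's base name.
fetch_if_exists() {
  local remote="$1" local_path="${2:-$(basename "$1")}"
  if [ "$("$QWENSPEAK" "file-exists $remote")" = "true" ]; then
    "$QWENSPEAK" "get $remote" > "$local_path"
  else
    echo "missing on server: $remote" >&2
    return 1
  fi
}

# Usage:
#   fetch_if_exists hello.wav
#   fetch_if_exists refs/happy.wav ./happy-copy.wav
```

Checking first avoids leaving an empty or partial local file behind when the remote path is wrong.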

## Speakers

| Speaker  | Gender | Language | Description                                    |
| -------- | ------ | -------- | ---------------------------------------------- |
| Vivian   | Female | Chinese  | Bright, slightly edgy young voice              |
| Serena   | Female | Chinese  | Warm, gentle young voice                       |
| Uncle_Fu | Male   | Chinese  | Seasoned, low mellow timbre                    |
| Dylan    | Male   | Chinese  | Youthful Beijing dialect, clear natural timbre |
| Eric     | Male   | Chinese  | Lively Chengdu/Sichuan dialect, slightly husky |
| Ryan     | Male   | English  | Dynamic with strong rhythmic drive             |
| Aiden    | Male   | English  | Sunny American, clear midrange                 |
| Ono_Anna | Female | Japanese | Playful, light nimble timbre                   |
| Sohee    | Female | Korean   | Warm with rich emotion                         |

## YAML Options

Unless marked step-only, settings cascade from global to step to generation, and the more specific level overrides the more general one.

| Field                | Default   | Description                                                         |
| -------------------- | --------- | ------------------------------------------------------------------- |
| `dtype`              | `float32` | float32, float16, bfloat16 (float16/bfloat16 GPU only)             |
| `flash_attn`         | `auto`    | FlashAttention-2: auto-detects, auto-switches float32→bfloat16     |
| `temperature`        | `0.9`     | Sampling temperature                                                |
| `top_k`              | `50`      | Top-k sampling                                                      |
| `top_p`              | `1.0`     | Top-p / nucleus sampling                                            |
| `repetition_penalty` | `1.05`    | Repetition penalty                                                  |
| `max_new_tokens`     | `2048`    | Max codec tokens to generate                                        |
| `no_sample`          | `false`   | Greedy decoding                                                     |
| `streaming`          | `false`   | Streaming mode (lower latency)                                      |
| `mode`               | required  | Step only: `custom-voice`, `voice-design`, or `voice-clone`         |
| `model_size`         | `1.7b`    | Step only: `1.7b` or `0.6b`                                        |
| `text`               | required  | Text to synthesize                                                  |
| `output`             | required  | Output file path                                                    |
| `speaker`            | `Vivian`  | custom-voice: speaker name                                          |
| `language`           | `Auto`    | Language for synthesis                                               |
| `instruct`           | -         | custom-voice: emotion/style; voice-design: voice description        |
| `ref_audio`          | -         | voice-clone: reference audio file path                              |
| `ref_text`           | -         | voice-clone: transcript of reference audio                          |
| `x_vector_only`      | `false`   | voice-clone: use speaker embedding only                             |