IMA Studio TTS

GitHub · Author: IMA Studio (imastudio.com) · v1.0.0

Use when generating speech from text (text-to-speech) via IMA Open API. Use for: voice synthesis, TTS, read-aloud, speech synthesis, voice-over, audio content. Output: audio URL (mp3/wav). Flow: query products → create task → poll until done. Requires IMA API key. This skill targets seed-tts-2.0 only (seed-tts-1.1 is not supported). Default model is seed-tts-2.0.

Installation / Download

TotalClaw CLI (recommended)

totalclaw install github:LeoYeAI~openclaw-master-skills~ima-tts-ai

cURL (direct download, no login required)

curl -fsSL https://skills.taituai.com/api/skills/github%3ALeoYeAI~openclaw-master-skills~ima-tts-ai/file -o ima-tts-ai.md

# IMA TTS (Text-to-Speech)

## Overview

Call IMA Open API to create **text-to-speech** audio. Same flow as other IMA creation skills: **query products → create task → poll until done**. Task type is `text_to_speech`. **This skill targets seed-tts-2.0 only** — seed-tts-1.1 is not supported; the script defaults to `seed-tts-2.0` when no model is specified.

## ⚙️ How This Skill Works

This skill uses a bundled Python script `scripts/ima_tts_create.py` to call the IMA Open API:

- Sends **text (prompt)** to `https://api.imastudio.com`
- Uses `--user-id` only locally for preference storage
- Returns an **audio URL** when synthesis is complete
- **Reflection mechanism**: on create failure, retries up to 3 times with parameter adjustments

**What gets sent to IMA:** prompt (text to speak), model selection, parameters (e.g. voice_id, speed). **Not sent:** API key in prompt body; user_id is local only.

### Agent Execution

Use the bundled script:

```bash
# List available TTS models (optional; default is seed-tts-2.0)
python3 {baseDir}/scripts/ima_tts_create.py --api-key $IMA_API_KEY --list-models

# Generate speech (default model: seed-tts-2.0; omit --model-id to use default)
python3 {baseDir}/scripts/ima_tts_create.py \
  --api-key $IMA_API_KEY \
  --model-id seed-tts-2.0 \
  --prompt "Text to be spoken here." \
  --user-id {user_id} \
  --output-json
```

Script outputs JSON; parse it for `url` and pass to the user via the UX protocol below.
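
A minimal sketch of that parsing step, assuming the script prints a single JSON object with the audio URL under a top-level `url` key (the exact output shape is not documented here, so treat the key location and the helper name `extract_audio_url` as assumptions):

```python
import json


def extract_audio_url(script_stdout: str) -> str:
    """Return the audio URL from the script's JSON output.

    Assumption: the output is one JSON object with a top-level "url" key.
    """
    result = json.loads(script_stdout)
    if not isinstance(result, dict):
        raise ValueError("expected a JSON object from the script")
    url = result.get("url")
    if not isinstance(url, str) or not url:
        raise ValueError("no audio URL found in script output")
    return url


# Illustrative output only; the real script may include more fields.
sample = '{"url": "https://cdn.example.com/output.mp3", "format": "mp3"}'
print(extract_audio_url(sample))
```

Failing loudly when `url` is absent keeps an agent from handing the user an empty link.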

---

## Environment

Base URL: `https://api.imastudio.com`

| Header | Required | Value |
|--------|----------|-------|
| `Authorization` | ✅ | `Bearer ima_your_api_key_here` |
| `x-app-source` | ✅ | `ima_skills` |
| `x_app_language` | recommended | `en` / `zh` |
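
As a sketch, the three headers in the table can be assembled in one place. The helper name `build_headers` is hypothetical; the header names and values are the ones listed above.

```python
def build_headers(api_key: str, language: str = "en") -> dict:
    """Build the request headers listed in the Environment table."""
    if language not in ("en", "zh"):
        raise ValueError("language should be 'en' or 'zh'")
    return {
        "Authorization": f"Bearer {api_key}",
        "x-app-source": "ima_skills",
        "x_app_language": language,
    }


# Placeholder key, not a real credential.
print(build_headers("ima_your_api_key_here", "zh"))
```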

---

## ⚠️ MANDATORY: Always Query Product List First

You **MUST** call `/open/v1/product/list` with `category=text_to_speech` before creating any task. `attribute_id` is required; if it is 0 or missing, task creation fails with `"Invalid product attribute"`.

```
GET /open/v1/product/list?app=ima&platform=web&category=text_to_speech
```

Then traverse the V2 tree: `type=2` = model groups, `type=3` = versions (leaves). Only `type=3` nodes have `credit_rules` and `form_config`. Use a leaf’s `model_id`, `id` (= model_version), and `credit_rules[0].attribute_id` / `points` for create.
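
The traversal can be sketched as a small recursive walk. This assumes child nodes are nested under a `children` key, which is a guess about the response shape and is exposed as a parameter; the sample tree is illustrative data, not a real API response.

```python
from typing import Iterator


def iter_leaves(nodes: list, children_key: str = "children") -> Iterator[dict]:
    """Yield every type=3 (version) node in the product tree."""
    for node in nodes:
        if node.get("type") == 3:
            yield node
        else:
            # Assumption: nested nodes live under `children_key`.
            yield from iter_leaves(node.get(children_key) or [], children_key)


def pick_model(nodes: list, model_id: str = "seed-tts-2.0") -> dict:
    """Return the fields needed by the create call for one model."""
    for leaf in iter_leaves(nodes):
        rules = leaf.get("credit_rules") or []
        if leaf.get("model_id") == model_id and rules:
            attribute_id = rules[0].get("attribute_id")
            if not attribute_id:
                raise ValueError("attribute_id is 0 or missing")
            return {
                "model_id": leaf["model_id"],
                "model_version": leaf["id"],
                "attribute_id": attribute_id,
                "credit": rules[0].get("points"),
            }
    raise LookupError(f"model {model_id!r} not found in product list")


# Illustrative tree only.
sample_tree = [{
    "type": 2,
    "children": [{
        "type": 3,
        "id": "ver_xxx",
        "model_id": "seed-tts-2.0",
        "credit_rules": [{"attribute_id": 4419, "points": 2}],
        "form_config": [],
    }],
}]
print(pick_model(sample_tree))
```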

---

## Core Flow

```
1. GET /open/v1/product/list?app=ima&platform=web&category=text_to_speech
   → Get attribute_id, credit, model_version, form_config

2. POST /open/v1/tasks/create
   → task_type: "text_to_speech", parameters[].parameters.prompt = text to speak

3. POST /open/v1/tasks/detail  { "task_id": "..." }
   → Poll every 2–5s until medias[].resource_status == 1 and status != "failed"
   → Read medias[].url (and optional duration_str, format)
```
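
Step 3 is a polling loop. The sketch below covers only the timing side (a fixed interval inside the 2-5 s window and an overall timeout) and takes the HTTP call and the status check as injected callables, so nothing here claims to be the bundled script's implementation. `poll_task`, `fetch_detail`, `classify`, and the 300 s default timeout are all hypothetical.

```python
import time
from typing import Callable


def poll_task(
    fetch_detail: Callable[[str], dict],
    classify: Callable[[dict], str],
    task_id: str,
    interval: float = 3.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_detail until classify reports 'success' or 'failed'."""
    if not 2.0 <= interval <= 5.0:
        raise ValueError("interval should stay within the 2-5 s window")
    deadline = time.monotonic() + timeout
    while True:
        detail = fetch_detail(task_id)
        state = classify(detail)
        if state == "success":
            return detail
        if state == "failed":
            raise RuntimeError(f"task {task_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} not finished after {timeout}s")
        sleep(interval)


# Demo with canned responses instead of real HTTP calls.
canned = iter([
    {"medias": [{"resource_status": 0, "status": "processing"}]},
    {"medias": [{"resource_status": 1, "status": "success",
                 "url": "https://cdn.example.com/output.mp3"}]},
])
# Toy classifier for the demo; a fuller one would apply every completion rule.
toy = lambda d: "success" if d["medias"][0]["resource_status"] == 1 else "processing"
done = poll_task(lambda _id: next(canned), toy, "task_xxx", sleep=lambda s: None)
print(done["medias"][0]["url"])
```

Injecting `sleep` keeps the loop testable without real waiting.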

---

## Task Detail API — Actual Response Shape

Poll `POST /open/v1/tasks/detail` until completion. Response uses the same structure as other IMA audio tasks:

| Field | Type | Meaning |
|-------|------|--------|
| `resource_status` | int or null | 0 = processing, 1 = available, 2 = failed, 3 = deleted; treat null as 0 |
| `status` | string | "pending" / "processing" / "success" / "failed" |
| `url` | string | Audio URL when resource_status=1 (mp3/wav) |
| `duration_str` | string | Optional, e.g. "30s" |
| `format` | string | Optional, e.g. "mp3", "wav" |

**Completed success example:**

```json
{
  "id": "task_xxx",
  "medias": [{
    "resource_status": 1,
    "status": "success",
    "url": "https://cdn.../output.mp3",
    "duration_str": "12s",
    "format": "mp3"
  }]
}
```

**Rules:**

- Treat `resource_status: null` as 0 (processing).
- Success only when **all** medias have `resource_status == 1` and `status != "failed"`.
- On `resource_status == 2` or `status == "failed"`, stop and handle error (e.g. use `error_msg` / `remark`).
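
The three rules can be expressed as one pure function over the `medias` array. This is a sketch: it assumes `error_msg` / `remark` sit on the media entry (their location is not specified above), and the name `classify_medias` is hypothetical.

```python
from typing import Optional, Tuple


def classify_medias(medias: list) -> Tuple[str, Optional[str]]:
    """Return (state, error); state is 'processing', 'success' or 'failed'."""
    for media in medias:
        if media.get("resource_status") == 2 or media.get("status") == "failed":
            # Assumption: the error text is carried on the media entry.
            error = media.get("error_msg") or media.get("remark") or "unknown error"
            return "failed", error
    # A null resource_status counts as 0 (processing).
    # Status 3 (deleted) is not covered by the rules; it stays 'processing' here.
    if medias and all((m.get("resource_status") or 0) == 1 for m in medias):
        return "success", None
    return "processing", None


print(classify_medias([{"resource_status": None, "status": "pending"}]))
print(classify_medias([{"resource_status": 1, "status": "success"}]))
```

An empty `medias` list is treated as still processing, which is also an assumption.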

---

## API 2: Create Task

```
POST /open/v1/tasks/create
```

**text_to_speech** — no image input. `src_img_url: []`, `input_images: []`.

```json
{
  "task_type": "text_to_speech",
  "enable_multi_model": false,
  "src_img_url": [],
  "parameters": [{
    "attribute_id":  "<from credit_rules>",
    "model_id":      "<model_id>",
    "model_name":    "<model_name>",
    "model_version": "<version_id>",
    "app":           "ima",
    "platform":      "web",
    "category":      "text_to_speech",
    "credit":        "<points>",
    "parameters": {
      "prompt":       "Text to be spoken.",
      "n":            1,
      "input_images": [],
      "cast":         {"points": "<points>", "attribute_id": "<attribute_id>"}
    }
  }]
}
```

`prompt` must be inside `parameters[].parameters`, not at top level. Extra fields (e.g. voice_id, speed) come from product `form_config`; include only those present in the product’s credit_rules/form_config.

Response: `data.id` = task_id for polling.
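
A sketch of assembling that request body from values read off the product list. The function name `build_create_payload` is hypothetical, and the sample values (4419, 2, the `<...>` placeholders) are illustrative.

```python
import json
from typing import Optional


def build_create_payload(prompt: str, model_id: str, model_name: str,
                         model_version: str, attribute_id: int, credit: int,
                         extra: Optional[dict] = None) -> dict:
    """Build the text_to_speech create-task body shown above."""
    if not attribute_id:
        raise ValueError("attribute_id must come from the product list")
    # Extra form_config fields go first so they cannot override the fixed keys.
    inner = {
        **(extra or {}),
        "prompt": prompt,
        "n": 1,
        "input_images": [],
        "cast": {"points": credit, "attribute_id": attribute_id},
    }
    return {
        "task_type": "text_to_speech",
        "enable_multi_model": False,
        "src_img_url": [],
        "parameters": [{
            "attribute_id": attribute_id,
            "model_id": model_id,
            "model_name": model_name,
            "model_version": model_version,
            "app": "ima",
            "platform": "web",
            "category": "text_to_speech",
            "credit": credit,
            "parameters": inner,
        }],
    }


payload = build_create_payload("Text to be spoken.", "seed-tts-2.0",
                               "<model_name>", "<version_id>", 4419, 2)
print(json.dumps(payload, indent=2))
```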

---

## Supported Task Type & Models

| category | Capability | Input |
|----------|------------|-------|
| `text_to_speech` | Text → Speech | prompt (text to speak) |

**Models:** This skill supports **seed-tts-2.0** only (seed-tts-1.1 is not supported). The script defaults to `--model-id seed-tts-2.0` when none is provided. For current `attribute_id` and `credit`, the script reads from the product list at runtime.

### seed-tts-2.0 — Verified request parameters

The following `parameters[].parameters` shape has been verified to work for **seed-tts-2.0** (attribute_id/credit come from product list and may differ by app/platform):

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `prompt` | string | ✅ | Text to synthesize. |
| `n` | int | ✅ | Usually 1. |
| `model` | string | ✅ | Sub-model: `seed-tts-2.0-expressive` (default) or `seed-tts-2.0-standard`. |
| `speaker` | string | optional | Speaker ID (voice), e.g. `zh_male_sophie_uranus_bigtts`, a native `voice_type` from the [voice list 1257544](https://www.volcengine.com/docs/6561/1257544). **Note:** use the native format (e.g. `zh_male_*_uranus_bigtts`); the `BV*_streaming` format is not supported. |
| `audio_params` | object | optional | `emotion`, `speech_rate` (speed, range [-50, 100]), `loudness_rate` (volume, range [-50, 100]), etc.; see the [request body, doc 1598757](https://www.volcengine.com/docs/6561/1598757?lang=zh). |
| `additions` | object | optional | e.g. `{"explicit_language": "crosslingual", "context_texts": []}`. |
| `cast` | object | ✅ | `{"points": <credit>, "attribute_id": <attribute_id>}` from product list. |

**Script example with extra params:**

```bash
python3 ima_tts_create.py --api-key $IMA_API_KEY --model-id seed-tts-2.0 \
  --prompt "阳光青年音色测试，你好世界。" \
  --extra-params '{"model":"seed-tts-2.0-expressive","speaker":"zh_male_sophie_uranus_bigtts","audio_params":{"emotion":"neutral"},"additions":{"explicit_language":"crosslingual","context_texts":[]}}' \
  --output-json
```
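
The `--extra-params` JSON can also be built and checked programmatically before it reaches the shell. The sketch below enforces only constraints stated in the parameter table (the two sub-models, the [-50, 100] ranges, no `BV*_streaming` speakers); the helper name `build_extra_params` is hypothetical.

```python
import json
import shlex
from typing import Optional

SUB_MODELS = ("seed-tts-2.0-expressive", "seed-tts-2.0-standard")


def build_extra_params(speaker: Optional[str] = None,
                       model: str = "seed-tts-2.0-expressive",
                       emotion: Optional[str] = None,
                       speech_rate: Optional[int] = None,
                       loudness_rate: Optional[int] = None) -> dict:
    """Assemble and validate the extra request parameters."""
    if model not in SUB_MODELS:
        raise ValueError(f"unknown sub-model: {model}")
    params = {"model": model}
    if speaker is not None:
        if speaker.startswith("BV") and speaker.endswith("_streaming"):
            raise ValueError("BV*_streaming speaker IDs are not supported")
        params["speaker"] = speaker
    audio = {}
    if emotion is not None:
        audio["emotion"] = emotion
    for name, value in (("speech_rate", speech_rate),
                        ("loudness_rate", loudness_rate)):
        if value is not None:
            if not -50 <= value <= 100:
                raise ValueError(f"{name} must be within [-50, 100]")
            audio[name] = value
    if audio:
        params["audio_params"] = audio
    return params


extra = build_extra_params(speaker="zh_male_sophie_uranus_bigtts", emotion="neutral")
# shlex.quote makes the JSON safe to paste after --extra-params in a shell.
print("--extra-params " + shlex.quote(json.dumps(extra, ensure_ascii=False)))
```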

**Note:** The script gets `attribute_id` and `credit` from the product list (e.g. `app=ima&platform=web` → often 2 pts / attribute_id 4419 for seed-tts-2.0). If you have a different app/platform (e.g. webAgent), the product list may return different credit_rules (e.g. 5 pts / attribute_id 8987); the script uses whatever the product list returns for the chosen model.

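**Speaker / voice list (seed-tts-2.0 is compatible with Volcengine voices):** For the full set of voice IDs and their scenario categories, see `volcengine_tts_timbre_list.json` in the project. The file is derived from the [Volcengine Doubao speech synthesis voice list](https://www.volcengine.com/docs/6561/1257544) and uses the native `voice_type` format (e.g. `zh_male_sophie_uranus_bigtts` "Charming Sophie", `zh_female_vv_uranus_bigtts` "Vivi"). **⚠️ Note:** The IMA API supports only the native format (the `*_uranus_bigtts` series); `BV*_streaming` Doubao voice IDs are not supported.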

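**Mapping to the Volcengine 2.0 documentation:** The parameters above match the [HTTP Chunked/SSE unidirectional streaming V3 request body](https://www.volcengine.com/docs/6561/1598757?lang=zh): `req_params.text` → `prompt`, `req_params.speaker` → `speaker` (a required field in that request body), `req_params.model` → `model` (expressive/standard), `req_params.audio_params` (`emotion`, `speech_rate`, `loudness_rate`, etc.), and `req_params.additions` (e.g. `explicit_language`). For the 2.0 capabilities (voice instructions, quoting preceding context, voice tags, etc.), see the [Doubao Speech Synthesis 2.0 capability overview](https://www.volcengine.com/docs/6561/1871062?lang=zh).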

---

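## 🎤 What to Ask When the User Says "Make Me a Narration / Voice-Over"

When the user expresses an intent such as "make me a narration", "do a voice-over", or "read this text aloud", you **must collect the key information before calling the script**, to avoid missing parameters or blind defaults.

### Must Ask

| Question | Parameter | Notes |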
|--------|-------