prompt-engineer-toolkit

TotalClaw 作者 Alireza Rezvani v1.0.0

分析和重写提示以实现更好的 AI 输出,为营销用例(广告文案、电子邮件营销活动、社交媒体)创建可重复使用的提示模板,并构建端到端 AI 内容工作流程。当用户想要改进 AI 辅助营销的提示、构建提示模板或优化 AI 内容工作流程时使用。当用户提及“提示工程”、“改进我的提示”、“AI 写作质量”、“提示模板”或“AI 内容工作流程”时也可使用。

源码 ↗

安装 / 下载方式

TotalClaw CLI推荐
totalclaw install totalclaw:alirezarezvani~prompt-engineer-toolkit
cURL直接下载,无需登录
curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Aalirezarezvani~prompt-engineer-toolkit/file -o prompt-engineer-toolkit.md
Git 仓库获取源码
git clone https://github.com/openclaw/skills/commit/045a3640915c951aa6d47712809273fc5b0f4d75
## 概述(中文)

分析和重写提示以实现更好的 AI 输出,为营销用例(广告文案、电子邮件营销活动、社交媒体)创建可重复使用的提示模板,并构建端到端 AI 内容工作流程。当用户想要改进 AI 辅助营销的提示、构建提示模板或优化 AI 内容工作流程时使用。当用户提及“提示工程”、“改进我的提示”、“AI 写作质量”、“提示模板”或“AI 内容工作流程”时也可使用。

## 原文

# Prompt Engineer Toolkit

## Overview

Use this skill to move prompts from ad-hoc drafts to production assets with repeatable testing, versioning, and regression safety. It emphasizes measurable quality over intuition. Apply it when launching a new LLM feature that needs reliable outputs, when prompt quality degrades after model or instruction changes, when multiple team members edit prompts and need history/diffs, when you need evidence-based prompt choice for production rollout, or when you want consistent prompt governance across environments.

## Core Capabilities

- A/B prompt evaluation against structured test cases
- Quantitative scoring for adherence, relevance, and safety checks
- Prompt version tracking with immutable history and changelog
- Prompt diffs to review behavior-impacting edits
- Reusable prompt templates and selection guidance
- Regression-friendly workflows for model/prompt updates

## Key Workflows

### 1. Run Prompt A/B Test

Prepare JSON test cases and run:

```bash
python3 scripts/prompt_tester.py \
  --prompt-a-file prompts/a.txt \
  --prompt-b-file prompts/b.txt \
  --cases-file testcases.json \
  --runner-cmd 'my-llm-cli --prompt {prompt} --input {input}' \
  --format text
```

Input can also come from stdin/`--input` JSON payload.

### 2. Choose Winner With Evidence

The tester scores outputs per case and aggregates:

- expected content coverage
- forbidden content violations
- regex/format compliance
- output length sanity

Use the higher-scoring prompt as candidate baseline, then run regression suite.

### 3. Version Prompts

```bash
# Add version
python3 scripts/prompt_versioner.py add \
  --name support_classifier \
  --prompt-file prompts/support_v3.txt \
  --author alice

# Diff versions
python3 scripts/prompt_versioner.py diff --name support_classifier --from-version 2 --to-version 3

# Changelog
python3 scripts/prompt_versioner.py changelog --name support_classifier
```

### 4. Regression Loop

1. Store baseline version.
2. Propose prompt edits.
3. Re-run A/B test.
4. Promote only if score and safety constraints improve.

## Script Interfaces

- `python3 scripts/prompt_tester.py --help`
  - Reads prompts/cases from stdin or `--input`
  - Optional external runner command
  - Emits text or JSON metrics
- `python3 scripts/prompt_versioner.py --help`
  - Manages prompt history (`add`, `list`, `diff`, `changelog`)
  - Stores metadata and content snapshots locally

## Pitfalls, Best Practices & Review Checklist

**Avoid these mistakes:**
1. Picking prompts from single-case outputs — use a realistic, edge-case-rich test suite.
2. Changing prompt and model simultaneously — always isolate variables.
3. Missing `must_not_contain` (forbidden-content) checks in evaluation criteria.
4. Editing prompts without version metadata, author, or change rationale.
5. Skipping semantic diffs before deploying a new prompt version.
6. Optimizing one benchmark while harming edge cases — track the full suite.
7. Model swap without rerunning the baseline A/B suite.

**Before promoting any prompt, confirm:**
- [ ] Task intent is explicit and unambiguous.
- [ ] Output schema/format is explicit.
- [ ] Safety and exclusion constraints are explicit.
- [ ] No contradictory instructions.
- [ ] No unnecessary verbosity tokens.
- [ ] A/B score improves and violation count stays at zero.

## References

- [references/prompt-templates.md](references/prompt-templates.md)
- [references/technique-guide.md](references/technique-guide.md)
- [references/evaluation-rubric.md](references/evaluation-rubric.md)
- [README.md](README.md)

## Evaluation Design

Each test case should define:

- `input`: realistic production-like input
- `expected_contains`: required markers/content
- `forbidden_contains`: disallowed phrases or unsafe content
- `expected_regex`: required structural patterns

This enables deterministic grading across prompt variants.

## Versioning Policy

- Use semantic prompt identifiers per feature (`support_classifier`, `ad_copy_shortform`).
- Record author + change note for every revision.
- Never overwrite historical versions.
- Diff before promoting a new prompt to production.

## Rollout Strategy

1. Create baseline prompt version.
2. Propose candidate prompt.
3. Run A/B suite against same cases.
4. Promote only if winner improves average and keeps violation count at zero.
5. Track post-release feedback and feed new failure cases back into test suite.