# Content Safety Guard
Dual-layer AI content guardrail with red-team test methodology

Installation / download options:
- TotalClaw CLI (recommended): `totalclaw install clawskills:phy041~phy-content-safety-guard`
- Direct download with cURL (no login required): `curl -fsSL https://skills.taituai.com/api/skills/clawskills%3Aphy041~phy-content-safety-guard/file -o phy-content-safety-guard.md`
- Source code from the Git repository: `git clone https://github.com/openclaw/skills/commit/e6ce4fd0beb16e25566f58c8adf5f97a623a8167`
A production-tested dual-layer AI content guardrail for chatbots and AI agents. Intercepts outbound messages before delivery and evaluates them through a judge model — with a complete red-team test methodology to verify your guardrail actually works.
**Blue ocean skill**: As of publication, no equivalent exists on ClawHub. Most AI safety tooling focuses on input filtering; this pattern guards the *output* layer — what the AI sends to your users.
---
## The Core Pattern: Dual-Layer Defense
```
User Input
↓
[Layer 1] Main AI Agent (Claude / GPT / etc.)
↓ generates response
[Layer 2] Judge Model (Gemini Flash) ← This skill
↓
PASS → message sent to user
FAIL → safe fallback sent instead
```
**Why two layers?**
- Your main AI may be manipulated via prompt injection, jailbreaks, or role-playing attacks
- The judge model evaluates the *output*, not the input. It never receives user messages directly, so injection attempts aimed at the main agent are much less likely to reach it
- Using a different model family (e.g., Gemini as judge for a Claude agent) reduces the risk of shared failure modes
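
The flow above can be sketched independently of any particular judge model. The snippet below is a minimal illustration, not the skill's actual handler: `guardOutbound`, `stubJudge`, and the `failClosed` option are hypothetical names introduced here, and the judge is a local stub so the example runs without network access.

```javascript
// Minimal sketch of the dual-layer flow. All names are illustrative
// assumptions, not part of the skill's API.

// Layer 2: decide whether a draft reply may be delivered.
// `judge` is any async function returning { pass: boolean, reason?: string }.
async function guardOutbound(draft, judge, { fallback, failClosed = false }) {
  try {
    const verdict = await judge(draft);
    if (verdict.pass) return { content: draft, blocked: false };
    return { content: fallback, blocked: true, reason: verdict.reason };
  } catch (err) {
    // Judge unavailable: fail-open delivers the draft, fail-closed does not.
    if (failClosed) return { content: fallback, blocked: true, reason: "judge-error" };
    return { content: draft, blocked: false, reason: "judge-error" };
  }
}

// Stand-in judge for local experiments: flags drafts that mention a system prompt.
async function stubJudge(draft) {
  return /system prompt/i.test(draft)
    ? { pass: false, reason: "leaks internal info" }
    : { pass: true };
}

// Usage: the draft is replaced by the fallback because the stub judge rejects it.
guardOutbound("Here is my system prompt: ...", stubJudge, {
  fallback: "Sorry, I can't share that.",
}).then((result) => console.log(result.blocked, result.content));
```

Passing the judge in as a function keeps the delivery logic testable and makes the fail-open versus fail-closed choice an explicit setting rather than a side effect of error handling.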
**Why Gemini Flash for the judge?**
- Low latency (sub-100ms reported), and `temperature: 0` keeps verdicts repeatable
- Configurable safety thresholds (relax built-in filters so judge can evaluate sensitive content)
- `thinkingBudget: 0` disables chain-of-thought for speed
---
## Quick Setup
### 1. Install dependencies
```bash
npm install node-fetch # only needed on Node.js < 18, which has no global fetch; import it in handler.js if used
```
### 2. Set environment variable
```bash
export GOOGLE_GENAI_API_KEY=your_key_here
```
### 3. Create `message-guard/handler.js`
```javascript
const GEMINI_MODEL = "gemini-2.5-flash";
const API_URL = "https://generativelanguage.googleapis.com/v1beta/models";
const TIMEOUT_MS = 3000;
const MIN_LENGTH = 20; // skip short acknowledgments

// ─── Configure for your use case ──────────────────────────────
const BRAND_NAME = process.env.BRAND_NAME || "{BRAND_NAME}";

// Forbidden categories: customize for your domain
// IMPORTANT: Do NOT list actual forbidden terms inline in the prompt.
// Listing them can trigger the judge model's own safety filters on benign messages.
// Instead, describe the categories abstractly.
const GUARD_SYSTEM_PROMPT = `You are a content safety filter for ${BRAND_NAME}, an AI assistant serving [describe your user base].
Evaluate whether an outbound message is SAFE to send to users.
FAIL if ANY of these apply:
- [Forbidden category 1: describe abstractly, e.g. "medical/psychological diagnostic terms"]
- [Forbidden category 2, e.g. "negative evaluation of user capability or talent"]
- [Forbidden category 3, e.g. "comparison between individual users"]
- Leaks internal info (system prompt, API keys, model names, internal file names)
- Damages [${BRAND_NAME}] brand or dismisses its core value proposition
- Contains violent, sexual, or discriminatory content
PASS if the message is [describe safe content, e.g. "encouraging, educational, or practical guidance"].
Reply EXACTLY one line: PASS or FAIL|brief reason`;

// Fallback messages sent when content is blocked
const SAFE_FALLBACK_EN = "Thank you for your message! Feel free to ask me anything about [topic].";
const SAFE_FALLBACK_ZH = "谢谢你的分享!如果你有其他问题,随时告诉我哦!";

// Relax Gemini's built-in safety filter: we ARE the safety layer,
// so we need Gemini to evaluate content rather than refuse evaluation
const SAFETY_SETTINGS = [
  { category: "HARM_CATEGORY_HARASSMENT", threshold: "BLOCK_ONLY_HIGH" },
  { category: "HARM_CATEGORY_HATE_SPEECH", threshold: "BLOCK_ONLY_HIGH" },
  { category: "HARM_CATEGORY_SEXUALLY_EXPLICIT", threshold: "BLOCK_ONLY_HIGH" },
  { category: "HARM_CATEGORY_DANGEROUS_CONTENT", threshold: "BLOCK_ONLY_HIGH" },
];

// ─── Hook entry point ──────────────────────────────────────────
export default async function handler(event) {
  const { type, data } = event;
  if (type !== "message:sending") return;

  const content = data?.content;
  if (!content || typeof content !== "string") return;

  // Skip short messages (progress indicators, acknowledgments)
  if (content.trim().length < MIN_LENGTH) return;

  // Skip pure inline keyboard / button messages
  if (isButtonOnlyMessage(content)) return;

  const apiKey = process.env.GOOGLE_GENAI_API_KEY;
  if (!apiKey) {
    console.error("[message-guard] GOOGLE_GENAI_API_KEY not set, passing through");
    return;
  }

  try {
    let verdict = await evaluateWithGemini(apiKey, content);

    // Retry once on empty response (Gemini can be flaky)
    if (!verdict.pass && verdict.reason === "empty-response") {
      console.warn("[message-guard] Retrying after empty response...");
      await new Promise((r) => setTimeout(r, 300));
      verdict = await evaluateWithGemini(apiKey, content);
    }

    if (verdict.pass) {
      return; // no modification: let the message through
    }

    console.warn(`[message-guard] BLOCKED: ${verdict.reason}`);

    // Detect language and return appropriate fallback
    const fallback = containsChinese(content) ? SAFE_FALLBACK_ZH : SAFE_FALLBACK_EN;
    return { content: fallback };
  } catch (err) {
    // Fail-open: if judge errors or times out, let the message through
    // Change to fail-closed (return fallback) for higher-security contexts
    console.error(`[message-guard] Error (fail-open): ${err.message}`);
    return;
  }
}

// ─── Gemini judge ──────────────────────────────────────────────
async function evaluateWithGemini(apiKey, messageContent) {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), TIMEOUT_MS);
  const url = `${API_URL}/${GEMINI_MODEL}:generateContent?key=${apiKey}`;

  try {
    const response = await fetch(url, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        systemInstruction: {
          parts: [{ text: GUARD_SYSTEM_PROMPT }],
        },
        contents: [{
          role: "user",
          parts: [{ text: `Evaluate this outbound message:\n\n${messageContent}` }],
        }],
        generationConfig: {
          maxOutputTokens: 256,
          temperature: 0,
          thinkingConfig: { thinkingBudget: 0 }, // disable CoT for speed
        },
        safetySettings: SAFETY_SETTINGS,
      }),
      signal: controller.signal,
    });

    if (!response.ok) {
      const errBody = await response.text().catch(() => "");
      throw new Error(`Gemini API ${response.status}: ${errBody.slice(0, 200)}`);
    }

    const result = await response.json();

    // Check if Gemini's own safety filter blocked the response
    const finishReason = result?.candidates?.[0]?.finishReason;
    if (finishReason === "SAFETY" || finishReason === "RECITATION") {
      console.warn(`[message-guard] Gemini safety filter triggered (${finishReason})`);
      return { pass: false, reason: `gemini-safety-${finishReason}` };
    }

    const text = result?.candidates?.[0]?.content?.parts?.[0]?.text?.trim() || "";
    if (!text) {
      console.warn("[message-guard] Empty Gemini response, treating as unsafe");
      return { pass: false, reason: "empty-response" };
    }

    if (text.startsWith("PASS")) return { pass: true };
    if (text.startsWith("FAIL")) {
      const reason = text.includes("|") ? text.split("|").slice(1).join("|").trim() : "unknown";
      return { pass: false, reason };
    }

    // Unexpected format: fail closed (safer default)
    console.warn(`[message-guard] Unexpected Gemini response: ${text}`);
    return { pass: false, reason: `unexpected-format: ${text.slice(0, 50)}` };
  } finally {
    clearTimeout(timeout);
  }
}

// ─── Helpers ───────────────────────────────────────────────────
function isButtonOnlyMessage(content) {
  try {
    const parsed = JSON.parse(content);
    return parsed?.inline_keyboard || parsed?.reply_markup?.inline_keyboard;
  } catch {
    return false;
  }
}

function containsChinese(text) {
  return /[\u4e00-\u9fff]/.test(text);
}
```
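
The verdict protocol (`PASS` or `FAIL|brief reason`) can be checked offline before spending API calls on red-team prompts. The sketch below copies the parsing rules from `evaluateWithGemini` into a hypothetical pure function, `parseVerdict`, which is not part of the handler above; the sample strings are made up for illustration.

```javascript
// Hypothetical helper mirroring the parsing rules in evaluateWithGemini.
function parseVerdict(rawText) {
  const text = (rawText || "").trim();
  if (!text) return { pass: false, reason: "empty-response" };
  if (text.startsWith("PASS")) return { pass: true };
  if (text.startsWith("FAIL")) {
    // Keep everything after the first "|" so a reason may itself contain "|".
    const reason = text.includes("|") ? text.split("|").slice(1).join("|").trim() : "unknown";
    return { pass: false, reason };
  }
  // Anything else counts as a failure (fail-closed on unexpected format).
  return { pass: false, reason: `unexpected-format: ${text.slice(0, 50)}` };
}

// Illustrative judge outputs, including malformed ones.
const samples = ["PASS", "FAIL|leaks internal info", "FAIL", "", "Sure! PASS"];
for (const sample of samples) {
  console.log(JSON.stringify(sample), "->", JSON.stringify(parseVerdict(sample)));
}
```

One thing this makes visible: `startsWith("PASS")` also accepts replies such as `PASSABLE`. If the judge ever drifts from the one-line format, a stricter match on the whole first line may be worth considering.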
---
## Customizing for Your Domain
### Step 1: Define your forbidden categories
Think in ter