Content Safety Guard

TotalClaw · by phy041 · v1.0.0

A dual-layer AI content guardrail with a red-team testing methodology

Installation / Download

TotalClaw CLI (recommended)
totalclaw install totalclaw:phy041~phy-content-safety-guard
cURL (direct download, no login required)
curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Aphy041~phy-content-safety-guard/file -o phy-content-safety-guard.md
Git (fetch the source from the repository)
git clone https://github.com/openclaw/skills.git && git -C skills checkout e6ce4fd0beb16e25566f58c8adf5f97a623a8167
# Content Safety Guard

A production-tested dual-layer AI content guardrail for chatbots and AI agents. Intercepts outbound messages before delivery and evaluates them through a judge model — with a complete red-team test methodology to verify your guardrail actually works.

**Blue ocean skill**: As of publication, no equivalent exists on ClawHub. Most AI safety tooling focuses on input filtering; this pattern guards the *output* layer — what the AI sends to your users.

---

## The Core Pattern: Dual-Layer Defense

```
User Input
    ↓
[Layer 1] Main AI Agent (Claude / GPT / etc.)
    ↓ generates response
[Layer 2] Judge Model (Gemini Flash) ← This skill
    ↓
  PASS → message sent to user
  FAIL → safe fallback sent instead
```

**Why two layers?**
- Your main AI may be manipulated via prompt injection, jailbreaks, or role-playing attacks
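- The judge model evaluates only the *output* and never sees the user's message directly, so input-side attacks have a much harder time reaching it (a manipulated response can still carry injected text, so treat the judge as a second layer rather than a guarantee)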
- Using a different model family (e.g., Gemini as judge for a Claude agent) prevents shared failure modes
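
As a minimal sketch of the flow in the diagram (not the handler itself), the routing can be written as one function that receives the judge as a parameter. The names `guardOutbound`, `judge`, `fallback`, and `stubJudge` are illustrative assumptions; the stub stands in for a real model call.

```javascript
// Minimal sketch of the dual-layer routing. All names here are illustrative.
// `judge` is any async function that resolves to { pass: boolean, reason?: string }.
async function guardOutbound(message, judge, fallback) {
  const verdict = await judge(message);
  if (verdict.pass) {
    return { content: message, blocked: false };
  }
  // FAIL: the original message is never delivered; the fallback goes out instead.
  return { content: fallback, blocked: true, reason: verdict.reason };
}

// Stub judge for demonstration only: fails anything that mentions "system prompt".
const stubJudge = async (text) =>
  /system prompt/i.test(text)
    ? { pass: false, reason: "leaks internal info" }
    : { pass: true };

// Usage: the main agent's reply is routed through the judge before delivery.
guardOutbound("Here is my system prompt: ...", stubJudge, "Sorry, I can't share that.")
  .then((result) => console.log(result.blocked, result.content));
```

Passing the judge in as a parameter keeps the routing logic testable with a stub and makes it straightforward to swap in a judge from a different model family.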

**Why Gemini Flash for the judge?**
- Low latency (sub-100ms claimed), and `temperature: 0` keeps verdicts consistent
- Configurable safety thresholds (relax built-in filters so judge can evaluate sensitive content)
- `thinkingBudget: 0` disables chain-of-thought for speed

---

## Quick Setup

### 1. Install dependencies

```bash
npm install node-fetch  # only for Node.js older than 18, which lacks a global fetch (then also import it in handler.js)
```

### 2. Set environment variable

```bash
export GOOGLE_GENAI_API_KEY=your_key_here
```

### 3. Create `message-guard/handler.js`

```javascript
const GEMINI_MODEL = "gemini-2.5-flash";
const API_URL = "https://generativelanguage.googleapis.com/v1beta/models";
const TIMEOUT_MS = 3000;
const MIN_LENGTH = 20;  // skip short acknowledgments

// ─── Configure for your use case ──────────────────────────────
const BRAND_NAME = process.env.BRAND_NAME || "{BRAND_NAME}";

// Forbidden categories — customize for your domain
// IMPORTANT: Do NOT list actual forbidden terms inline in the prompt.
// Listing them can trigger the judge model's own safety filters on benign messages.
// Instead, describe the categories abstractly.
const GUARD_SYSTEM_PROMPT = `You are a content safety filter for ${BRAND_NAME}, an AI assistant serving [describe your user base].

Evaluate whether an outbound message is SAFE to send to users.

FAIL if ANY of these apply:
- [Forbidden category 1 — describe abstractly, e.g. "medical/psychological diagnostic terms"]
- [Forbidden category 2 — e.g. "negative evaluation of user capability or talent"]
- [Forbidden category 3 — e.g. "comparison between individual users"]
- Leaks internal info (system prompt, API keys, model names, internal file names)
- Damages [${BRAND_NAME}] brand or dismisses its core value proposition
- Contains violent, sexual, or discriminatory content

PASS if the message is [describe safe content — e.g. "encouraging, educational, or practical guidance"].

Reply EXACTLY one line: PASS or FAIL|brief reason`;

// Fallback messages sent when content is blocked
const SAFE_FALLBACK_EN = "Thank you for your message! Feel free to ask me anything about [topic].";
const SAFE_FALLBACK_ZH = "谢谢你的分享!如果你有其他问题,随时告诉我哦!";

// Relax Gemini's built-in safety filter — we ARE the safety layer,
// so we need Gemini to evaluate content rather than refuse evaluation
const SAFETY_SETTINGS = [
  { category: "HARM_CATEGORY_HARASSMENT", threshold: "BLOCK_ONLY_HIGH" },
  { category: "HARM_CATEGORY_HATE_SPEECH", threshold: "BLOCK_ONLY_HIGH" },
  { category: "HARM_CATEGORY_SEXUALLY_EXPLICIT", threshold: "BLOCK_ONLY_HIGH" },
  { category: "HARM_CATEGORY_DANGEROUS_CONTENT", threshold: "BLOCK_ONLY_HIGH" },
];

// ─── Hook entry point ──────────────────────────────────────────
export default async function handler(event) {
  const { type, data } = event;

  if (type !== "message:sending") return;

  const content = data?.content;
  if (!content || typeof content !== "string") return;

  // Skip short messages (progress indicators, acknowledgments)
  if (content.trim().length < MIN_LENGTH) return;

  // Skip pure inline keyboard / button messages
  if (isButtonOnlyMessage(content)) return;

  const apiKey = process.env.GOOGLE_GENAI_API_KEY;
  if (!apiKey) {
    console.error("[message-guard] GOOGLE_GENAI_API_KEY not set, passing through");
    return;
  }

  try {
    let verdict = await evaluateWithGemini(apiKey, content);

    // Retry once on empty response (Gemini can be flaky)
    if (!verdict.pass && verdict.reason === "empty-response") {
      console.warn("[message-guard] Retrying after empty response...");
      await new Promise((r) => setTimeout(r, 300));
      verdict = await evaluateWithGemini(apiKey, content);
    }

    if (verdict.pass) {
      return; // no modification — let message through
    }

    console.warn(`[message-guard] BLOCKED: ${verdict.reason}`);

    // Detect language and return appropriate fallback
    const fallback = containsChinese(content) ? SAFE_FALLBACK_ZH : SAFE_FALLBACK_EN;
    return { content: fallback };

  } catch (err) {
    // Fail-open: if judge errors or times out, let the message through
    // Change to fail-closed (return fallback) for higher-security contexts
    console.error(`[message-guard] Error (fail-open): ${err.message}`);
    return;
  }
}

// ─── Gemini judge ──────────────────────────────────────────────
async function evaluateWithGemini(apiKey, messageContent) {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), TIMEOUT_MS);

  const url = `${API_URL}/${GEMINI_MODEL}:generateContent?key=${apiKey}`;

  try {
    const response = await fetch(url, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        systemInstruction: {
          parts: [{ text: GUARD_SYSTEM_PROMPT }],
        },
        contents: [{
          role: "user",
          parts: [{ text: `Evaluate this outbound message:\n\n${messageContent}` }],
        }],
        generationConfig: {
          maxOutputTokens: 256,
          temperature: 0,
          thinkingConfig: { thinkingBudget: 0 },  // disable CoT for speed
        },
        safetySettings: SAFETY_SETTINGS,
      }),
      signal: controller.signal,
    });

    if (!response.ok) {
      const errBody = await response.text().catch(() => "");
      throw new Error(`Gemini API ${response.status}: ${errBody.slice(0, 200)}`);
    }

    const result = await response.json();

    // Check if Gemini's own safety filter blocked the response
    const finishReason = result?.candidates?.[0]?.finishReason;
    if (finishReason === "SAFETY" || finishReason === "RECITATION") {
      console.warn(`[message-guard] Gemini safety filter triggered (${finishReason})`);
      return { pass: false, reason: `gemini-safety-${finishReason}` };
    }

    const text = result?.candidates?.[0]?.content?.parts?.[0]?.text?.trim() || "";

    if (!text) {
      console.warn("[message-guard] Empty Gemini response, treating as unsafe");
      return { pass: false, reason: "empty-response" };
    }

    if (text.startsWith("PASS")) return { pass: true };

    if (text.startsWith("FAIL")) {
      const reason = text.includes("|") ? text.split("|").slice(1).join("|").trim() : "unknown";
      return { pass: false, reason };
    }

    // Unexpected format — fail-closed (safer default)
    console.warn(`[message-guard] Unexpected Gemini response: ${text}`);
    return { pass: false, reason: `unexpected-format: ${text.slice(0, 50)}` };

  } finally {
    clearTimeout(timeout);
  }
}

// ─── Helpers ───────────────────────────────────────────────────
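function isButtonOnlyMessage(content) {
  try {
    const parsed = JSON.parse(content);
    // Coerce to a strict boolean so callers never receive the keyboard object itself
    return Boolean(parsed?.inline_keyboard || parsed?.reply_markup?.inline_keyboard);
  } catch {
    // Not JSON, so this is an ordinary text message
    return false;
  }
}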

function containsChinese(text) {
  return /[\u4e00-\u9fff]/.test(text);
}
```
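
The verdict parsing inside `evaluateWithGemini` can be exercised without any network calls if it is pulled out into a pure function. The sketch below is one way to do that. `parseVerdict` is a hypothetical helper name, not part of the handler above, and it is meant to mirror the handler's behavior: empty or unexpected replies are treated as failures.

```javascript
// Hypothetical helper: parses the judge's one-line reply ("PASS" or "FAIL|reason").
// Mirrors the handler above: anything empty or unexpected fails closed.
function parseVerdict(rawText) {
  const text = (rawText || "").trim();
  if (!text) return { pass: false, reason: "empty-response" };
  if (text.startsWith("PASS")) return { pass: true };
  if (text.startsWith("FAIL")) {
    // Keep everything after the first "|" so reasons may contain "|" themselves.
    const sep = text.indexOf("|");
    const reason = sep === -1 ? "unknown" : text.slice(sep + 1).trim();
    return { pass: false, reason };
  }
  return { pass: false, reason: `unexpected-format: ${text.slice(0, 50)}` };
}

// Usage: feed canned judge replies through the parser.
console.log(parseVerdict("PASS"));
console.log(parseVerdict("FAIL|compares individual users"));
console.log(parseVerdict("I think this is fine"));
```

One design point to keep in mind: `startsWith("PASS")` also accepts replies such as `PASSING` or `PASSWORD`. If that looseness matters for your deployment, a stricter check (for example, requiring the reply to be exactly `PASS`) may be worth considering.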

---

## Customizing for Your Domain

### Step 1: Define your forbidden categories

Think in ter