sentry-alert-tuner

ClawSkills, by kn7d5eszdfwftk153ymhdm4qhs83qsqy, v1.0.0

Reduce Sentry alert fatigue by surgically tuning issue grouping, fingerprint rules, severity mapping, sample rates, before-send filters, sourcemap pipelines, and release-health gates. Acts as a senior SRE who has nursed Sentry installations through unicorn-scale traffic where a single bad deploy could fire 80,000 alerts. Covers Sentry SaaS and self-hosted (Sentry 24.x), Issues vs Performance vs Replays vs Profiling, integrations rate limiting (Slack, PagerDuty, Opsgenie, Jira), and release-health adoption / crash-free-session gates. Builds an Inbox hygiene playbook that survives turnover. Use when alerts are noisy, the on-call rotation hates Sentry, the bill is climbing, or Issues counts are unreadable. Triggers on "sentry", "sentry alerts", "alert fatigue", "fingerprint", "sentry inbox", "issue grouping", "before-send", "sample rate", "traces sample rate", "profiles sample rate", "release health", "sourcemap", "crash-free", "sentry noise", "sentry bill", "sentry tuning".

Install / Download

TotalClaw CLI (recommended)

totalclaw install clawskills:kn7d5eszdfwftk153ymhdm4qhs83qsqy~sentry-alert-tuner

cURL (direct download, no login required)

curl -fsSL https://skills.taituai.com/api/skills/clawskills%3Akn7d5eszdfwftk153ymhdm4qhs83qsqy~sentry-alert-tuner/file -o sentry-alert-tuner.md

# Sentry Alert Tuner

Tune a Sentry installation so that every alert that fires is worth a human looking at it. Acts as a senior SRE who has owned Sentry for an org with 200+ services, six-figure event volume per minute, and an on-call rotation that has been burned by every form of Sentry noise: sourcemap collapses, third-party widget storms, retry loops, browser extension errors, and the classic "we deployed a typo and got 14,000 alerts" scenario.

This skill does not write detection logic from scratch and does not replace your incident response process. It assumes Sentry is already installed and ingesting events; the job is to make those events actionable. Output is a set of concrete configuration changes (project settings, fingerprint rules, alert rules, SDK init code, CI sourcemap steps) plus a recurring Inbox hygiene playbook.

## Usage

Invoke when:

- The on-call rotation says "I just mute Sentry now"
- A single deploy generated more alerts than humans on the team
- Issues counts are dominated by browser extension noise or third-party scripts
- The Sentry bill jumped 3x without a traffic change
- New errors are buried under the same 12 long-tail issues from 2023
- Alert severities all look the same (everything is P2)
- Slack #alerts is unreadable; people muted the channel
- A postmortem revealed the real error was filed but auto-resolved or grouped into a megaissue

**Basic invocations:**
> Tune Sentry for our React + Django stack — alerts are 90% noise
> We just hit our event quota mid-month, sample rates need rethinking
> Build a fingerprint ruleset so retries don't spawn 400 issues per outage
> Set up release-health gates so a bad deploy auto-pauses rollout

## Inputs Required

- Sentry org slug + project list (or a project export)
- Stack: SDKs in use (browser, Node, Python, Go, Java, mobile) and their versions
- Current event volume per project per day (last 30 days)
- Current Issues count per project, ranked by event count
- Current alert rules and integration list (Slack channels, PagerDuty services)
- Release cadence and the existing release-health thresholds (if any)
- Sourcemap pipeline status: which projects upload, via which CI step
- The five most-hated noisy issues (the ones the team has muted manually)
- Any compliance constraints on PII scrubbing (GDPR, HIPAA, PCI)

## Workflow

1. **Pull the inventory.** Use the Sentry API (`/api/0/organizations/{org}/projects/`, `/api/0/organizations/{org}/issues/`, `/api/0/projects/{org}/{proj}/rules/`) to dump every project, alert rule, integration, environment, and the top 200 issues by event volume per project. Cache locally as JSON; the audit reruns weekly.

2. **Classify each project by tier.** T0 (customer-facing critical path: checkout, auth, payments), T1 (important but not revenue-blocking: dashboard, search), T2 (internal tools, batch jobs, marketing site), T3 (experiments, prototypes). Tier dictates sample rate, alert routing, and release-health strictness — not project type.

3. **Audit the top 50 issues per project.** For each, decide one of: keep-as-is, regroup-with-fingerprint, ignore-permanently, filter-at-source (before-send), or fix-the-bug. Most "noise" is actually one of the middle three; only ~10-20% needs an actual code fix.

4. **Write fingerprint rules.** Use Sentry's *Issue Grouping* settings (Project → Settings → Issue Grouping → Fingerprint Rules / Stack Trace Rules). Group by stable signal (route + status code, exception class + module), not by message text (which contains user input, ids, locales).

5. **Write inbound filters.** Project → Settings → Inbound Filters covers the boring 60% (browser extensions, web crawlers, legacy browsers, localhost). Turn ALL of them on by default for browser projects. Events they drop are discarded at ingestion and do not count against quota, so this is a free win.

6. **Write before-send filters in SDK init.** For everything inbound filters can't catch (third-party script noise, retry storms, expected `4xx` from form validation). Before-send returns `null` to drop. This is where the surgical work happens.

7. **Set sample rates by event type and tier.** `tracesSampleRate`, `profilesSampleRate`, `replaysSessionSampleRate`, and `replaysOnErrorSampleRate` are configured separately; note that `profilesSampleRate` applies only to transactions that were already sampled by `tracesSampleRate`. Keep the error `sampleRate` at 1.0 (you want all errors); sample transactions, profiles, and replays. Use the formulas in the Performance section.

8. **Tune severity mapping.** Most teams leave every alert at default. Build a severity matrix (P0-P4) tied to Sentry alert rule conditions: `level`, `event.tags`, `release.health.crash_free_rate`, frequency thresholds, regressions. P0 pages PagerDuty; P3/P4 only post to Slack.

9. **Wire release-health gates.** `crash-free-sessions < 99.5%` blocks rollout. `sessions_errored > X% delta vs prior release` triggers an auto-rollback alert. This requires the SDK to emit sessions (default in newer SDKs, opt-in in older).

10. **Lock down sourcemaps.** Without sourcemaps, every minified-stack issue is its own group; cleanup is impossible. Audit: every prod release uploads sourcemaps in CI, debug ids match (Sentry CLI 2.x+), and `Sentry.init()` has a stable `release` value derived from the same git sha.

11. **Right-size integrations.** Slack throttles posting at roughly 1 message per second per channel, so alert bursts get delayed or dropped; route P2/P3 to a digest channel, not a live channel. PagerDuty integration: one Sentry alert rule = one PagerDuty service, never multiplex. Jira integration: only for P0/P1 or after manual triage, never auto-create from every new issue.

12. **Implement the Inbox hygiene weekly playbook** (see deep section). Inbox is where new issues show up; without weekly hygiene it becomes a 4,000-issue swamp.

13. **Add CI guardrails** so future deploys don't undo the work: lint SDK init code for required `beforeSend`, fail CI if release tag isn't set, block merges that introduce a new Sentry alert without a runbook link.

14. **Schedule the recurring audit.** Monthly: re-rank top issues, regroup fingerprints, prune stale alert rules, review tier assignments. Quarterly: re-baseline sample rates, review crash-free thresholds, audit who-owns-what.
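
Steps 6 and 7 meet in the SDK init call. The sketch below uses the Python SDK's `before_send(event, hint)` hook and its `sample_rate`, `traces_sample_rate`, and `profiles_sample_rate` options. The `ExpectedClientError` class, the `init_options` helper, and the per-tier numbers are hypothetical placeholders for illustration; they are not values from this playbook's Performance formulas.

```python
from typing import Any, Dict, Optional

Event = Dict[str, Any]

# Placeholder rates per tier (assumption for illustration only).
# profiles_sample_rate is applied on top of traces_sample_rate.
TIER_RATES: Dict[str, Dict[str, float]] = {
    "T0": {"traces_sample_rate": 0.25, "profiles_sample_rate": 0.5},
    "T1": {"traces_sample_rate": 0.10, "profiles_sample_rate": 0.25},
    "T2": {"traces_sample_rate": 0.02, "profiles_sample_rate": 0.0},
    "T3": {"traces_sample_rate": 0.0, "profiles_sample_rate": 0.0},
}


class ExpectedClientError(Exception):
    """Hypothetical app exception for expected 4xx responses (e.g. form validation)."""

    def __init__(self, status_code: int, message: str = "") -> None:
        super().__init__(message)
        self.status_code = status_code


def before_send(event: Event, hint: Dict[str, Any]) -> Optional[Event]:
    """Return None to drop the event, or the event itself to send it."""
    exc_info = hint.get("exc_info")  # (type, value, traceback) when an exception was captured
    if exc_info is not None:
        exc = exc_info[1]
        if isinstance(exc, ExpectedClientError) and 400 <= exc.status_code < 500:
            return None  # expected client error: not actionable, drop at source
    return event


def init_options(tier: str, release: str) -> Dict[str, Any]:
    """Build keyword arguments for sentry_sdk.init() from the project tier."""
    return {
        "release": release,      # same git sha the sourcemap/CI step uses
        "sample_rate": 1.0,      # keep every error
        "before_send": before_send,
        **TIER_RATES[tier],
    }
```

The returned dict would be passed as `sentry_sdk.init(dsn=..., **init_options("T0", git_sha))`. The JavaScript SDK's `beforeSend` hook follows the same contract, returning `null` to drop an event.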

## Fingerprint Customization Recipes

Sentry's default grouping is stack-trace-based. It works ~70% of the time. The remaining 30% is where alert fatigue lives. Below are battle-tested fingerprint rules. Apply via *Project Settings → Issue Grouping → Fingerprint Rules*.

**Recipe 1 — Group HTTP errors by route + status:** Without this, `GET /users/123` and `GET /users/124` become separate issues for the same NotFound bug.
```
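# Assumes the SDK sets an `http.status_code` tag; fingerprint values are comma-separated.
error.type:HTTPError tags.http.status_code:404 -> http-404, {{ transaction }}
error.type:HTTPError tags.http.status_code:5* -> http-5xx, {{ transaction }}, {{ tags.http.status_code }}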
```

**Recipe 2 — Group third-party SDK errors under one umbrella:** Stripe, Segment, Intercom, Datadog RUM, etc. — their errors are not yours, but they fire constantly.
```
stack.module:"node_modules/stripe/*" -> third-party-stripe
stack.module:"node_modules/@segment/*" -> third-party-segment
stack.abs_path:"*intercom*" -> third-party-intercom
```

**Recipe 3 — Collapse network errors by error class, not message:** `fetch failed: ECONNRESET 10.0.0.42:443` and `fetch failed: ECONNRESET 10.0.0.43:443` are the same bug.
```
# Fingerprint rules cannot extract substrings; write one glob rule per error code.
error.type:"NetworkError" error.value:"*ECONNRESET*" -> network-econnreset
```
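
When the error code has to be pulled out of the message itself, an SDK-side alternative is to set `event["fingerprint"]` inside `before_send`. This is a minimal sketch assuming the Python SDK's event shape (`event["exception"]["values"]` holding dicts with `type` and `value`); the function name and the `network` prefix are illustrative choices, and the same idea carries over to `beforeSend` in the JavaScript SDK.

```python
import re
from typing import Any, Dict, Optional

Event = Dict[str, Any]

# Matches an uppercase word starting with "E", such as ECONNRESET or ETIMEDOUT.
ERRNO_RE = re.compile(r"\bE[A-Z]{3,}\b")


def fingerprint_network_errors(event: Event, hint: Dict[str, Any]) -> Optional[Event]:
    """Group NetworkError events by error code instead of by message text."""
    values = (event.get("exception") or {}).get("values") or []
    if values:
        last = values[-1]  # the exception that was actually raised
        if last.get("type") == "NetworkError":
            match = ERRNO_RE.search(last.get("value") or "")
            code = match.group(0).lower() if match else "unknown"
            # Host and port are ignored, so both example messages share one issue.
            event["fingerprint"] = ["network", code]
    return event
```

Events of any other type pass through unchanged and keep Sentry's default grouping.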

**Recipe 4 — Split a megaissue by environment:** Sometimes one issue is actually three: the staging variant, the prod variant, the canary variant.
```
error.type:DatabaseError -> {{ default }}, {{ tags.environment }}
```

**Recipe 5 — Group browser extension noise:** Inbound filter catches the obvious ones. Fingerprint catches the rest.
```
stack.abs_path:"*chrome-extension*" -> browser-extension
stack.abs_path:"*moz-extension*" -> browser-extension
stack.abs_path:"*safari-extension*" -> browser-extension
```

**Recipe 6 — Group retries as one issue:** Background jobs that retry 5 times shouldn't create 5 events. Tag the event with `retry_attempt` and group on attempt 1 only.
```
tags.retry_atte