sentry-alert-tuner

TotalClaw author: kn7d5eszdfwftk153ymhdm4qhs83qsqy, v1.0.0

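Reduce Sentry alert fatigue by thoroughly tuning issue grouping, fingerprint rules, severity mapping, sample rates, before-send filters, sourcemap pipelines, and release-health gates. Acts as a senior SRE who has maintained Sentry installations at unicorn-scale traffic, where a single bad deploy can trigger 80,000 alerts. Covers Sentry SaaS and self-hosted (Sentry 24.x), Issues and Performance, Replay and Profiling, integration rate limits (Slack, PagerDuty, Opsgenie, Jira), and release-health adoption / crash-free session gates. Builds an Inbox hygiene playbook that survives staff turnover. Use when alerts are noisy, the on-call rotation hates Sentry, the bill keeps climbing, or issue counts are unreadable. Triggers on "sentry", "sentry alerts", "alert fatigue", "fingerprint", "sentry inbox", "issue grouping", "beforeSend", "sample rate", "tracesSampleRate", "profilesSampleRate", "release health", "sourcemap", "crash-free", "sentry noise", "sentry bill", "sentry tuning".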

Installation / Download

TotalClaw CLI (recommended)
totalclaw install totalclaw:kn7d5eszdfwftk153ymhdm4qhs83qsqy~sentry-alert-tuner
cURL (direct download, no login required)
curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Akn7d5eszdfwftk153ymhdm4qhs83qsqy~sentry-alert-tuner/file -o sentry-alert-tuner.md
## Original text

# Sentry Alert Tuner

Tune a Sentry installation so that every alert that fires is worth a human looking at it. Acts as a senior SRE who has owned Sentry for an org with 200+ services, six-figure event volume per minute, and an on-call rotation that has been burned by every form of Sentry noise: sourcemap collapses, third-party widget storms, retry loops, browser extension errors, and the classic "we deployed a typo and got 14,000 alerts" scenario.

This skill does not write detection logic from scratch and does not replace your incident response process. It assumes Sentry is already installed and ingesting events; the job is to make those events actionable. Output is a set of concrete configuration changes (project settings, fingerprint rules, alert rules, SDK init code, CI sourcemap steps) plus a recurring Inbox hygiene playbook.

## Usage

Invoke when:

- The on-call rotation says "I just mute Sentry now"
- A single deploy generated more alerts than humans on the team
- Issue counts are dominated by browser extension noise or third-party scripts
- The Sentry bill jumped 3x without a traffic change
- New errors are buried under the same 12 long-tail issues from 2023
- Alert severities all look the same (everything is P2)
- Slack #alerts is unreadable; people muted the channel
- A postmortem revealed the real error was filed but auto-resolved or grouped into a megaissue

**Basic invocations:**
> Tune Sentry for our React + Django stack — alerts are 90% noise
> We just hit our event quota mid-month, sample rates need rethinking
> Build a fingerprint ruleset so retries don't spawn 400 issues per outage
> Set up release-health gates so a bad deploy auto-pauses rollout

## Inputs Required

- Sentry org slug + project list (or a project export)
- Stack: SDKs in use (browser, Node, Python, Go, Java, mobile) and their versions
- Current event volume per project per day (last 30 days)
- Current Issues count per project, ranked by event count
- Current alert rules and integration list (Slack channels, PagerDuty services)
- Release cadence and the existing release-health thresholds (if any)
- Sourcemap pipeline status: which projects upload, via which CI step
- The five most-hated noisy issues (the ones the team has muted manually)
- Any compliance constraints on PII scrubbing (GDPR, HIPAA, PCI)

## Workflow

1. **Pull the inventory.** Use the Sentry API (`/api/0/organizations/{org}/projects/`, `/api/0/organizations/{org}/issues/`, `/api/0/projects/{org}/{proj}/rules/`) to dump every project, alert rule, integration, environment, and the top 200 issues by event volume per project. Cache locally as JSON; the audit reruns weekly.

2. **Classify each project by tier.** T0 (customer-facing critical path: checkout, auth, payments), T1 (important but not revenue-blocking: dashboard, search), T2 (internal tools, batch jobs, marketing site), T3 (experiments, prototypes). Tier, not project type, dictates sample rate, alert routing, and release-health strictness.

3. **Audit the top 50 issues per project.** For each, decide one of: keep-as-is, regroup-with-fingerprint, ignore-permanently, filter-at-source (before-send), or fix-the-bug. Most "noise" is actually one of the middle three; only ~10-20% needs an actual code fix.

4. **Write fingerprint rules.** Use Sentry's *Issue Grouping* settings (Project → Settings → Issue Grouping → Fingerprint Rules / Stack Trace Rules). Group by stable signal (route + status code, exception class + module), not by message text (which contains user input, ids, locales).

5. **Write inbound filters.** Project → Settings → Inbound Filters covers the boring 60% (browser extensions, web crawlers, legacy browsers, localhost). Turn ALL of them on by default for browser projects. Filtered events are dropped at ingestion and do not count against quota, so this is a free win.

6. **Write before-send filters in SDK init.** Use these for everything inbound filters can't catch (third-party script noise, retry storms, expected `4xx` from form validation). The before-send callback returns `null` (`None` in Python) to drop an event. This is where the surgical work happens.

7. **Set sample rates by event type and tier.** `tracesSampleRate`, `profilesSampleRate`, `replaysSessionSampleRate`, and `replaysOnErrorSampleRate` are each configured separately; note that `profilesSampleRate` applies only to transactions that were already sampled. Keep the error sample rate at 1.0 (you want all errors); transactions and replays are sampled. Use the formulas in the Performance section.

8. **Tune severity mapping.** Most teams leave every alert at default. Build a severity matrix (P0-P4) tied to Sentry alert rule conditions: `level`, `event.tags`, `release.health.crash_free_rate`, frequency thresholds, regressions. P0 pages PagerDuty; P3/P4 only post to Slack.

9. **Wire release-health gates.** `crash-free-sessions < 99.5%` blocks rollout. `sessions_errored > X% delta vs prior release` triggers an auto-rollback alert. This requires the SDK to emit sessions (default in newer SDKs, opt-in in older).

10. **Lock down sourcemaps.** Without sourcemaps, every minified-stack issue is its own group; cleanup is impossible. Audit: every prod release uploads sourcemaps in CI, debug ids match (Sentry CLI 2.x+), and `Sentry.init()` has a stable `release` value derived from the same git sha.

11. **Right-size integrations.** Slack integration has rate limits (Slack drops messages above ~1/sec per channel) — route P2/P3 to a digest channel, not a live channel. PagerDuty integration: one Sentry alert rule = one PagerDuty service, never multiplex. Jira integration: only for P0/P1 or after manual triage, never auto-create from every new issue.

12. **Implement the Inbox hygiene weekly playbook** (see deep section). Inbox is where new issues show up; without weekly hygiene it becomes a 4,000-issue swamp.

13. **Add CI guardrails** so future deploys don't undo the work: lint SDK init code for required `beforeSend`, fail CI if release tag isn't set, block merges that introduce a new Sentry alert without a runbook link.

14. **Schedule the recurring audit.** Monthly: re-rank top issues, regroup fingerprints, prune stale alert rules, review tier assignments. Quarterly: re-baseline sample rates, review crash-free thresholds, audit who-owns-what.
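
The release-health gate from step 9 can be sketched as a small decision function. This is a minimal illustration, not Sentry's own logic: `ReleaseSessions`, `release_gate`, and the `max_errored_delta` parameter are hypothetical names, the session counts are assumed to come from whatever release-health query your CI already runs, and the "X% delta" is assumed to mean percentage points.

```python
from dataclasses import dataclass


@dataclass
class ReleaseSessions:
    """Session counts for one release (hypothetical container, not a Sentry type)."""
    total: int
    crashed: int
    errored: int


def crash_free_rate(s: ReleaseSessions) -> float:
    """Percentage of sessions that did not crash."""
    if s.total == 0:
        return 100.0  # no sessions yet, so nothing to judge
    return 100.0 * (s.total - s.crashed) / s.total


def errored_rate(s: ReleaseSessions) -> float:
    """Percentage of sessions that recorded at least one error."""
    if s.total == 0:
        return 0.0
    return 100.0 * s.errored / s.total


def release_gate(current: ReleaseSessions, previous: ReleaseSessions,
                 max_errored_delta: float, min_crash_free: float = 99.5) -> str:
    """Return "block", "rollback-alert", or "proceed" for the current release."""
    if crash_free_rate(current) < min_crash_free:
        return "block"
    # Assumption: the "X% delta" is measured in percentage points.
    if errored_rate(current) - errored_rate(previous) > max_errored_delta:
        return "rollback-alert"
    return "proceed"
```

A CI step could call `release_gate(current, previous, max_errored_delta=1.0)` after the canary has served enough sessions and fail the pipeline on anything other than `"proceed"`; the value 1.0 is only a placeholder for the team's own X.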

## Fingerprint Customization Recipes

Sentry's default grouping is stack-trace-based. It works ~70% of the time. The remaining 30% is where alert fatigue lives. Below are battle-tested fingerprint rules. Apply via *Project Settings → Issue Grouping → Fingerprint Rules*.

**Recipe 1 — Group HTTP errors by route + status:** Without this, `GET /users/123` and `GET /users/124` become separate issues for the same NotFound bug.
```
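# Assumes the SDK sets an http.status_code tag; fingerprint parts are comma-separated.
error.type:HTTPError tags.http.status_code:404 -> {{ transaction }}, 404
error.type:HTTPError tags.http.status_code:5* -> {{ transaction }}, {{ tags.http.status_code }}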
```
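
The same route-plus-status grouping can also be applied in the SDK by setting `event["fingerprint"]` inside a before-send hook. The sketch below is written for the Python SDK's `before_send(event, hint)` callback; the function name `group_http_errors` is hypothetical, and it assumes the status code is attached as a tag named `http.status_code` and that `transaction` holds the parameterized route.

```python
def group_http_errors(event, hint):
    """before_send hook: fingerprint HTTPError events by route and status."""
    values = (event.get("exception") or {}).get("values") or []
    if not values or values[-1].get("type") != "HTTPError":
        return event  # leave every other event on default grouping

    tags = event.get("tags") or {}
    if not isinstance(tags, dict):
        tags = dict(tags)  # some payloads carry tags as [key, value] pairs
    # Assumption: the status code was attached as a tag named "http.status_code".
    status = tags.get("http.status_code")
    route = event.get("transaction")  # parameterized route, e.g. "/users/{id}"

    if status is not None and route:
        event["fingerprint"] = [route, str(status)]
    return event
```

It would be registered with `sentry_sdk.init(dsn=..., before_send=group_http_errors)`. Events that lack a route or status keep Sentry's default grouping.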

**Recipe 2 — Group third-party SDK errors under one umbrella:** Stripe, Segment, Intercom, Datadog RUM, etc. — their errors are not yours, but they fire constantly.
```
stack.module:"node_modules/stripe/*" -> third-party-stripe
stack.module:"node_modules/@segment/*" -> third-party-segment
stack.abs_path:"*intercom*" -> third-party-intercom
```
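
For a Python service, a comparable umbrella grouping could be done in a before-send hook by inspecting the crashing frame's path. This is a sketch under assumptions: the glob table, the `site-packages` layout, and the function names are illustrative, and unlike the server-side rules above it only looks at the innermost frame, so your own bugs that merely pass through vendor code keep their default grouping.

```python
from fnmatch import fnmatchcase

# Hypothetical glob -> umbrella label table for a Python service.
THIRD_PARTY_GLOBS = {
    "*/site-packages/stripe/*": "third-party-stripe",
    "*/site-packages/segment/*": "third-party-segment",
}


def crashing_frame_path(event):
    """Path of the innermost frame of the most recent exception, or ""."""
    values = (event.get("exception") or {}).get("values") or []
    if not values:
        return ""
    frames = (values[-1].get("stacktrace") or {}).get("frames") or []
    if not frames:
        return ""
    # Sentry lists frames oldest call first, so the last one raised the error.
    return frames[-1].get("abs_path") or frames[-1].get("filename") or ""


def group_third_party(event, hint):
    """before_send hook: collapse vendor-raised errors under one fingerprint."""
    path = crashing_frame_path(event)
    for pattern, label in THIRD_PARTY_GLOBS.items():
        if fnmatchcase(path, pattern):
            event["fingerprint"] = [label]
            break
    return event
```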

**Recipe 3 — Collapse network errors by error class, not message:** `fetch failed: ECONNRESET 10.0.0.42:443` and `fetch failed: ECONNRESET 10.0.0.43:443` are the same bug.
```
# Rule variables cannot be transformed, so match each errno with a value glob.
error.type:NetworkError error.value:"*ECONNRESET*" -> network, ECONNRESET
```
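
Extracting the errno with a regular expression is straightforward on the SDK side. A minimal before-send sketch for the Python SDK follows; the function name `group_network_errors` and the `"unknown"` fallback label are assumptions, and the pattern is the `E[A-Z]+` idea from this recipe with word boundaries added.

```python
import re

# Uppercase errno-style tokens such as ECONNRESET or ETIMEDOUT.
ERRNO_RE = re.compile(r"\bE[A-Z]+\b")


def group_network_errors(event, hint):
    """before_send hook: fingerprint NetworkError events by errno, not message."""
    values = (event.get("exception") or {}).get("values") or []
    if not values:
        return event
    exc = values[-1]  # most recent exception in the chain
    if exc.get("type") != "NetworkError":
        return event

    match = ERRNO_RE.search(exc.get("value") or "")
    code = match.group(0) if match else "unknown"  # fallback label is an assumption
    event["fingerprint"] = ["network", code]
    return event
```

With this hook, the two `ECONNRESET` messages above land in one issue because the host and port never reach the fingerprint.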

**Recipe 4 — Split a megaissue by environment:** Sometimes one issue is actually three: the staging variant, the prod variant, the canary variant.
```
# Assumes the environment is available to the rule as a tag.
error.type:DatabaseError -> {{ default }}, {{ tags.environment }}
```

**Recipe 5 — Group browser extension noise:** Inbound filter catches the o