datadog-monitor-designer

TotalClaw, author kn7d5eszdfwftk153ymhdm4qhs83qsqy, v1.0.0

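Design Datadog monitors that catch real production issues without paging noise. Covers SLO-based multi-window multi-burn-rate alerting, the choice between threshold/anomaly/forecast/outlier/composite monitor types, tag-driven routing, downtime windows, per-service-tier monitor template inheritance, notification message engineering, and runbook links. Acts as a senior SRE who owned more than 4,000 Datadog monitors and cut them to 600 while still catching every real incident. Knows the Datadog billing levers (custom metrics, indexed logs, ingested traces), how "from"/"group_by" interact with no-data evaluation, the difference between simple alerts and multi alerts, and which monitor types are silently expensive. Use when monitors page on cosmetic signals, when a service tier needs a consistent monitor set, when SLOs need to be turned into burn-rate alerts, or when monitor sprawl has become unmanageable. Triggers on "datadog", "datadog monitor", "monitor design", "slo", "burn rate", "anomaly monitor", "forecast monitor", "composite monitor", "alert routing", "runbook", "downtime", "noisy monitor", "monitor template", "service tier", "T0", "T1".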

Install / Download

TotalClaw CLI (recommended)

totalclaw install totalclaw:kn7d5eszdfwftk153ymhdm4qhs83qsqy~datadog-monitor-designer

cURL (direct download, no login required)

curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Akn7d5eszdfwftk153ymhdm4qhs83qsqy~datadog-monitor-designer/file -o datadog-monitor-designer.md


# Datadog Monitor Designer

Design Datadog monitors that page humans only when paging is justified. Acts as a senior SRE who has run a Datadog org with 4,000 monitors, watched 90% of pages get auto-resolved with no action, and rebuilt the monitor estate around SLOs and tier templates until the false-positive rate fell from 70% to 8%.

This skill builds and tunes monitors. It does not replace your incident response process, your SLO design process, or your service catalog — it consumes them. Outputs are concrete monitor JSON (importable via Datadog API or Terraform `datadog_monitor` resource), notification message templates, downtime windows, and a per-service monitor inventory tied to ownership tags.
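As a rough illustration of what "concrete monitor JSON" can look like, the sketch below assembles one metric-threshold monitor as a plain dict and serializes it. The top-level fields (`name`, `type`, `query`, `message`, `tags`, `options`) follow the monitor JSON shape as I understand it from the Monitors API. The metric names `app.requests.errors` and `app.requests.total`, the helper name `build_threshold_monitor`, and the example URL are assumptions made for illustration only.

```python
import json


def build_threshold_monitor(service: str, env: str, team: str, tier: str,
                            critical: float, runbook_url: str) -> dict:
    """Build a 5xx-ratio threshold monitor definition as a plain dict."""
    scope = f"env:{env},service:{service}"
    # Hypothetical metric names: substitute the metrics your service emits.
    query = (
        f"sum(last_5m):sum:app.requests.errors{{{scope}}}.as_count() / "
        f"sum:app.requests.total{{{scope}}}.as_count() > {critical}"
    )
    # Template variables stay as literal text; Datadog fills them in at alert time.
    message = "\n".join([
        "{{#is_alert}}",
        f"[{tier}] {service} 5xx ratio is " + "{{value}} (threshold {{threshold}}).",
        f"Runbook: {runbook_url}",
        f"@pagerduty-{service}",
        "{{/is_alert}}",
    ])
    return {
        "name": f"[{tier}] {service} availability (5xx ratio)",
        "type": "query alert",
        "query": query,
        "message": message,
        "tags": [f"service:{service}", f"env:{env}", f"team:{team}", f"tier:{tier}"],
        "options": {
            # Must match the threshold embedded in the query string.
            "thresholds": {"critical": critical},
            # A ratio of sparse counters should not page on missing data.
            "notify_no_data": False,
        },
    }


payload = build_threshold_monitor(
    service="checkout", env="prod", team="payments", tier="T0",
    critical=0.005, runbook_url="https://example.com/runbooks/checkout-5xx",
)
print(json.dumps(payload, indent=2))
```

Keeping the definition as data makes it easy to feed the same dict to an API client or to render it into a Terraform `datadog_monitor` block.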

## Usage

Invoke when:

- Monitors page on every minor blip; on-call comp is rising
- A new T0 service is shipping and needs its monitor pack from day one
- An existing service has 80 monitors and only 6 ever fire
- SLOs were defined in a doc but never wired to alerts
- A postmortem revealed the right monitor existed but conditioned on the wrong tag set
- The Datadog bill jumped because of `avg by:` cardinality on a custom metric used in monitors
- Monitor message bodies are unparseable; runbook links are missing or rotted
- A monitor fires across all environments because nobody scoped it to `env:prod`

**Basic invocations:**
> Design a T0 monitor pack for our checkout service
> Convert our 99.9% latency SLO into burn-rate alerts
> Audit our 4,000 monitors and tell me which 80% are noise
> Write a notification template the on-call doesn't have to decode

## Inputs Required

- Datadog org + API key and application key (or a monitor export via the Terraform `datadog_monitor` resource / API dump)
- Service catalog: name, tier (T0/T1/T2), team owner, runbook URL, dashboard URL
- SLOs in scope (objective, window, error budget) — Datadog SLO objects or external doc
- Routing targets: PagerDuty service per team, Slack channels, on-call schedule
- Existing tag taxonomy: `env`, `service`, `team`, `criticality`, `version`, `region`
- Recent incident list (last 90 days) — for monitor coverage analysis
- Custom metric inventory + cost (Datadog → Plan & Usage → Custom Metrics)
- Constraints: budget limit, custom metric cap, log indexing limits
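One way these inputs might fit together: the sketch below joins a service catalog against a monitor export to surface monitors that belong to no catalogued service and services that have no monitor at all. The `CatalogEntry` field names and the `join_inputs` helper are hypothetical. The monitor dicts are assumed to carry `name` and `tags` keys, as a monitor export typically does.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class CatalogEntry:
    """One row of the service catalog (field names are illustrative)."""
    service: str
    tier: str
    team: str
    runbook_url: str
    dashboard_url: str


def tag_value(tags: List[str], key: str) -> Optional[str]:
    """Return the value of the first `key:value` tag, or None if absent."""
    prefix = f"{key}:"
    for tag in tags:
        if tag.startswith(prefix):
            return tag[len(prefix):]
    return None


def join_inputs(catalog: List[CatalogEntry], monitors: List[dict]) -> Dict[str, list]:
    """Find monitors with no catalogued service and services with no monitor."""
    known = {entry.service for entry in catalog}
    covered = set()
    orphans = []
    for monitor in monitors:
        service = tag_value(monitor.get("tags", []), "service")
        if service in known:
            covered.add(service)
        else:
            # Missing or unknown service tag: send to the triage list.
            orphans.append(monitor["name"])
    return {
        "orphan_monitors": orphans,
        "uncovered_services": sorted(known - covered),
    }


catalog = [
    CatalogEntry("checkout", "T0", "payments",
                 "https://example.com/rb/checkout", "https://example.com/db/checkout"),
    CatalogEntry("search", "T1", "discovery",
                 "https://example.com/rb/search", "https://example.com/db/search"),
]
monitors = [
    {"name": "checkout 5xx", "tags": ["service:checkout", "env:prod"]},
    {"name": "mystery cpu", "tags": ["env:prod"]},
]
result = join_inputs(catalog, monitors)
```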

## Workflow

1. **Inventory existing monitors.** API: `GET /api/v1/monitor` (paginate). Tag each monitor with: type, service, tier, fired in last 90d?, ack rate, false-positive rate (acks resolved with "no action"), downstream routing. Anything that hasn't fired in 90 days and isn't tied to a documented invariant is a deletion candidate.

2. **Classify by tier.** Cross-reference the service catalog. Every monitor should be tagged `tier:T0|T1|T2|T3` and `team:<owner>`. Monitors without a tier or owner go to a triage list.

3. **Map SLOs to burn-rate alerts.** For each SLO, generate the multi-window multi-burn-rate (MWMB) monitor pair: a fast-burn alert (high rate, short window) and a slow-burn alert (low rate, long window). See the SLO recipe section.

4. **Apply tier templates.** Each tier has a base monitor set: availability, latency, saturation, error rate, dependency health. Generate the tier template with placeholder substitution for service name and metric source.

5. **Pick the right monitor type per signal.** Threshold for known SLAs and SLOs, anomaly for diurnal seasonal metrics, forecast for trend-based saturation (disk, certs, quota), outlier for fleets where one host misbehaves, composite for "A AND B for X minutes" without doubling pages.

6. **Engineer notification messages.** Every monitor message has the same eight required elements (see Notification Anatomy). Use Datadog template variables (`{{value}}`, `{{host.name}}`, `{{#is_alert}}…{{/is_alert}}`) for live data; static text for runbook URL and severity.

7. **Wire tag-driven routing.** `@pagerduty-<service>` for P0/P1, `@slack-<channel>` for P2/P3. Routing lives in the message body, wrapped in `{{#is_alert}}…{{/is_alert}}` so recovery notifications don't re-page.

8. **Set downtime windows.** Deploy windows, maintenance windows, known-noisy windows. Use `Downtime` API with scope filters (`env:prod service:checkout`); document expected duration.

9. **Configure no-data behavior.** `notify_no_data: true` is correct for "this metric should always have data" (heartbeats, uptime). For sparse metrics, `notify_no_data: false` plus a separate uptime monitor. Never default to `true` on every monitor — it pages on deploys.

10. **Group by stable dimensions only.** `group by: host` on auto-scaling fleets explodes alert count on scale-up. Group by `service`, `env`, `cluster`, `customer-tier`. Avoid `host` and `pod_name` in monitor groupings unless the monitor is host-specific.

11. **Test the monitor.** Dry-run with the `Test Notifications` button in the monitor editor, and check the definition itself with `POST /api/v1/monitor/validate`. Verify routing, message rendering, runbook link, severity. Fire-drill once per quarter by deliberately tripping the threshold in a synthetic.

12. **Document the monitor.** Each monitor has a `runbook_url` tag and the runbook link in the message. The runbook covers: what the monitor means, what to check first, who to escalate to, common causes, common false positives.

13. **Schedule audits.** Monthly: stale monitor pruning, false-positive review. Quarterly: tier template refresh, SLO re-baselining, cost review.
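The burn-rate arithmetic behind step 3 can be sketched as follows. A burn rate of 1 spends the error budget exactly over the SLO window, so the rate that spends a given fraction of the budget inside an alert window is `fraction * slo_window / alert_window`. The pairing used here (2% of budget in 1 hour for fast burn, 5% in 6 hours for slow burn) is an assumption taken from the commonly cited Google SRE Workbook values, not something this skill prescribes. Function names are illustrative.

```python
from typing import Dict, List

HOURS_PER_DAY = 24


def burn_rate(budget_fraction: float, slo_window_days: int,
              alert_window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget in the alert window."""
    return budget_fraction * slo_window_days * HOURS_PER_DAY / alert_window_hours


def error_ratio_threshold(slo_target: float, rate: float) -> float:
    """Error ratio at which the budget burns `rate` times faster than sustainable."""
    return rate * (1.0 - slo_target)


def mwmb_pair(slo_target: float, slo_window_days: int = 30) -> List[Dict]:
    """Return fast-burn and slow-burn alert parameters for one SLO."""
    # Assumed pairing: (name, fraction of budget spent, alert window in hours).
    specs = [("fast-burn", 0.02, 1), ("slow-burn", 0.05, 6)]
    alerts = []
    for name, fraction, window_hours in specs:
        rate = burn_rate(fraction, slo_window_days, window_hours)
        alerts.append({
            "name": name,
            "window_hours": window_hours,
            "burn_rate": rate,
            "error_ratio_threshold": error_ratio_threshold(slo_target, rate),
        })
    return alerts


pair = mwmb_pair(slo_target=0.999)
for alert in pair:
    print(alert["name"], round(alert["burn_rate"], 2),
          round(alert["error_ratio_threshold"], 5))
```

For a 99.9% SLO over 30 days, the two burn rates work out to roughly 14.4 and 6, which translate to error-ratio thresholds of about 1.44% over 1 hour and 0.6% over 6 hours.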

## Monitor Type Decision Tree

Datadog offers well over a dozen monitor types, yet most teams use Threshold for everything. That's how you get 4,000 monitors that don't catch real issues. The nine branches below cover the common signal shapes; pick the type that matches the signal.

```
What are you alerting on?
├── A static SLA / SLO threshold (latency < 500ms, error rate < 1%)
│     → Metric Threshold monitor
│
├── A trend that crosses a threshold over time (disk fills, quota exhaust, cert expiry)
│     → Forecast monitor (linear or seasonal forecast)
│
├── A metric with a strong daily/weekly seasonality (traffic, signups)
│     → Anomaly monitor (agile, robust, or basic algorithm)
│
├── One host/pod/instance behaving differently from its peers
│     → Outlier monitor (DBSCAN, MAD, scaledDBSCAN, or scaledMAD)
│
├── A condition that requires multiple signals to all be true
│     → Composite monitor (boolean combination, e.g. A && B, of existing monitors)
│
├── An event happening (deploy, security finding, audit log)
│     → Event monitor or Event-V2 monitor
│
├── A log pattern occurring at rate
│     → Log monitor (don't use Threshold on a log-based metric; Log monitor is cheaper)
│
├── An external endpoint being reachable
│     → Synthetic monitor (browser or API test)
│
└── A process / service running on a host
      → Process monitor (legacy) or Service Check monitor
```

**Rules of thumb:**
- Anomaly monitors are great for *unexpected* changes but terrible for known invariants. Use them on traffic, not error rate.
- Forecast monitors require >2 weeks of history; don't use on new metrics.
- Outlier monitors silently break when the fleet has <5 members. Set a min-host gate.
- Composite monitors don't multiply cost; they reduce alert count by AND'ing.
- Log monitors evaluate logs that are already indexed, so their cost follows indexed log volume, not metric count.
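The decision tree and the two gating rules of thumb (forecast history, outlier fleet size) could be encoded along these lines. This is a minimal sketch, not a Datadog API: the `Signal` enum, the `pick_monitor_type` helper, and the choice to fall back to a plain metric threshold when a gate fails are all assumptions made for illustration.

```python
from enum import Enum


class Signal(Enum):
    STATIC_THRESHOLD = "static SLA/SLO threshold"
    TREND = "trend toward a limit"
    SEASONAL = "seasonal metric"
    PEER_DEVIATION = "one member differs from its peers"
    MULTI_SIGNAL = "several conditions must hold together"
    EVENT = "event occurrence"
    LOG_PATTERN = "log pattern at rate"
    ENDPOINT = "external endpoint reachability"
    PROCESS = "process or service on a host"


MIN_FORECAST_HISTORY_DAYS = 14  # rule of thumb: more than 2 weeks of history
MIN_OUTLIER_FLEET_SIZE = 5      # rule of thumb: outliers break below 5 members

_DIRECT = {
    Signal.STATIC_THRESHOLD: "metric threshold",
    Signal.SEASONAL: "anomaly",
    Signal.MULTI_SIGNAL: "composite",
    Signal.EVENT: "event",
    Signal.LOG_PATTERN: "log",
    Signal.ENDPOINT: "synthetic",
    Signal.PROCESS: "service check",
}


def pick_monitor_type(signal: Signal, history_days: int = 0,
                      fleet_size: int = 0) -> str:
    """Map a signal shape to a monitor type, applying the two gates."""
    if signal is Signal.TREND:
        # Assumed fallback: a plain threshold until enough history exists.
        if history_days > MIN_FORECAST_HISTORY_DAYS:
            return "forecast"
        return "metric threshold"
    if signal is Signal.PEER_DEVIATION:
        # Assumed fallback: a plain threshold for fleets that are too small.
        if fleet_size >= MIN_OUTLIER_FLEET_SIZE:
            return "outlier"
        return "metric threshold"
    return _DIRECT[signal]


print(pick_monitor_type(Signal.TREND, history_days=30))
print(pick_monitor_type(Signal.PEER_DEVIATION, fleet_size=3))
```

Making the gates explicit constants keeps the rules of thumb reviewable in one place when a tier template is refreshed.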

## Service Tier Templates

Each tier ships with a fixed monitor pack. A new T0 service goes from zero to fully covered in 15 minutes by importing the template.

### T0 Critical (revenue path, auth, payments)

| Monitor | Type | Threshold | Window | Notify |
|---------|------|-----------|--------|--------|
| Availability (HTTP 5xx rate) | Metric Threshold | >0.5% over 5m | 5m | P0 → PagerDu