datadog-monitor-designer

ClawSkills author: kn7d5eszdfwftk153ymhdm4qhs83qsqy, v1.0.0

Design Datadog monitors that catch real production issues without paging on noise. Covers SLO-based monitoring with multi-window multi-burn-rate alerts, the decision between threshold/anomaly/forecast/outlier/composite monitor types, tag-driven routing, downtime windows, monitor template inheritance per service tier, notification message engineering, and runbook linking. Acts as a senior SRE who has owned 4,000+ Datadog monitors and pruned them down to 600 monitors that still catch every real incident. Knows the Datadog billing levers (custom metrics, indexed logs, ingested traces), how `from`/`group_by` interact with no-data evaluation, the difference between simple alerts and multi-alerts, and which monitor types are silently expensive. Use when monitors are paging on cosmetic blips, when a service tier needs a coherent monitor set, when SLOs need turning into burn-rate alerts, or when monitor sprawl is unmanageable. Triggers on "datadog", "datadog monitor", "monitor design", "slo", "burn rate", "anomaly monitor", "forecast monitor", "composite monitor", "alert routing", "runbook", "downtime", "noisy monitor", "monitor template", "service tier", "T0", "T1".

Installation / Download

TotalClaw CLI (recommended)

totalclaw install clawskills:kn7d5eszdfwftk153ymhdm4qhs83qsqy~datadog-monitor-designer

cURL (direct download, no login required)

curl -fsSL https://skills.taituai.com/api/skills/clawskills%3Akn7d5eszdfwftk153ymhdm4qhs83qsqy~datadog-monitor-designer/file -o datadog-monitor-designer.md

# Datadog Monitor Designer

Design Datadog monitors that page humans only when paging is justified. Acts as a senior SRE who has run a Datadog org with 4,000 monitors, watched 90% of pages get auto-resolved with no action, and rebuilt the monitor estate around SLOs and tier templates until the false-positive rate fell from 70% to 8%.

This skill builds and tunes monitors. It does not replace your incident response process, your SLO design process, or your service catalog — it consumes them. Outputs are concrete monitor JSON (importable via Datadog API or Terraform `datadog_monitor` resource), notification message templates, downtime windows, and a per-service monitor inventory tied to ownership tags.
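
As a rough illustration of the monitor JSON this skill emits, the sketch below builds one error-rate monitor definition in Python. The metric names, the `@pagerduty-checkout` handle, the `team` value, and the runbook URL are placeholders assumed for the example; the top-level fields (`name`, `type`, `query`, `message`, `tags`, `options`) follow the Datadog monitor API.

```python
import json


def build_availability_monitor(service: str, env: str, team: str,
                               pager_handle: str, runbook_url: str) -> dict:
    """Return a Datadog-style monitor definition for a 5xx error-rate alert."""
    scope = f"env:{env},service:{service}"
    # Ratio of errored requests to all requests over the last 5 minutes.
    # The metric names are assumptions; substitute whatever your service emits.
    query = (
        f"sum(last_5m):sum:trace.http.request.errors{{{scope}}}.as_count()"
        f" / sum:trace.http.request.hits{{{scope}}}.as_count() > 0.005"
    )
    # Plain (non f-) strings keep Datadog's {{...}} template variables literal.
    message = "\n".join([
        "{{#is_alert}}",
        "5xx rate for " + service + " is {{value}} (threshold {{threshold}}).",
        "Runbook: " + runbook_url,
        "@" + pager_handle,
        "{{/is_alert}}",
    ])
    return {
        "name": f"[T0] {service} 5xx rate above 0.5%",
        "type": "query alert",
        "query": query,
        "message": message,
        "tags": [f"service:{service}", f"env:{env}", f"team:{team}", "tier:T0"],
        "options": {
            "thresholds": {"critical": 0.005},  # should agree with the query
            "notify_no_data": False,
        },
    }


monitor = build_availability_monitor(
    service="checkout",
    env="prod",
    team="payments",                      # hypothetical owner
    pager_handle="pagerduty-checkout",    # hypothetical handle
    runbook_url="https://example.com/runbooks/checkout-5xx",
)
print(json.dumps(monitor, indent=2))
```

The printed JSON is the shape that would be sent to the monitor API or translated into a Terraform `datadog_monitor` resource; the threshold in `query` and in `options.thresholds` should be kept in agreement.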

## Usage

Invoke when:

- Monitors page on every minor blip; on-call compensation costs are rising
- A new T0 service is shipping and needs its monitor pack from day one
- An existing service has 80 monitors and only 6 ever fire
- SLOs were defined in a doc but never wired to alerts
- A postmortem revealed the right monitor existed but conditioned on the wrong tag set
- The Datadog bill jumped because of `avg by:` cardinality on a custom metric used in monitors
- Monitor message bodies are unparseable; runbook links are missing or rotted
- A monitor fires across all environments because nobody scoped it to `env:prod`

**Basic invocations:**
> Design a T0 monitor pack for our checkout service
> Convert our 99.9% latency SLO into burn-rate alerts
> Audit our 4,000 monitors and tell me which 80% are noise
> Write a notification template the on-call doesn't have to decode

## Inputs Required

- Datadog org plus API and application keys (or a monitor export via the Terraform `datadog_monitor` resource or an API dump)
- Service catalog: name, tier (T0/T1/T2), team owner, runbook URL, dashboard URL
- SLOs in scope (objective, window, error budget) — Datadog SLO objects or external doc
- Routing targets: PagerDuty service per team, Slack channels, on-call schedule
- Existing tag taxonomy: `env`, `service`, `team`, `criticality`, `version`, `region`
- Recent incident list (last 90 days) — for monitor coverage analysis
- Custom metric inventory + cost (Datadog → Plan & Usage → Custom Metrics)
- Constraints: budget limit, custom metric cap, log indexing limits

## Workflow

1. **Inventory existing monitors.** API: `GET /api/v1/monitor` (paginate). Tag each monitor with: type, service, tier, fired in last 90d?, ack rate, false-positive rate (acks resolved with "no action"), downstream routing. Anything that hasn't fired in 90 days and isn't tied to a documented invariant is a deletion candidate.

2. **Classify by tier.** Cross-reference the service catalog. Every monitor should be tagged `tier:T0|T1|T2|T3` and `team:<owner>`. Monitors without a tier or owner go to a triage list.

3. **Map SLOs to burn-rate alerts.** For each SLO, generate the multi-window multi-burn-rate (MWMB) monitor pair: a fast-burn alert (high rate, short window) and a slow-burn alert (low rate, long window). See the SLO recipe section.

4. **Apply tier templates.** Each tier has a base monitor set: availability, latency, saturation, error rate, dependency health. Generate the tier template with placeholder substitution for service name and metric source.

5. **Pick the right monitor type per signal.** Threshold for known SLAs and SLOs, anomaly for diurnal seasonal metrics, forecast for trend-based saturation (disk, certs, quota), outlier for fleets where one host misbehaves, composite for "A AND B for X minutes" without doubling pages.

6. **Engineer notification messages.** Every monitor message has the same eight required elements (see Notification Anatomy). Use Datadog template variables (`{{value}}`, `{{host.name}}`, `{{#is_alert}}…{{/is_alert}}`) for live data; use static text for the runbook URL and severity.

7. **Wire tag-driven routing.** `@pagerduty-<service>` for P0/P1, `@slack-<channel>` for P2/P3. Routing lives in the message body, wrapped in `{{#is_alert}}…{{/is_alert}}` so that recovery events don't re-page.

8. **Set downtime windows.** Deploy windows, maintenance windows, known-noisy windows. Use `Downtime` API with scope filters (`env:prod service:checkout`); document expected duration.

9. **Configure no-data behavior.** `notify_no_data: true` is correct for "this metric should always have data" (heartbeats, uptime). For sparse metrics, `notify_no_data: false` plus a separate uptime monitor. Never default to `true` on every monitor — it pages on deploys.

10. **Group by stable dimensions only.** `group by: host` on auto-scaling fleets explodes alert count on scale-up. Group by `service`, `env`, `cluster`, `customer-tier`. Avoid `host` and `pod_name` in monitor groupings unless the monitor is host-specific.

11. **Test the monitor.** Dry-run with `Test Notifications` button or API `POST /api/v1/monitor/{id}/notify`. Verify routing, message rendering, runbook link, severity. Fire-drill once per quarter via the Datadog `mute_status_handle` or by deliberately tripping the threshold in a synthetic.

12. **Document the monitor.** Each monitor has a `runbook_url` tag and the runbook link in the message. The runbook covers: what the monitor means, what to check first, who to escalate to, common causes, common false positives.

13. **Schedule audits.** Monthly: stale monitor pruning, false-positive review. Quarterly: tier template refresh, SLO re-baselining, cost review.
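
For step 3, the arithmetic behind a burn-rate pair can be sketched as follows. Burn rate is the observed error rate divided by the error budget (1 minus the SLO target), so a burn rate of 1 spends exactly the whole budget over the SLO period. The sketch assumes a 30-day SLO period and a 99.9% target, and reuses the 14.4x/1h and 6x/6h pairs from the T0 template; the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateAlert:
    name: str
    burn_rate: float     # multiple of the "exactly on budget" error rate
    window_hours: float  # evaluation window for the alert


def error_rate_threshold(slo_target: float, burn_rate: float) -> float:
    """Error rate that corresponds to the given burn rate."""
    return burn_rate * (1.0 - slo_target)


def budget_consumed(alert: BurnRateAlert, slo_period_days: int = 30) -> float:
    """Fraction of the total error budget spent if the burn lasts the whole window."""
    return alert.burn_rate * alert.window_hours / (slo_period_days * 24)


ALERTS = [
    BurnRateAlert("fast-burn", 14.4, 1),
    BurnRateAlert("slow-burn", 6.0, 6),
]

for alert in ALERTS:
    threshold = error_rate_threshold(0.999, alert.burn_rate)
    spent = budget_consumed(alert)
    print(f"{alert.name}: error rate > {threshold:.2%}, "
          f"budget spent in window = {spent:.1%}")
```

Under these assumptions the fast-burn alert corresponds to an error rate of about 1.44% and spends 2% of the monthly budget in one hour, while the slow-burn alert corresponds to about 0.6% and spends 5% in six hours.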

## Monitor Type Decision Tree

Datadog offers many monitor types, and the tree below covers the nine that matter most here; most teams still use Threshold for everything. That is how you end up with 4,000 monitors that miss real issues. Pick the type that matches the signal shape.

```
What are you alerting on?
├── A static SLA / SLO threshold (latency < 500ms, error rate < 1%)
│     → Metric Threshold monitor
│
├── A trend that crosses a threshold over time (disk fills, quota exhaust, cert expiry)
│     → Forecast monitor (linear or seasonal forecast)
│
├── A metric with a strong daily/weekly seasonality (traffic, signups)
│     → Anomaly monitor (agile, robust, or basic algorithm)
│
├── One host/pod/instance behaving differently from its peers
│     → Outlier monitor (DBSCAN, MAD, or their scaled variants scaledDBSCAN / scaledMAD)
│
├── A condition that requires multiple signals to all be true
│     → Composite monitor (AND of two metric monitors)
│
├── An event happening (deploy, security finding, audit log)
│     → Event monitor or Event-V2 monitor
│
├── A log pattern occurring at rate
│     → Log monitor (don't use Threshold on a log-based metric — Log monitor is cheaper)
│
├── An external endpoint being reachable
│     → Synthetic monitor (browser or API test)
│
└── A process / service running on a host
      → Process monitor (legacy) or Service Check monitor
```
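
When the tree is used during an audit, it can help to encode it as data so that monitors of the wrong type are flagged mechanically. The sketch below is one way to do that; the signal-shape keys and function names are invented for illustration and are not Datadog identifiers.

```python
# Signal shape (hypothetical audit label) -> monitor type suggested by the tree.
MONITOR_TYPE_BY_SIGNAL = {
    "static_threshold": "Metric Threshold",
    "trend_to_limit": "Forecast",
    "seasonal": "Anomaly",
    "peer_deviation": "Outlier",
    "multi_signal": "Composite",
    "event": "Event",
    "log_pattern": "Log",
    "external_endpoint": "Synthetic",
    "process_or_service": "Service Check",
}


def recommend_monitor_type(signal_shape: str) -> str:
    """Look up the monitor type the decision tree suggests for a signal shape."""
    try:
        return MONITOR_TYPE_BY_SIGNAL[signal_shape]
    except KeyError:
        raise ValueError(f"unknown signal shape: {signal_shape!r}") from None


def is_mismatched(signal_shape: str, current_type: str) -> bool:
    """True when an existing monitor uses a different type than the tree suggests."""
    return recommend_monitor_type(signal_shape) != current_type


# Example: a disk-fill alert implemented as a plain threshold gets flagged.
print(is_mismatched("trend_to_limit", "Metric Threshold"))
```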

**Rules of thumb:**
- Anomaly monitors are great for *unexpected* changes but terrible for known invariants. Use them on traffic, not error rate.
- Forecast monitors require >2 weeks of history; don't use on new metrics.
- Outlier monitors silently break when the fleet has <5 members. Set a min-host gate.
- Composite monitors don't multiply cost; they reduce alert count by AND'ing.
- Log monitors evaluate indexed logs, so their cost follows log indexing volume rather than metric count.
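
One possible way to build the min-host gate for an outlier monitor is a composite monitor that combines the outlier monitor with a second monitor that alerts only when the fleet is too small. The sketch below just builds the payload; the monitor IDs are made up, and it assumes both underlying monitors already exist. Composite queries reference monitors by ID and combine them with `&&`, `||`, and `!`.

```python
def build_gated_outlier_composite(name: str, outlier_monitor_id: int,
                                  small_fleet_monitor_id: int, message: str) -> dict:
    """Composite that fires only if the outlier monitor alerts AND the fleet is not small.

    small_fleet_monitor_id is assumed to be a monitor that alerts when the
    fleet has fewer than 5 members, so negating it acts as the gate.
    """
    if outlier_monitor_id == small_fleet_monitor_id:
        raise ValueError("composite needs two distinct monitors")
    return {
        "name": name,
        "type": "composite",
        "query": f"{outlier_monitor_id} && !{small_fleet_monitor_id}",
        "message": message,
    }


payload = build_gated_outlier_composite(
    name="checkout: host outlier (fleet >= 5)",
    outlier_monitor_id=111,      # hypothetical ID
    small_fleet_monitor_id=222,  # hypothetical ID
    message="{{#is_alert}}One host deviates from its peers. @slack-checkout-alerts{{/is_alert}}",
)
print(payload["query"])
```

Because the composite is the only monitor that notifies, the two underlying monitors can stay silent, which keeps the alert count at one per incident.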

## Service Tier Templates

Each tier ships with a fixed monitor pack. A new T0 service goes from zero to fully covered in 15 minutes by importing the template.

### T0 Critical (revenue path, auth, payments)

| Monitor | Type | Threshold | Window | Notify |
|---------|------|-----------|--------|--------|
| Availability (HTTP 5xx rate) | Metric Threshold | >0.5% over 5m | 5m | P0 → PagerDuty (urgent) |
| p99 Latency | Metric Threshold | >1.5x SLO over 10m | 10m | P1 → PagerDuty (high) |
| Error budget burn (fast) | SLO burn-rate | 14.4x burn over 1h | 1h | P0 → PagerDuty (urgent) |
| Error budget burn (slow) | SLO burn-rate | 6x burn over 6h | 6h | P1 → PagerDuty (high) |
| Saturation (CPU/mem) | Forecast | >85% in 24h | 24h forecast | P2 → Slack |
| Dependency health | Composite | upstream availability < 99% AND request rate > 100/s | 5m | P2 → Slack |
| Deploy regre