grafana-panel-engineer

ClawSkills 作者 kn7d5eszdfwftk153ymhdm4qhs83qsqy v1.0.0

Design Grafana dashboards engineers actually use under pressure at 3am, not pretty dashboards that look good in a vendor pitch. Covers USE / RED / Four Golden Signals layouts, panel type selection (time series vs gauge vs stat vs table vs heatmap vs state-timeline), variable templating with cascading and multi-value patterns, drill-down links between dashboards, exemplar trace linking from Prometheus to Tempo, mixed-datasource queries (Prom + Loki + Tempo + Pyroscope), transformation rules, alert generation from panels, and query performance optimization (cardinality, range, max points). Acts as a senior SRE who has built the dashboards the on-call rotation actually opens during incidents at unicorn-scale companies. Use when a dashboard is unreadable, when a service needs its dashboard pack from scratch, when query cost is high, or when exemplar trace linking needs to be wired. Triggers on "grafana", "grafana dashboard", "panel design", "use method", "red method", "four golden signals", "exemplar", "prometheus", "loki", "tempo", "pyroscope", "templating", "variable", "drill-down", "cardinality", "dashboard performance", "grafana alert".

安装 / 下载方式

TotalClaw CLI推荐
totalclaw install clawskills:kn7d5eszdfwftk153ymhdm4qhs83qsqy~grafana-panel-engineer
cURL直接下载,无需登录
curl -fsSL https://skills.taituai.com/api/skills/clawskills%3Akn7d5eszdfwftk153ymhdm4qhs83qsqy~grafana-panel-engineer/file -o grafana-panel-engineer.md
# Grafana Panel Engineer

Design Grafana dashboards that load fast, answer the right question in under 30 seconds, and survive being looked at by a sleepy on-call engineer on a phone. Acts as a senior SRE who has built and pruned thousands of dashboards across Prometheus, Loki, Tempo, Pyroscope, Mimir, InfluxDB, and CloudWatch — and knows which dashboard layouts the rotation actually opens during incidents (RED for service health, USE for capacity, customer journey for product-tier impact).

This skill builds dashboards. It does not write your detection logic, define your SLOs, or replace your APM. It assumes data is already flowing into a Grafana datasource; the job is panel selection, layout, variables, drill-downs, exemplar linking, and query optimization. Output is dashboard JSON (importable, version-controlled, ideally as Grafana provisioning YAML or Terraform `grafana_dashboard`), plus a panel-design playbook for new services.

## Usage

Invoke when:

- A new service is shipping and needs its dashboard pack from day one
- An existing dashboard takes 60 seconds to load
- Engineers say "I never know which dashboard to open during an incident"
- A panel uses a query with cardinality of 50,000 series and the renderer hangs
- Exemplars from Prometheus aren't wired to Tempo traces
- Variable dropdowns produce 10,000 entries; selection is unusable
- Dashboards are duplicated across teams with subtle drift
- A dashboard alert triggers but the panel doesn't show what tripped it
- Loki + Prometheus + Tempo are all installed but never queried in the same panel
- Grafana usage analytics show the dashboard count is 1,200 but only 80 are opened in a month

**Basic invocations:**
> Build a dashboard pack for our new T0 checkout service
> Audit our 1,200 dashboards and prune the unused ones
> Wire Prometheus exemplars to Tempo traces in our HTTP latency panel
> Convert our service's dashboard from USE to RED + Four Golden Signals
> Optimize this dashboard — it takes 90s to load and hangs the browser

## Inputs Required

- Grafana URL + API key (or service-account token, ideally read-write)
- Datasource list: Prometheus/Mimir/Thanos, Loki, Tempo, Pyroscope, others
- Service catalog: name, tier (T0-T3), team owner, runbook URL
- Existing dashboard inventory (export via API: `GET /api/search?type=dash-db`)
- Metric naming conventions (Prometheus: `service_name_thing_total`, etc.)
- Trace sampling: % sampled, exemplar storage backend, trace ID propagation
- SLOs and the metrics behind them (for SLO dashboards)
- Constraints: max series per panel, query timeout, dashboard refresh rate

## Workflow

1. **Inventory existing dashboards.** API: `GET /api/search`, `GET /api/dashboards/uid/{uid}`. Tag each with: last viewed (Grafana usage analytics), data sources used, panel count, owner. Anything not viewed in 90 days and not the only copy of its data is a deletion candidate.

2. **Classify dashboards by purpose.** Service health (RED), capacity (USE), customer journey (funnel/conversion), SLO (error budget), debug (drill-down for incident response), executive (summary). Each purpose has a different layout pattern.

3. **Pick a layout pattern per dashboard.** USE (Utilization/Saturation/Errors) for resource panels — CPU, memory, disk, network. RED (Rate/Errors/Duration) for request-driven services. Four Golden Signals for full service health. Customer journey for product flows. See Dashboard Layouts.

4. **Pick the right panel type per metric.** Time series for trends, stat for current value with trend sparkline, gauge for "% of capacity" only, heatmap for distributions and percentile streaks, table for top-N, state-timeline for service state, bar gauge for ranked comparison. See Panel Type Decision Tree.

5. **Design variables (templating).** Variables drive reusability. Cascade environment → cluster → namespace → service → instance. Use `Multi-value` and `Include All` carefully — they explode query cardinality. See Variable Templating Patterns.

6. **Build the queries.** PromQL with `rate()` for counters, `histogram_quantile()` for histograms, `irate()` only for short-window debug. Set query interval to match Prometheus scrape (`$__rate_interval`), not the dashboard refresh.

7. **Add drill-down links.** Each panel that shows a metric for a service should link to the deeper service dashboard. Each error-rate panel should link to logs (Loki) and traces (Tempo) filtered to the same time range and labels.

8. **Wire exemplars.** Prometheus histograms with exemplar support emit trace IDs alongside latency buckets. Configure the panel's "Exemplars" toggle and link to Tempo with `${__value.raw}` as the trace ID.

9. **Optimize for performance.** Limit cardinality (`topk`, `bottomk`), set max points (`Max data points` panel option), use recording rules for expensive queries, push aggregation to Prometheus where possible. See Performance Optimization.

10. **Set dashboard refresh + time range defaults.** Refresh: 30s for live ops dashboards, 5m for everything else, off for debug dashboards. Default time range: last 1h for ops, last 24h for trend dashboards.

11. **Add panel descriptions.** Every panel has a description: what the metric means, what's normal, what's bad, link to runbook. Right-click info icon shows it.

12. **Generate alerts from panels.** Grafana unified alerting: from a panel, build an alert rule with the same query. Tie alert annotations to dashboard panel link so on-call gets a deep-link to the panel.

13. **Provision via code.** Dashboard JSON in git, deployed via Grafana provisioning YAML or Terraform. UI edits drift; provisioning prevents drift.

14. **Schedule audits.** Monthly: stale dashboard pruning (Grafana usage analytics), variable cardinality audit, query cost review. Quarterly: layout refresh, datasource audit.

## Dashboard Layouts

### USE Method (Brendan Gregg)

For *resources*: CPU, memory, disk, network, file handles. Three panels per resource.

```
[ Utilization % ]   [ Saturation (queue depth, runqueue) ]   [ Errors / sec ]
```

Layout: one row per resource (CPU row, Memory row, Disk row, Network row). Each row has 3 panels in U/S/E order. Time series, stacked by host or pod, top-10 by `topk(10, ...)`.

**When to use:** capacity / fleet health dashboards. Not for service health (use RED).

### RED Method (Tom Wilkie)

For *services*: Rate (req/s), Errors (errors/s or %), Duration (p50/p95/p99).

```
Row 1 — Top-line:
  [ Rate (req/s) ]   [ Error rate (%) ]   [ p99 Duration (ms) ]
Row 2 — Per-endpoint:
  [ Rate by endpoint (stacked) ]   [ Error rate by endpoint (table top-10) ]   [ Duration heatmap ]
Row 3 — Per-status:
  [ Status code breakdown (stacked area) ]   [ 5xx by endpoint ]   [ Slow endpoints (table) ]
```

**When to use:** any request-driven service (HTTP, gRPC, message queue consumer). Default for T0/T1 services.

### Four Golden Signals (Google SRE Book)

Latency, Traffic, Errors, Saturation. RED + saturation.

```
Row 1: [ Traffic (req/s) ] [ Errors (%) ] [ Latency p99 ] [ Saturation (CPU/mem) ]
Row 2: [ Traffic by version (deploy overlay) ] [ Errors by error class ] [ Latency heatmap with exemplars ] [ Saturation forecast ]
Row 3: [ Throughput vs latency scatter ] [ Top errors (table) ] [ Slow queries (table from logs) ] [ Capacity headroom % ]
```

**When to use:** full service health for T0 services. Most flexible single dashboard.

### Customer Journey Layout

For end-to-end product flows (signup, checkout, search-to-purchase). Each panel = one step in the funnel.

```
Row 1: [ Step 1 success% ] [ Step 2 success% ] [ Step 3 success% ] [ Step N success% ]
Row 2: [ Step 1 latency p99 ] [ Step 2 latency p99 ] ...
Row 3: [ Drop-off Sankey or stacked bar ]   [ Bottleneck step over time ]
Row 4: [ Per-segment funnel (enterprise vs free) ]   [ Errors at top drop-off step ]
```

**When to use:** product-tier dashboards, exec dashboards, SRE dashboards for revenue-path services.

### SLO / Error Budget Layout

```
Row 1 (top-lin