grafana-panel-engineer

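TotalClaw author kn7d5eszdfwftk153ymhdm4qhs83qsqy v1.0.0

Design Grafana dashboards that engineers actually use under pressure at 3 a.m., not pretty dashboards that only look good in vendor demos. Covers USE / RED / Four Golden Signals layouts, panel type selection (time series, gauge, stat, table, heatmap, state timeline), variable templating with cascading and multi-value patterns, drill-down links between dashboards, exemplar trace linking from Prometheus to Tempo, mixed-datasource queries (Prom + Loki + Tempo + Pyroscope), transformation rules, alert generation from panels, and query performance optimization (cardinality, range, max data points). Acts as a senior SRE who has built the on-call rotation dashboards that actually get opened during incidents at unicorn-scale companies. Use when a dashboard is unreadable, a service needs a dashboard pack from scratch, queries are expensive, or exemplar trace links need wiring. Triggers on "grafana", "grafana dashboard", "panel design", "USE method", "RED method", "four golden signals", "exemplar", "prometheus", "loki", "tempo", "pyroscope", "templating", "variables", "drill-down", "cardinality", "dashboard performance", "grafana alerting".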

Install / Download

TotalClaw CLI (recommended)
totalclaw install totalclaw:kn7d5eszdfwftk153ymhdm4qhs83qsqy~grafana-panel-engineer
cURL (direct download, no login required)
curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Akn7d5eszdfwftk153ymhdm4qhs83qsqy~grafana-panel-engineer/file -o grafana-panel-engineer.md

## Original

# Grafana Panel Engineer

Design Grafana dashboards that load fast, answer the right question in under 30 seconds, and survive being looked at by a sleepy on-call engineer on a phone. Acts as a senior SRE who has built and pruned thousands of dashboards across Prometheus, Loki, Tempo, Pyroscope, Mimir, InfluxDB, and CloudWatch — and knows which dashboard layouts the rotation actually opens during incidents (RED for service health, USE for capacity, customer journey for product-tier impact).

This skill builds dashboards. It does not write your detection logic, define your SLOs, or replace your APM. It assumes data is already flowing into a Grafana datasource; the job is panel selection, layout, variables, drill-downs, exemplar linking, and query optimization. Output is dashboard JSON (importable, version-controlled, ideally as Grafana provisioning YAML or Terraform `grafana_dashboard`), plus a panel-design playbook for new services.

## Usage

Invoke when:

- A new service is shipping and needs its dashboard pack from day one
- An existing dashboard takes 60 seconds to load
- Engineers say "I never know which dashboard to open during an incident"
- A panel uses a query with cardinality of 50,000 series and the renderer hangs
- Exemplars from Prometheus aren't wired to Tempo traces
- Variable dropdowns produce 10,000 entries; selection is unusable
- Dashboards are duplicated across teams with subtle drift
- A dashboard alert triggers but the panel doesn't show what tripped it
- Loki + Prometheus + Tempo are all installed but never queried in the same panel
- Grafana usage analytics show the dashboard count is 1,200 but only 80 are opened in a month

**Basic invocations:**
> Build a dashboard pack for our new T0 checkout service
> Audit our 1,200 dashboards and prune the unused ones
> Wire Prometheus exemplars to Tempo traces in our HTTP latency panel
> Convert our service's dashboard from USE to RED + Four Golden Signals
> Optimize this dashboard — it takes 90s to load and hangs the browser

## Inputs Required

- Grafana URL + API key (or service-account token, ideally read-write)
- Datasource list: Prometheus/Mimir/Thanos, Loki, Tempo, Pyroscope, others
- Service catalog: name, tier (T0-T3), team owner, runbook URL
- Existing dashboard inventory (export via API: `GET /api/search?type=dash-db`)
- Metric naming conventions (Prometheus: `service_name_thing_total`, etc.)
- Trace sampling: % sampled, exemplar storage backend, trace ID propagation
- SLOs and the metrics behind them (for SLO dashboards)
- Constraints: max series per panel, query timeout, dashboard refresh rate
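
The inventory export mentioned above can be sketched as a small request builder. This is a minimal sketch, not a definitive client: the function name, the example host, and the page size of 500 are assumptions for illustration, and the token is assumed to be a service-account token sent as a Bearer header.

```python
import urllib.parse
import urllib.request


def build_inventory_request(base_url: str, token: str,
                            page: int = 1, limit: int = 500) -> urllib.request.Request:
    """Build (but do not send) a request listing dashboards via /api/search.

    `base_url` and `token` are placeholders supplied by the caller;
    `limit` and `page` are used here to page through large inventories.
    """
    query = urllib.parse.urlencode({"type": "dash-db", "limit": limit, "page": page})
    url = f"{base_url.rstrip('/')}/api/search?{query}"
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/json"}
    return urllib.request.Request(url, headers=headers)


# Hypothetical host and token, for illustration only.
request = build_inventory_request("https://grafana.example.com/", "TOKEN")
```

Passing `request` to `urllib.request.urlopen` and decoding the body with `json.load` would yield the dashboard list; paging continues until a page comes back empty.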

## Workflow

1. **Inventory existing dashboards.** API: `GET /api/search`, `GET /api/dashboards/uid/{uid}`. Tag each with: last viewed (Grafana usage analytics), data sources used, panel count, owner. Anything not viewed in 90 days and not the only copy of its data is a deletion candidate.

2. **Classify dashboards by purpose.** Service health (RED), capacity (USE), customer journey (funnel/conversion), SLO (error budget), debug (drill-down for incident response), executive (summary). Each purpose has a different layout pattern.

3. **Pick a layout pattern per dashboard.** USE (Utilization/Saturation/Errors) for resource panels — CPU, memory, disk, network. RED (Rate/Errors/Duration) for request-driven services. Four Golden Signals for full service health. Customer journey for product flows. See Dashboard Layouts.

4. **Pick the right panel type per metric.** Time series for trends, stat for current value with trend sparkline, gauge for "% of capacity" only, heatmap for distributions and percentile streaks, table for top-N, state-timeline for service state, bar gauge for ranked comparison. See Panel Type Decision Tree.

5. **Design variables (templating).** Variables drive reusability. Cascade environment → cluster → namespace → service → instance. Use `Multi-value` and `Include All` carefully — they explode query cardinality. See Variable Templating Patterns.

6. **Build the queries.** PromQL with `rate()` for counters, `histogram_quantile()` for histograms, `irate()` only for short-window debug. Use `$__rate_interval` as the range inside `rate()` so the window tracks the Prometheus scrape interval, not the dashboard refresh.

7. **Add drill-down links.** Each panel that shows a metric for a service should link to the deeper service dashboard. Each error-rate panel should link to logs (Loki) and traces (Tempo) filtered to the same time range and labels.

8. **Wire exemplars.** Prometheus histograms with exemplar support emit trace IDs alongside latency buckets. Enable the "Exemplars" toggle on the panel's query and link to Tempo with `${__value.raw}` as the trace ID.

9. **Optimize for performance.** Limit cardinality (`topk`, `bottomk`), set max points (`Max data points` panel option), use recording rules for expensive queries, push aggregation to Prometheus where possible. See Performance Optimization.

10. **Set dashboard refresh + time range defaults.** Refresh: 30s for live ops dashboards, 5m for everything else, off for debug dashboards. Default time range: last 1h for ops, last 24h for trend dashboards.

11. **Add panel descriptions.** Every panel has a description: what the metric means, what's normal, what's bad, link to runbook. Hovering over the panel's info icon shows it.

12. **Generate alerts from panels.** Grafana unified alerting: from a panel, build an alert rule with the same query. Tie alert annotations to dashboard panel link so on-call gets a deep-link to the panel.

13. **Provision via code.** Dashboard JSON in git, deployed via Grafana provisioning YAML or Terraform. UI edits drift; provisioning prevents drift.

14. **Schedule audits.** Monthly: stale dashboard pruning (Grafana usage analytics), variable cardinality audit, query cost review. Quarterly: layout refresh, datasource audit.
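
Steps 6, 8, 9, and 11 above can be combined in one panel definition. The following is a minimal sketch that emits a single time series panel as a Python dict ready for `json.dumps`. The helper name, the metric `http_request_duration_seconds_bucket`, the `service` label, the datasource UID, and the runbook URL are all assumptions for illustration; a real dashboard JSON carries more fields than shown here.

```python
import json


def timeseries_panel(title: str, expr: str, datasource_uid: str,
                     description: str, exemplar: bool = False,
                     max_data_points: int = 500) -> dict:
    """Return a minimal Grafana time series panel as a JSON-serialisable dict."""
    return {
        "type": "timeseries",
        "title": title,
        # Step 11: what the metric means and where the runbook lives.
        "description": description,
        "datasource": {"type": "prometheus", "uid": datasource_uid},
        # Step 9: cap the number of points the query returns per series.
        "maxDataPoints": max_data_points,
        # Step 8: "exemplar" mirrors the Exemplars toggle on the query.
        "targets": [{"refId": "A", "expr": expr, "exemplar": exemplar}],
    }


# Step 6: rate() over $__rate_interval, wrapped in histogram_quantile().
# Metric and label names below are hypothetical.
checkout_p99 = timeseries_panel(
    title="checkout p99 latency",
    expr=(
        "histogram_quantile(0.99, sum by (le) ("
        'rate(http_request_duration_seconds_bucket{service="checkout"}'
        "[$__rate_interval])))"
    ),
    datasource_uid="prom-uid",
    description="p99 request duration. Runbook: https://runbooks.example.com/checkout",
    exemplar=True,
)
panel_json = json.dumps(checkout_p99, indent=2)
```

Keeping the builder this small makes it easy to review in git (step 13), since each panel is one function call rather than a hand-edited JSON blob.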

## Dashboard Layouts

### USE Method (Brendan Gregg)

For *resources*: CPU, memory, disk, network, file handles. Three panels per resource.

```
[ Utilization % ]   [ Saturation (queue depth, runqueue) ]   [ Errors / sec ]
```

Layout: one row per resource (CPU row, Memory row, Disk row, Network row). Each row has 3 panels in U/S/E order. Time series, stacked by host or pod, top-10 by `topk(10, ...)`.

**When to use:** capacity / fleet health dashboards. Not for service health (use RED).
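
The one-row-per-resource grid described above can be sketched as a layout generator. This is an illustrative sketch only: it assumes Grafana's 24-column dashboard grid, an arbitrary panel height of 8 grid units, and a hypothetical helper name; queries are left out so that only the placement logic is shown.

```python
GRID_WIDTH = 24    # Grafana's dashboard grid is 24 columns wide
PANEL_HEIGHT = 8   # assumed panel height in grid units
SIGNALS = ["Utilization", "Saturation", "Errors"]  # U/S/E order


def use_layout(resources: list[str]) -> list[dict]:
    """Place three panels per resource, one resource per row."""
    width = GRID_WIDTH // len(SIGNALS)  # 8 columns per panel
    panels = []
    for row, resource in enumerate(resources):
        for col, signal in enumerate(SIGNALS):
            panels.append({
                "type": "timeseries",
                "title": f"{resource} {signal}",
                "gridPos": {
                    "h": PANEL_HEIGHT,
                    "w": width,
                    "x": col * width,
                    "y": row * PANEL_HEIGHT,
                },
            })
    return panels


layout = use_layout(["CPU", "Memory", "Disk", "Network"])
```

Each returned dict still needs `targets` and a datasource before it is a usable panel.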

### RED Method (Tom Wilkie)

For *services*: Rate (req/s), Errors (errors/s or %), Duration (p50/p95/p99).

```
Row 1 — Top-line:
  [ Rate (req/s) ]   [ Error rate (%) ]   [ p99 Duration (ms) ]
Row 2 — Per-endpoint:
  [ Rate by endpoint (stacked) ]   [ Error rate by endpoint (table top-10) ]   [ Duration heatmap ]
Row 3 — Per-status:
  [ Status code breakdown (stacked area) ]   [ 5xx by endpoint ]   [ Slow endpoints (table) ]
```

**When to use:** any request-driven service (HTTP, gRPC, message queue consumer). Default for T0/T1 services.
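
The top-line RED row can be sketched as three groups of PromQL expressions. This is a sketch under stated assumptions: the metric names `http_requests_total` and `http_request_duration_seconds_bucket`, the `service` and `status` labels, and the helper name are hypothetical and should be replaced with the service's actual naming convention.

```python
def red_queries(service: str) -> dict[str, str]:
    """Return PromQL strings for Rate, Errors (%), and Duration percentiles."""
    selector = f'service="{service}"'
    rate = f"sum(rate(http_requests_total{{{selector}}}[$__rate_interval]))"
    errors = (
        f'sum(rate(http_requests_total{{{selector}, status=~"5.."}}'
        "[$__rate_interval]))"
    )
    queries = {
        "rate": rate,
        # Error percentage: 5xx request rate over total request rate.
        "error_pct": f"100 * {errors} / {rate}",
    }
    # Duration: one expression per percentile, aggregated over buckets by `le`.
    for label, quantile in (("p50", 0.5), ("p95", 0.95), ("p99", 0.99)):
        queries[label] = (
            f"histogram_quantile({quantile}, sum by (le) ("
            f"rate(http_request_duration_seconds_bucket{{{selector}}}"
            "[$__rate_interval])))"
        )
    return queries


checkout = red_queries("checkout")
```

Adding `endpoint` to the `sum by (...)` clauses would give the per-endpoint variants used in Row 2.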

### Four Golden Signals (Google SRE Book)

Latency, Traffic, Errors, Saturation. RED + saturation.

```
Row 1: [ Traffic (req/s) ] [ Errors (%) ] [ Latency p99 ] [ Saturation (CPU/mem) ]
Row 2: [ Traffic by version (deploy overlay) ] [ Errors by error class ] [ Latency heatmap with exemplars ] [ Saturation forecast ]
Row 3: [ Throughput vs latency scatter ] [ Top errors (table) ] [ Slow queries (table from logs) ] [ Capacity headroom % ]
```

**When to use:** full service health for T0 services. Most flexible single dashboard.

### Customer Journey Layout

For end-to-end product flows (signup, checkout, search-to-purchase). Each