pagerduty-escalation-architect
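Design PagerDuty escalation policies, schedules, services, response plays, and incident workflows that handle real on-call traffic without burning out the rotation. Covers follow-the-sun rotations, primary/secondary patterns, weekly handoffs, escalation timeouts matched to severity, business-hours versus always-on services, vacation override patterns, training shadow rotations, response play composition (Statuspage, Slack, conference bridge, Zoom auto-creation), incident workflows, and postmortem auto-generation from the PagerDuty timeline. Acts as a senior SRE who has run a 200-engineer on-call program across three time zones, audited comp models for fairness, and passed multiple SOC2 audits of escalation evidence. Use when on-call is unfair, when escalations time out before a human sees the page, when a new service needs routing set up, when a merger combines two PagerDuty tenants, or when the comp/fairness math is overdue. Triggers on "pagerduty", "escalation policy", "on-call rotation", "schedule", "follow-the-sun", "primary secondary", "response play", "incident workflow", "on-call comp", "on-call fairness", "shadow rotation", "service routing", "pd routing".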
Installation / Download
TotalClaw CLI (recommended)
totalclaw install totalclaw:kn7d5eszdfwftk153ymhdm4qhs83qsqy~pagerduty-escalation-architect
cURL direct download (no login required)
curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Akn7d5eszdfwftk153ymhdm4qhs83qsqy~pagerduty-escalation-architect/file -o pagerduty-escalation-architect.md

## Original

# PagerDuty Escalation Architect

Design and tune PagerDuty escalation policies, schedules, services, and response plays. Acts as a senior SRE who has built on-call programs across three continents, run the fairness audits when engineers complain about uneven pages, and rebuilt the routing during a merger that combined two PagerDuty tenants with conflicting taxonomies.

This skill builds the routing: schedules, escalation policies, services, response plays, business-hour rules, overrides, and the math behind on-call comp. It does not write your detection logic (that's the monitoring stack), and it does not run incidents (that's the incident commander). It assumes your alert sources are already wired into PagerDuty; the job is what happens *after* the alert lands.

## Usage

Invoke when:

- Engineers say on-call is uneven; some weeks bring 30 pages, others 0
- A service was routed to a defunct schedule and pages went nowhere for hours
- The escalation policy times out before any human sees the page
- A new T0 service is onboarding and needs routing from day one
- A merger doubled the PagerDuty tenant and the routing is a mess
- Postmortems lack the PD timeline; nobody reconstructs the response
- On-call comp is being challenged; the fairness math needs to be defensible
- Vacation overrides are manual and break every quarter
- A status page should auto-update from PagerDuty incidents but doesn't
- SOC2 auditors want evidence of escalation timeliness

**Basic invocations:**

> Build escalation policies for our 12 services, T0 to T2

> We're going follow-the-sun across US/EU/APAC; design the schedule

> Audit our PD tenant; 80 services, 120 schedules, no idea what's used

> Generate the on-call fairness report for Q1

> Compose response plays so a P1 auto-creates a Slack channel and Zoom bridge

## Inputs Required

- PagerDuty subdomain + API token (read at minimum, write for changes)
- Service catalog: name, tier (T0-T3), team owner, expected page volume
- Engineer roster per team with time zones, vacation calendar, training status
- On-call comp policy (flat rate, hourly, comp time, etc.)
- Existing escalation policies and schedules (export via API)
- Routing requirements: business-hours-only services, follow-the-sun targets
- Integrations in use: Slack, Zoom, Statuspage, Jira, Linear, Datadog, Sentry
- Compliance constraints (SOC2, ISO27001 evidence requirements)

## Workflow

1. **Audit the existing tenant.** API calls: `GET /services`, `/escalation_policies`, `/schedules`, `/users`, `/teams`, `/response_plays`. Tag each object with its last-used date (incidents API, last 90 days). Anything unused for 90+ days is a deletion candidate, with the exception of compliance-required services.
2. **Build the team roster.** Each engineer needs: time zone, on-call eligibility (probation? training shadow? full?), vacation windows for the next 6 months, language/region constraints. This drives schedule design.
3. **Map services to tiers.** Cross-reference the service catalog. Each PD service must have `tier` in its name or description and a `team` association. Untiered services go to a triage list.
4. **Design schedules per team using a documented pattern** (see Schedule Design Patterns). Most teams pick: weekly handoff Mon 09:00 local, primary + secondary, follow-the-sun if multi-region.
5. **Build escalation policies tied to severity.** Tier dictates timeouts: T0 escalates fast (5 min → 10 min → 15 min); T2 escalates slowly (30 min → 60 min → out-of-hours suppress). Document each policy's *intent* in its description field.
6. **Wire services to escalation policies.** Each PD service points to one escalation policy. Use Event Rules and Service Orchestration for routing fan-out (one alert source → multiple services based on payload).
7. **Build response plays.** Each tier has a baseline play that fires on incident trigger: post Slack channel, create Zoom, post Statuspage placeholder, page secondary on critical, link runbook. Plays are composable; build a library.
8. **Set business-hour rules.** Some services should never page at 3am (internal tools, marketing site). Use schedule "layers" restricted to business hours that fall through to a digest queue out-of-hours.
9. **Configure overrides for vacation.** Calendar integration (Google Calendar) auto-overrides; manual overrides are a backup. Document the swap process in the runbook so engineers don't manually edit schedules.
10. **Set up training shadow rotations.** New on-call engineers shadow for 4-8 weeks: they receive the same pages, are not responsible for responding, and debrief with the primary. Implement as a parallel layer that doesn't escalate.
11. **Wire integrations.** Slack: incident channels auto-created, status posts on state changes. Zoom: bridge auto-created on P0/P1. Statuspage: incident drafts on customer-facing services. Jira/Linear: post-incident ticket auto-created with timeline.
12. **Wire incident workflows.** PagerDuty Incident Workflows trigger on conditions (priority, service, payload field). Use them to fork response plays, auto-add stakeholders, run command-line tools.
13. **Generate the fairness report.** Per-engineer page count, page count outside business hours, weekend page count, escalations missed. Ship monthly. See Fairness Math section.
14. **Schedule audits.** Monthly: stale-services pruning, override audit, fairness review. Quarterly: tier reassignment, schedule pattern review. Annually: compliance evidence export.

## Schedule Design Patterns

The biggest source of on-call pain is the schedule. Pick one of these patterns deliberately; mixing patterns inside one team breaks fairness math.

### Pattern 1 — Weekly Handoff, Single Region, Primary + Secondary

The default for teams of 6-12 engineers in one time zone. Each engineer is primary one week per N weeks and secondary the week before their primary week, which is what the offset-by-1 member order in the secondary schedule produces.

```
Schedule: <team>-primary
  Layer 1: rotation 7 days, Mon 09:00 local, members [A, B, C, D, E, F]
Schedule: <team>-secondary
  Layer 1: rotation 7 days, Mon 09:00 local, members [B, C, D, E, F, A]  # offset by 1
```

Pages first hit primary (5 min) → secondary (10 min) → manager (20 min). Comp: flat rate per primary week; secondary at 0.3x rate.

### Pattern 2 — Follow-the-Sun (US / EU / APAC)

For 24/7 critical services with engineers in three regions. Each region holds the rotation during their business day.
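
Because each region holds only an eight-hour window, it is worth checking that the three business days actually tile the weekday clock before committing to layer hours. The following is a minimal sketch, not PagerDuty tooling: it assumes 09:00-17:00 local windows in PT, CET, and SGT, uses fixed standard-time UTC offsets (an assumption; daylight saving shifts are ignored), and the names `REGION_WINDOWS`, `covered_utc_hours`, and `coverage_gaps` are hypothetical.

```python
# Sketch: check weekday UTC coverage of three follow-the-sun windows.
# Offsets are fixed standard-time values (an assumption); DST is ignored.
REGION_WINDOWS = {
    "us": {"utc_offset": -8, "start": 9, "end": 17},   # 09:00-17:00 PT (PST)
    "eu": {"utc_offset": 1, "start": 9, "end": 17},    # 09:00-17:00 CET
    "apac": {"utc_offset": 8, "start": 9, "end": 17},  # 09:00-17:00 SGT
}


def covered_utc_hours(window):
    """Return the set of UTC hours (0-23) that a local-time window covers."""
    length = (window["end"] - window["start"]) % 24
    first_utc_hour = (window["start"] - window["utc_offset"]) % 24
    return {(first_utc_hour + i) % 24 for i in range(length)}


def coverage_gaps(windows):
    """Return the sorted UTC hours that no region's window covers."""
    covered = set()
    for window in windows.values():
        covered |= covered_utc_hours(window)
    return sorted(set(range(24)) - covered)


if __name__ == "__main__":
    for name, window in REGION_WINDOWS.items():
        print(name, sorted(covered_utc_hours(window)))
    print("uncovered UTC hours:", coverage_gaps(REGION_WINDOWS))
```

Under these assumed offsets the check reports UTC hour 16 as uncovered (17:00 CET is 08:00 PT), so one window would need to be widened or the hour handed to a fallback layer. Rerun the check with summer offsets before relying on it.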

```
Schedule: <team>-fts
  Layer 1: members [A_us, B_us, C_us], 09:00–17:00 PT, Mon–Fri
  Layer 2: members [D_eu, E_eu, F_eu], 09:00–17:00 CET, Mon–Fri
  Layer 3: members [G_apac, H_apac, I_apac], 09:00–17:00 SGT, Mon–Fri
  Final layer: weekend rotation across all members
```

Each engineer is on-call only during their business day. Weekend rotation rotates fairly across all 9. Pages between regions auto-handoff at layer boundaries; no engineer is paged at 3am unless the weekend rotation is on them.

Caveat: requires all three regions to be staffed. If APAC has 1 engineer, you're back to a single-point rotation. Start with US+EU follow-the-sun, add APAC when staffing supports it.

### Pattern 3 — Daily Rotation, Small Team

For teams of 3-5 where weekly is too long (one engineer is "on" 25% of all weeks). Daily rotation flattens load.

```
Schedule: <team>-daily
  Layer 1: rotation 1 day, Mon–Fri 09:00 local, members [A, B, C, D]
  Layer 2 (weekend): rotation 2 days, Sat 09:00 local, members [A, B, C, D]
```

Comp tracked daily. Trade-off: more handoffs = more dropped context. Document handoff in a daily on-call log.

### Pattern 4 —