fly-io-deployer

SkillDB author kn7d5eszdfwftk153ymhdm4qhs83qsqy v1.0.0

Deploy and operate Node, Python, Go, Rust, Elixir, and Docker apps on Fly.io with production-grade fly.toml authoring, Machines API orchestration, region selection (latency vs sovereignty vs egress), Fly Postgres clustering, LiteFS for SQLite replication, Upstash Redis bindings, Tigris object storage, persistent volumes, WireGuard private networking with 6PN, secrets via flyctl, blue/green deploys via auto-stopping machines, scale-to-zero strategies, scheduled scaling, preview deploys per PR, multi-region replicas, hot-config reload, machine SSH, log shipping to Better Stack/Axiom/Datadog/Logtail, and aggressive cost tuning. Triggers on "fly.io", "flyctl", "fly machines", "fly.toml", "fly postgres", "litefs", "tigris", "upstash on fly", "fly deploy", "migrate from heroku", "migrate from render", "migrate from railway", "scale to zero", "fly regions", "fly volumes", "fly wireguard", "6pn", "fly secrets".

Install / download

TotalClaw CLI (recommended)

totalclaw install skilldb:kn7d5eszdfwftk153ymhdm4qhs83qsqy~fly-io-deployer

cURL (direct download, no login required)

curl -fsSL https://skills.taituai.com/api/skills/skilldb%3Akn7d5eszdfwftk153ymhdm4qhs83qsqy~fly-io-deployer/file -o fly-io-deployer.md

# Fly.io Deployer

Plan, ship, and operate apps on Fly.io's Machines platform with the discipline of a senior platform engineer who has migrated production stacks off Heroku, Render, Railway, and AWS. Produces a deployable `fly.toml`, a region plan, a stateful-services plan (Postgres / LiteFS / Redis / Tigris), a CI/CD pipeline, and a cost model — all sized for the actual traffic shape, not the marketing demo.

## Usage

Invoke when starting a new Fly app, migrating onto Fly, debugging a sick deploy, planning multi-region rollout, or cutting the bill. Equally useful for greenfield ("we want to ship a Rust API to fly") and rescue work ("our Render bill tripled, get us off in 2 weeks").

**Basic invocation:**
> Deploy this Node + Postgres app to Fly.io
> Migrate our Heroku stack (web + worker + Postgres + Redis) onto Fly
> Cut our $1,400/mo Fly bill in half without dropping regions

**With context:**
> Here's the Dockerfile and the Heroku Procfile — produce fly.toml + a migration runbook
> We need EU + US Postgres replicas with read-your-writes from web nodes
> Auto-stop machines but keep p95 cold-start under 800ms for the web app

The agent emits a `fly.toml`, optional `Dockerfile`, `litefs.yml`, `flyctl` migration scripts, GitHub Actions for deploys + preview environments, secret rotation script, and a one-page cost model.

## Inputs Required

- **App shape** — runtime (Node/Python/Go/Rust/Elixir/Bun/Deno/Docker), framework (Next.js/Rails/Django/FastAPI/Phoenix/Actix), entrypoint
- **Stateful needs** — Postgres? Redis? S3-compatible object storage? File system? SQLite?
- **Traffic profile** — req/s peak, geographic distribution, p95 latency target, daily/weekly seasonality
- **Compliance constraints** — data residency (EU-only? US-only? FRA mandatory?), HIPAA/PCI scope
- **Origin platform** (if migrating) — Heroku / Render / Railway / Vercel / AWS / DigitalOcean
- **Budget ceiling** — monthly USD cap matters when picking machine sizes and replica counts
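
The traffic profile and budget inputs feed a machine-count estimate. The sketch below is one way to do that arithmetic, assuming Little's law (in-flight requests = arrival rate × latency) and a per-machine concurrency `soft_limit`. The function name `estimate_machines`, the 1.5× headroom, and the minimum of 2 machines are illustrative assumptions, not Fly.io defaults.

```python
import math


def estimate_machines(peak_rps: float, p95_ms: float, soft_limit: int,
                      headroom: float = 1.5, minimum: int = 2) -> int:
    """Estimate machine count from Little's law (hypothetical helper).

    Using p95 latency instead of the mean is deliberately conservative.
    """
    in_flight = peak_rps * p95_ms / 1000          # concurrent requests at peak
    needed = math.ceil(in_flight * headroom / soft_limit)
    return max(minimum, needed)                   # keep a redundancy floor


# Example inputs: 1500 req/s peak, 200 ms p95, soft_limit of 200 per machine.
machines = estimate_machines(peak_rps=1500, p95_ms=200, soft_limit=200)
print(machines)
```

Multiplying the result by the per-machine monthly price gives a first number to compare against the budget ceiling before choosing machine sizes.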

## Workflow

1. Read the app and classify it: stateless web, stateful web (sessions on disk), worker, scheduled job, ws server, RPC, or full-stack monolith
2. Pick primary region from latency to majority of users + sovereignty (`fly platform regions` enumerates current set)
3. Decide replicas: single-region multi-machine vs multi-region active-active vs primary+read-replicas
4. Choose stateful services: Fly Postgres cluster, LiteFS+SQLite, external Supabase/Neon, Upstash Redis, Tigris/R2/S3
5. Author `fly.toml` (anatomy section below); generate `Dockerfile` if missing
6. Wire secrets via `flyctl secrets set` (never bake into image)
7. Create the app + provision volumes + provision Postgres + attach
8. First deploy with `--strategy=immediate` to a single machine; verify health
9. Scale to target shape with `fly scale count` + `fly machine clone --region`
10. Wire CI (deploy on main, preview app per PR)
11. Wire log shipping (Vector → Better Stack/Axiom/Datadog) and metrics (Fly Prometheus + Grafana)
12. Configure auto-stop / auto-start for cost; tune `min_machines_running`
13. Document rollback (`fly releases --image` to find the previous image reference, then `fly deploy --image <previous-image-ref>`)
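
The provisioning and first-deploy steps can be sketched as a dry-run plan that builds the `flyctl` invocations without running them. This is a minimal sketch: `plan_first_rollout` and its parameters are hypothetical names, the secret values are placeholders, and Postgres provisioning is left out. The app is created first here because `fly secrets set` needs an existing app to target.

```python
import shlex


def plan_first_rollout(app: str, region: str, volume: str, size_gb: int,
                       secret_names: list[str], target_count: int) -> list[list[str]]:
    """Return flyctl invocations in rollout order without executing them."""
    return [
        ["fly", "apps", "create", app],
        ["fly", "volumes", "create", volume, "--app", app,
         "--region", region, "--size", str(size_gb)],
        # Placeholder values only; read real ones from a secret store at run time.
        ["fly", "secrets", "set", "--app", app,
         *[f"{name}=<value>" for name in secret_names]],
        # First deploy goes out immediately so health can be verified quickly.
        ["fly", "deploy", "--app", app, "--strategy", "immediate"],
        ["fly", "scale", "count", str(target_count), "--app", app],
    ]


plan = plan_first_rollout("myapp-prod", "fra", "data", 10, ["DATABASE_URL"], 3)
for argv in plan:
    print(shlex.join(argv))
```

Printing the plan before running anything makes it easy to review in a pull request, and the same list can later be passed to `subprocess.run` one command at a time.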

## fly.toml Anatomy

Every field, what it does, and the most common mistake.

```toml
app = "myapp-prod"                       # globally unique; -prod / -staging / -pr-<n>
primary_region = "fra"                   # closest to majority users; influences PG primary
kill_signal = "SIGINT"                   # SIGINT is Fly's default; set "SIGTERM" if your runtime only handles that
kill_timeout = "30s"                     # must exceed your slowest in-flight request
swap_size_mb = 512                       # ENABLE — saves OOM kills on tight machines

[build]
  dockerfile = "Dockerfile"              # explicit > auto-detect; nixpacks/buildpacks fragile
  # build-target = "runtime"             # multi-stage final stage (note the hyphen)
  # [build.args]                         # build args live in their own table
  #   NODE_ENV = "production"

[deploy]
  strategy = "rolling"                   # rolling | bluegreen | canary | immediate
  max_unavailable = 0.33                 # rolling: fraction down at once
  release_command = "npm run db:migrate" # one-shot machine before traffic shifts
  wait_timeout = "5m"                    # how long to wait for a machine to become healthy before the deploy fails

[env]
  PORT = "8080"                          # match internal_port below
  NODE_ENV = "production"
  LOG_FORMAT = "json"                    # required for proper log shipping
  # Never put secrets here — use `flyctl secrets set`

[experimental]
  auto_rollback = true                   # roll back on health-check failure

[[mounts]]
  source = "data"                        # name a volume created via `fly volumes create`
  destination = "/data"
  initial_size = "10gb"
  auto_extend_size_threshold = 80        # %, auto-extends volume
  auto_extend_size_increment = "5gb"
  auto_extend_size_limit = "100gb"
  snapshot_retention = 7                 # days; default is 5

[[services]]
  internal_port = 8080
  protocol = "tcp"
  auto_stop_machines = "stop"            # stop | suspend | off; suspend = warm pause
  auto_start_machines = true
  min_machines_running = 1               # 0 only if cold start is acceptable
  processes = ["app"]                    # gates which process group serves this port

  [[services.ports]]
    port = 80
    handlers = ["http"]
    force_https = true

  [[services.ports]]
    port = 443
    handlers = ["tls", "http"]
    [services.ports.tls_options]
      alpn = ["h2", "http/1.1"]
      versions = ["TLSv1.2", "TLSv1.3"]

  [services.concurrency]
    type = "connections"                 # or "requests" for HTTP-aware
    soft_limit = 200                     # above this, the proxy prefers other machines
    hard_limit = 250                     # at this, the machine gets no new traffic

  [[services.tcp_checks]]
    interval = "15s"
    timeout = "2s"
    grace_period = "10s"                 # wait this long after machine start before checks count

  [[services.http_checks]]
    interval = "10s"
    timeout = "2s"
    grace_period = "30s"
    method = "GET"
    path = "/healthz"
    protocol = "http"
    tls_skip_verify = false
    [services.http_checks.headers]
      X-Health = "fly"

[[vm]]
  size = "shared-cpu-1x"                 # smallest; fine for low-traffic
  memory = "512mb"
  cpus = 1
  cpu_kind = "shared"                    # shared | performance
  # gpu_kind = "a10"                     # only if doing GPU inference
  processes = ["app"]

[processes]
  app    = "node server.js"
  worker = "node worker.js"
  cron   = "node cron.js"

[[statics]]
  guest_path = "/app/public"             # served by Fly's proxy from the image; requests skip the app
  url_prefix = "/static/"

[metrics]
  port = 9091
  path = "/metrics"                      # Fly's Prometheus scrapes this
```
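
Two of the fields above are easy to get wrong by arithmetic alone: `kill_timeout` against the slowest request, and `soft_limit` against `hard_limit`. The sketch below checks both. `to_seconds` and `check_timeouts` are hypothetical helpers, and the duration parser only handles `s`, `m`, and `h` units, which covers the values used in this file.

```python
import re

_UNITS = {"s": 1, "m": 60, "h": 3600}


def to_seconds(value) -> float:
    """Convert 30, "30s", "5m", or "1m30s" to seconds."""
    if isinstance(value, (int, float)):
        return float(value)
    parts = re.findall(r"(\d+(?:\.\d+)?)([smh])", value)
    # Reject anything the pattern did not consume completely (e.g. "500ms").
    if not parts or "".join(n + u for n, u in parts) != value:
        raise ValueError(f"unsupported duration: {value!r}")
    return sum(float(n) * _UNITS[u] for n, u in parts)


def check_timeouts(kill_timeout, slowest_request_s: float,
                   soft_limit: int, hard_limit: int) -> list[str]:
    """Return human-readable problems; an empty list means both checks pass."""
    problems = []
    if to_seconds(kill_timeout) <= slowest_request_s:
        problems.append("kill_timeout should exceed the slowest in-flight request")
    if soft_limit >= hard_limit:
        problems.append("soft_limit should sit below hard_limit")
    return problems


# The values from the anatomy above, checked against a 45-second worst-case request.
problems = check_timeouts("30s", 45, 200, 250)
print(problems)
```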

**Common mistakes:**
- `internal_port` does not match `PORT` env → connection refused, healthchecks 502
- `min_machines_running = 0` on a stateful service → first user gets a 30s cold start
- No `release_command` → migrations race the rolling deploy and break readers
- `kill_timeout` shorter than slowest request → 502s on every deploy
- `auto_stop_machines = "stop"` on a service that keeps a cache in RAM → the cache is cold after every wake (an attached volume survives a stop, memory does not; `"suspend"` keeps memory)
- `processes` declared but no `[[services]] processes = [...]` filter → worker exposes HTTP

## Region Strategy

Fly has 35+ regions. Picking three is harder than picking one.

**Tiers by latency to global users** (rough p50 from CDN telemetry):

| Tier | Regions | Use case |
|------|---------|----------|
| 1 | `fra` (Frankfurt), `iad` (Ashburn), `sjc` (San Jose), `nrt` (Tokyo), `syd` (Sydney), `gru` (São Paulo) | Most apps land 80% of traffic in 3 of these |
| 2 | `lhr` (London), `cdg` (Paris), `ams` (Amsterdam), `ord` (Chicago), `dfw` (Dallas), `lax` (LA), `sea` (Seattle), `hkg` (Hong Kong), `sin` (Singapore), `bom` (Mumbai) | Fill p95 gaps |
| 3 | `arn` (Stockholm), `mad` (Madrid), `waw` (Warsaw), `otp` (Bucharest), `jnb` (Johannesburg), `eze` (Buenos Aires), `scl` (Santiago), `qro` (Querétaro), `gdl` (Guadalajara), `bog` (Bogotá), `den` (Denver), `mia` (Miami), `yyz` (Toronto