Performance Engineering System

ClawSkills, author: 1kalin, v1.0.0

Complete performance engineering system — profiling, optimization, load testing, capacity planning, and performance culture. Use when diagnosing slow applications, optimizing code/queries/infrastructure, load testing before launch, planning capacity, or building performance into CI/CD. Covers Node.js, Python, Go, Java, databases, APIs, and frontend.

Installation / Download

TotalClaw CLI (recommended)

totalclaw install clawskills:1kalin~afrexai-performance-engineering

cURL: direct download, no login required

curl -fsSL https://skills.taituai.com/api/skills/clawskills%3A1kalin~afrexai-performance-engineering/file -o afrexai-performance-engineering.md

Git: get the source from the repository

git clone https://github.com/openclaw/skills
git -C skills checkout 426f1427c543a99b2de9d58f1730109273ab77a2

# Performance Engineering System

> From "it's slow" to "here's why and here's the fix" — a complete methodology for measuring, diagnosing, optimizing, and preventing performance problems.

## Phase 1: Performance Investigation Brief

Before touching anything, define the problem.

```yaml
# performance-brief.yaml
investigation:
  reported_by: ""
  reported_date: ""
  system: ""              # service/app name
  environment: ""         # production, staging, dev

problem_statement:
  symptom: ""             # "API response time increased 3x"
  impact: ""              # "15% of users seeing timeouts"
  since_when: ""          # "After deploy v2.14 on Feb 20"
  affected_scope: ""      # "All endpoints" | "Only /search" | "Users in EU"

baselines:
  target_p50: ""          # e.g., "200ms"
  target_p95: ""          # e.g., "500ms"
  target_p99: ""          # e.g., "1000ms"
  current_p50: ""
  current_p95: ""
  current_p99: ""
  throughput_target: ""   # e.g., "1000 rps"
  error_rate_target: ""   # e.g., "<0.1%"

constraints:
  budget: ""              # time/money for optimization
  risk_tolerance: ""      # "Can we change the schema?" "Can we add caching?"
  deadline: ""            # "Must fix before Black Friday"

hypothesis:
  primary: ""             # "N+1 queries in the new recommendation engine"
  secondary: ""           # "Connection pool exhaustion under load"
  evidence: ""            # "Slow query log shows 200+ queries per request"
```

### Performance Budget Framework

Set budgets BEFORE building, not after complaints:

| Metric | Web App | API | Mobile | Batch Job |
|--------|---------|-----|--------|-----------|
| P50 response | <200ms | <100ms | <300ms | N/A |
| P95 response | <500ms | <250ms | <800ms | N/A |
| P99 response | <1s | <500ms | <1.5s | N/A |
| Error rate | <0.1% | <0.01% | <0.5% | <0.001% |
| Time to Interactive | <3s | N/A | <2s | N/A |
| Memory per request | <50MB | <20MB | <100MB | <1GB |
| CPU per request | <100ms | <50ms | <200ms | N/A |
| Throughput | 100+ rps | 500+ rps | N/A | items/min |
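
A budget is only useful if something checks it. As a minimal sketch, the snippet below computes nearest-rank percentiles from a list of latency samples and compares them with the API column of the table above. The names `API_BUDGET_MS`, `percentile`, and `check_budget` are hypothetical helpers introduced for illustration, not part of any library.

```python
import math

# Hypothetical budget in milliseconds, copied from the "API" column above.
API_BUDGET_MS = {"p50": 100, "p95": 250, "p99": 500}


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_budget(latencies_ms, budget_ms):
    """Return {metric: (observed, limit, within_budget)} for each budgeted percentile."""
    report = {}
    for name, limit in budget_ms.items():
        observed = percentile(latencies_ms, float(name[1:]))  # "p95" -> 95.0
        report[name] = (observed, limit, observed < limit)
    return report
```

Calling `check_budget(samples, API_BUDGET_MS)` on a window of recent request latencies yields one tuple per percentile, which could feed a CI gate or a dashboard. Nearest-rank is one of several percentile definitions; monitoring tools may interpolate and report slightly different values.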

## Phase 2: Measurement & Profiling

### The Golden Rule
**Never optimize without measuring first. Never measure without a hypothesis.**

### Profiling Decision Tree

```
Is it slow?
├── YES → Where is time spent?
│   ├── CPU-bound → Profile CPU (flame graph)
│   │   ├── Hot function found → Optimize algorithm/data structure
│   │   └── Spread evenly → Architecture problem (too many layers)
│   ├── I/O-bound → Profile I/O
│   │   ├── Database → Query analysis (Phase 4)
│   │   ├── Network → Connection profiling
│   │   ├── Disk → I/O scheduler + buffering
│   │   └── External API → Caching + async + circuit breaker
│   ├── Memory-bound → Profile allocations
│   │   ├── GC pressure → Reduce allocations, pool objects
│   │   ├── Memory leak → Heap snapshot comparison
│   │   └── Cache thrashing → Resize or eviction policy
│   └── Concurrency-bound → Profile locks/contention
│       ├── Lock contention → Reduce critical section, lock-free structures
│       ├── Thread starvation → Pool sizing
│       └── Deadlock → Lock ordering analysis
└── NO → Define "fast enough" (see budgets above)
```
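
The first branch of the tree (CPU-bound versus everything else) can often be answered before reaching for a profiler by comparing CPU time with wall-clock time. The sketch below uses only the standard library. The `classify` helper and its 0.8 threshold are assumptions chosen for illustration, not an established cutoff.

```python
import time


def classify(func, *args, cpu_ratio_threshold=0.8, **kwargs):
    """Run func once and compare process CPU time with wall-clock time.

    A ratio near 1.0 suggests CPU-bound work. A ratio near 0.0 means the
    process mostly waited (I/O, locks, sleep), so a CPU flame graph
    would show little of interest.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    ratio = cpu / wall if wall > 0 else 0.0
    label = "cpu-bound" if ratio >= cpu_ratio_threshold else "waiting (I/O, locks, or sleep)"
    return label, ratio
```

`time.process_time()` counts CPU time across all threads of the process, so the ratio can exceed 1.0 for multi-threaded native code. Treat the result as a hint about which profiler to use next, not as a diagnosis.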

### CPU Profiling by Language

#### Node.js
```bash
# Built-in profiler (V8)
node --prof app.js
node --prof-process isolate-*.log > profile.txt

# Inspector-based (connect Chrome DevTools)
node --inspect app.js
# Open chrome://inspect → Profiler → Start

# Clinic.js (best overall Node.js profiler)
npx clinic doctor -- node app.js
npx clinic flame -- node app.js    # Flame graph
npx clinic bubbleprof -- node app.js  # Async bottlenecks

# 0x (flame graphs)
npx 0x app.js
```

#### Python
```python
# cProfile (built-in)
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
# ... code to profile ...
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20)  # Top 20

# Line profiler (pip install line-profiler)
# Add @profile decorator, then:
# kernprof -l -v script.py

# py-spy (sampling profiler, no code changes)
# pip install py-spy
# py-spy top --pid <PID>
# py-spy record -o profile.svg --pid <PID>  # Flame graph

# Scalene (CPU + memory + GPU)
# pip install scalene
# scalene script.py
```

#### Go
```go
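// Built-in pprof
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/ handlers on http.DefaultServeMux
)

func main() {
    // Serve pprof on its own port, alongside your existing server.
    // Access: http://localhost:6060/debug/pprof/
    go func() { log.Println(http.ListenAndServe("localhost:6060", nil)) }()

    select {} // placeholder: replace with your application's main loop
}

// CLI analysis
// go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
// go tool pprof -http=:8080 profile.out  # Web UI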

#### Java
```bash
# async-profiler (best for JVM)
# https://github.com/async-profiler/async-profiler
./asprof -d 30 -f profile.html <PID>

# JFR (built-in since JDK 11)
java -XX:StartFlightRecording=duration=60s,filename=rec.jfr MyApp
jfr print --events CPULoad rec.jfr

# jstack (thread dump)
jstack <PID> > threads.txt
```

### Memory Profiling

#### Leak Detection Pattern (any language)
```
1. Take heap snapshot at T0
2. Run suspected operation N times
3. Force GC
4. Take heap snapshot at T1
5. Compare: objects that grew = potential leak
6. Check: are they reachable? From where? (retention path)
```

#### Node.js Memory
```javascript
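// Heap snapshot
const v8 = require('v8');

function takeSnapshot(label) {
  // writeHeapSnapshot() is synchronous (it blocks the event loop)
  // and returns the name of the file it wrote, not a stream
  const file = v8.writeHeapSnapshot(`${label}-${Date.now()}.heapsnapshot`);
  console.log(`Heap snapshot written to ${file}`);
  return file;
}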

// Process memory monitoring
setInterval(() => {
  const mem = process.memoryUsage();
  console.log({
    rss_mb: (mem.rss / 1048576).toFixed(1),
    heap_used_mb: (mem.heapUsed / 1048576).toFixed(1),
    heap_total_mb: (mem.heapTotal / 1048576).toFixed(1),
    external_mb: (mem.external / 1048576).toFixed(1),
  });
}, 10000);
```

#### Python Memory
```python
# tracemalloc (built-in)
import tracemalloc

tracemalloc.start()
# ... code ...
snapshot = tracemalloc.take_snapshot()
top = snapshot.statistics('lineno')
for stat in top[:10]:
    print(stat)

# objgraph (pip install objgraph)
import objgraph
objgraph.show_most_common_types(limit=20)
objgraph.show_growth(limit=10)  # Call twice to see what's growing
```

### Flame Graph Interpretation

```
Reading a flame graph (root at the bottom, leaves at the top):
┌──────────┐
│ readline │                                     ← Top = where CPU burns
├──────────┼───────────┐
│ parseCSV │ validate  │                         ← Tall = deep call stack
├──────────┴───────────┼──────────────────────┐
│     processData()    │    renderOutput()    │  ← Width = time spent
├──────────────────────┴──────────────────────┤
│                   main()                    │  ← Entry point (bottom)
└─────────────────────────────────────────────┘

WHAT TO LOOK FOR:
1. Wide plateaus at top → CPU-intensive leaf function (optimize this!)
2. Many thin towers → excessive function calls (batch or reduce)
3. Recursive patterns → potential stack overflow risk
4. Unexpected width → function taking more time than expected
5. GC/runtime frames → memory pressure

ACTION RULES:
- Plateau >20% width → must investigate
- Plateau >40% width → almost certainly the bottleneck
- If top 3 functions = 80% of time → focused optimization will work
- If evenly distributed → architectural change needed
```
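
The action rules above are simple enough to apply mechanically once per-function self-time is available, for example from a sampling profiler's output. The following is a sketch under assumptions: `triage` is a hypothetical helper, its input is a plain dict of function name to self-time sample count, and anything that fails the 80% rule is treated as "evenly distributed", which is a simplification.

```python
def triage(self_time_by_function):
    """Apply the flame-graph action rules to {function: self-time samples}."""
    total = sum(self_time_by_function.values())
    if total <= 0:
        raise ValueError("no samples")
    # Share of total self time per function, widest first.
    shares = sorted(
        ((name, count / total) for name, count in self_time_by_function.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    findings = []
    for name, share in shares:
        if share > 0.40:
            findings.append((name, share, "almost certainly the bottleneck"))
        elif share > 0.20:
            findings.append((name, share, "must investigate"))
    top3 = sum(share for _, share in shares[:3])
    strategy = "focused optimization" if top3 >= 0.80 else "consider architectural change"
    return findings, strategy
```

For the diagram above, an input such as `{"parseCSV": 450, "validate": 250, "renderOutput": 200, "readline": 100}` (made-up counts) would flag `parseCSV` and `validate` and suggest focused optimization.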

## Phase 3: Common Optimization Patterns

### Algorithm & Data Structure Optimizations

| Problem | Bad O() | Fix | Good O() |
|---------|---------|-----|----------|
| Repeated search of unsorted array | O(n) per lookup | Sort once + binary search, or use Set/Map | O(log n) or O(1) per lookup |
| Nested loop matching | O(n²) | Hash map lookup | O(n) |
| Repeated string concat | O(n²) | StringBuilder/join array | O(n) |
| Sorting already-sorted data | O(n log n) | Check if sorted first | O(n) |
| Finding duplicates | O(n²) | Set-based detection | O(n) |
| Frequent min/max of changing data | O(n) per query | Heap/priority queue | O(1) peek, O(log n) insert/remove |
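
To make the "nested loop matching" row concrete, here is a small sketch of the same join written both ways. The record shapes (`orders` with a `customer_id`, `customers` with a unique `id`) and both function names are assumptions made up for illustration.

```python
def match_nested(orders, customers):
    """O(n*m): scans the customer list for every order."""
    matched = []
    for order in orders:
        for customer in customers:
            if customer["id"] == order["customer_id"]:
                matched.append((order["id"], customer["name"]))
                break
    return matched


def match_indexed(orders, customers):
    """O(n+m): build a dict index once, then do O(1) average-case lookups."""
    by_id = {customer["id"]: customer for customer in customers}
    matched = []
    for order in orders:
        customer = by_id.get(order["customer_id"])
        if customer is not None:
            matched.append((order["id"], customer["name"]))
    return matched
```

Both functions return the same pairs when customer ids are unique. The indexed version trades O(m) extra memory for the dict, which is usually a good exchange once either list grows beyond a few hundred items.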

### Caching Strategy Decision Matrix

```
Should you cache