# Performance Engineering System
A complete performance engineering system covering profiling, optimization, load testing, capacity planning, and performance culture. Use it when diagnosing slow applications, optimizing code, queries, or infrastructure, load testing before launch, planning capacity, or building performance into CI/CD. Covers Node.js, Python, Go, Java, databases, APIs, and frontend.
## Installation / Download
- TotalClaw CLI (recommended): `totalclaw install clawskills:1kalin~afrexai-performance-engineering`
- cURL (direct download, no login required): `curl -fsSL https://skills.taituai.com/api/skills/clawskills%3A1kalin~afrexai-performance-engineering/file -o afrexai-performance-engineering.md`
- Git repository (get the source): `git clone https://github.com/openclaw/skills/commit/426f1427c543a99b2de9d58f1730109273ab77a2`
> From "it's slow" to "here's why and here's the fix" — a complete methodology for measuring, diagnosing, optimizing, and preventing performance problems.
## Phase 1: Performance Investigation Brief
Before touching anything, define the problem.
```yaml
# performance-brief.yaml
investigation:
  reported_by: ""
  reported_date: ""
  system: ""                 # service/app name
  environment: ""            # production, staging, dev
problem_statement:
  symptom: ""                # "API response time increased 3x"
  impact: ""                 # "15% of users seeing timeouts"
  since_when: ""             # "After deploy v2.14 on Feb 20"
  affected_scope: ""         # "All endpoints" | "Only /search" | "Users in EU"
baselines:
  target_p50: ""             # e.g., "200ms"
  target_p95: ""             # e.g., "500ms"
  target_p99: ""             # e.g., "1000ms"
  current_p50: ""
  current_p95: ""
  current_p99: ""
  throughput_target: ""      # e.g., "1000 rps"
  error_rate_target: ""      # e.g., "<0.1%"
constraints:
  budget: ""                 # time/money for optimization
  risk_tolerance: ""         # "Can we change the schema?" "Can we add caching?"
  deadline: ""               # "Must fix before Black Friday"
hypothesis:
  primary: ""                # "N+1 queries in the new recommendation engine"
  secondary: ""              # "Connection pool exhaustion under load"
  evidence: ""               # "Slow query log shows 200+ queries per request"
```
### Performance Budget Framework
Set budgets BEFORE building, not after complaints:
| Metric | Web App | API | Mobile | Batch Job |
|--------|---------|-----|--------|-----------|
| P50 response | <200ms | <100ms | <300ms | N/A |
| P95 response | <500ms | <250ms | <800ms | N/A |
| P99 response | <1s | <500ms | <1.5s | N/A |
| Error rate | <0.1% | <0.01% | <0.5% | <0.001% |
| Time to Interactive | <3s | N/A | <2s | N/A |
| Memory per request | <50MB | <20MB | <100MB | <1GB |
| CPU per request | <100ms | <50ms | <200ms | N/A |
| Throughput | 100+ rps | 500+ rps | N/A | items/min |
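
A budget is only useful if something checks it. The sketch below compares measured latencies against the API column of the table. The names `API_BUDGET_MS`, `percentile`, and `check_budget` are illustrative and not part of any tool, and the nearest-rank percentile is one common definition among several.

```python
import math

# Hypothetical budget in milliseconds, copied from the API column above.
API_BUDGET_MS = {"p50": 100, "p95": 250, "p99": 500}


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def check_budget(latencies_ms, budget_ms):
    """Return {metric: (observed, limit, within_budget)} for each budgeted percentile."""
    report = {}
    for name, limit in budget_ms.items():
        observed = percentile(latencies_ms, float(name[1:]))  # "p95" -> 95.0
        report[name] = (observed, limit, observed < limit)
    return report


if __name__ == "__main__":
    samples = [40, 55, 60, 72, 80, 95, 110, 130, 240, 620]  # made-up latencies
    for metric, (observed, limit, ok) in check_budget(samples, API_BUDGET_MS).items():
        print(f"{metric}: {observed}ms (budget <{limit}ms) {'OK' if ok else 'OVER'}")
```

With only ten samples, one slow request decides both P95 and P99, so tail percentiles need far larger sample sets before they mean much.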
## Phase 2: Measurement & Profiling
### The Golden Rule
**Never optimize without measuring first. Never measure without a hypothesis.**
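
One way to apply the rule is to record the hypothesis next to the numbers that test it. This is a minimal sketch using the standard-library `timeit` module. The `Measurement` class and the `best_time` and `compare` helpers are assumptions made for illustration.

```python
import timeit
from dataclasses import dataclass


@dataclass
class Measurement:
    hypothesis: str      # what you expect to be slow, and why
    baseline_s: float    # best per-call time before the change
    candidate_s: float   # best per-call time after the change

    @property
    def speedup(self) -> float:
        return self.baseline_s / self.candidate_s


def best_time(fn, number=200, repeat=5):
    """Best per-call time in seconds; the minimum of several runs is the least noisy."""
    return min(timeit.repeat(fn, number=number, repeat=repeat)) / number


def compare(hypothesis, baseline, candidate):
    return Measurement(hypothesis, best_time(baseline), best_time(candidate))


if __name__ == "__main__":
    data = list(range(5_000))
    lookup = set(data)
    result = compare(
        "membership test is the hot spot; a set should beat a list scan",
        lambda: 4_999 in data,
        lambda: 4_999 in lookup,
    )
    print(f"{result.hypothesis}: {result.speedup:.1f}x")
```

If the measured speedup contradicts the hypothesis, that is a result too: discard the hypothesis instead of shipping the change.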
### Profiling Decision Tree
```
Is it slow?
├── YES → Where is time spent?
│   ├── CPU-bound → Profile CPU (flame graph)
│   │   ├── Hot function found → Optimize algorithm/data structure
│   │   └── Spread evenly → Architecture problem (too many layers)
│   ├── I/O-bound → Profile I/O
│   │   ├── Database → Query analysis (Phase 4)
│   │   ├── Network → Connection profiling
│   │   ├── Disk → I/O scheduler + buffering
│   │   └── External API → Caching + async + circuit breaker
│   ├── Memory-bound → Profile allocations
│   │   ├── GC pressure → Reduce allocations, pool objects
│   │   ├── Memory leak → Heap snapshot comparison
│   │   └── Cache thrashing → Resize or eviction policy
│   └── Concurrency-bound → Profile locks/contention
│       ├── Lock contention → Reduce critical section, lock-free structures
│       ├── Thread starvation → Pool sizing
│       └── Deadlock → Lock ordering analysis
└── NO → Define "fast enough" (see budgets above)
```
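
A quick way to choose the first branch of the tree is to compare CPU time with wall-clock time for the slow operation. The sketch below does this with the standard library. The `classify` helper and its 0.8 threshold are assumptions for illustration, not established cutoffs.

```python
import time


def classify(fn, cpu_ratio_threshold=0.8):
    """Run fn once and compare process CPU time to wall-clock time.

    A ratio near 1.0 means the process was computing almost the whole time.
    A ratio near 0.0 means it was mostly waiting (I/O, locks, or sleep).
    process_time() sums all threads, so multi-threaded code can exceed 1.0.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    ratio = cpu / wall if wall > 0 else 0.0
    label = "CPU-bound" if ratio >= cpu_ratio_threshold else "waiting (I/O, locks, or sleep)"
    return label, ratio


if __name__ == "__main__":
    print(classify(lambda: sum(i * i for i in range(2_000_000))))
    print(classify(lambda: time.sleep(0.2)))
```

This only separates computing from waiting. Telling I/O apart from lock contention or memory pressure still needs the targeted profilers in the sections that follow.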
### CPU Profiling by Language
#### Node.js
```bash
# Built-in profiler (V8)
node --prof app.js
node --prof-process isolate-*.log > profile.txt
# Inspector-based (connect Chrome DevTools)
node --inspect app.js
# Open chrome://inspect → Profiler → Start
# Clinic.js (best overall Node.js profiler)
npx clinic doctor -- node app.js
npx clinic flame -- node app.js # Flame graph
npx clinic bubbleprof -- node app.js # Async bottlenecks
# 0x (flame graphs)
npx 0x app.js
```
#### Python
```python
# cProfile (built-in)
import cProfile
import pstats
profiler = cProfile.Profile()
profiler.enable()
# ... code to profile ...
profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20) # Top 20
# Line profiler (pip install line-profiler)
# Add @profile decorator, then:
# kernprof -l -v script.py
# py-spy (sampling profiler, no code changes)
# pip install py-spy
# py-spy top --pid <PID>
# py-spy record -o profile.svg --pid <PID> # Flame graph
# Scalene (CPU + memory + GPU)
# pip install scalene
# scalene script.py
```
#### Go
```go
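// Built-in pprof
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/ handlers on http.DefaultServeMux
)

func main() {
	// HTTP server (add to existing server)
	// Access: http://localhost:6060/debug/pprof/
	go func() {
		log.Println(http.ListenAndServe(":6060", nil))
	}()
	// ... application code ...
	select {} // placeholder: block so the profiling endpoint stays up
}

// CLI analysis
// go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
// go tool pprof -http=:8080 profile.out # Web UI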
```
#### Java
```bash
# async-profiler (best for JVM)
# https://github.com/async-profiler/async-profiler
./asprof -d 30 -f profile.html <PID>
# JFR (built-in since JDK 11)
java -XX:StartFlightRecording=duration=60s,filename=rec.jfr MyApp
jfr print --events CPULoad rec.jfr
# jstack (thread dump)
jstack <PID> > threads.txt
```
### Memory Profiling
#### Leak Detection Pattern (any language)
```
1. Take heap snapshot at T0
2. Run suspected operation N times
3. Force GC
4. Take heap snapshot at T1
5. Compare: objects that grew = potential leak
6. Check: are they reachable? From where? (retention path)
```
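
Steps 1 through 5 can be sketched with Python's built-in `tracemalloc`. The `find_growth` helper and the deliberately leaky `leaky` function are hypothetical names used only for this example. Step 6 (the retention path) is not covered and needs a tool such as `objgraph`.

```python
import gc
import tracemalloc


def find_growth(operation, runs=100, top=5):
    """Snapshot, run `operation` repeatedly, force GC, snapshot again, and diff."""
    tracemalloc.start()
    try:
        gc.collect()
        before = tracemalloc.take_snapshot()        # step 1: snapshot at T0
        for _ in range(runs):                       # step 2: run N times
            operation()
        gc.collect()                                # step 3: force GC
        after = tracemalloc.take_snapshot()         # step 4: snapshot at T1
        diff = after.compare_to(before, "lineno")   # step 5: compare by source line
        return [d for d in diff if d.size_diff > 0][:top]
    finally:
        tracemalloc.stop()


_cache = []  # hypothetical leak: a module-level list that is never cleared


def leaky():
    _cache.append(bytearray(10_000))


if __name__ == "__main__":
    for stat in find_growth(leaky):
        print(stat)
```

Lines that still show growth after the forced collection are the candidates to trace back to whatever holds the reference.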
#### Node.js Memory
```javascript
// Heap snapshot
const v8 = require('v8');

function takeSnapshot(label) {
  // writeHeapSnapshot() is synchronous and returns the name of the file it wrote
  const file = v8.writeHeapSnapshot(`${label}-${Date.now()}.heapsnapshot`);
  console.log(`Heap snapshot written to ${file}`);
  return file;
}

// Process memory monitoring
setInterval(() => {
  const mem = process.memoryUsage();
  console.log({
    rss_mb: (mem.rss / 1048576).toFixed(1),
    heap_used_mb: (mem.heapUsed / 1048576).toFixed(1),
    heap_total_mb: (mem.heapTotal / 1048576).toFixed(1),
    external_mb: (mem.external / 1048576).toFixed(1),
  });
}, 10000);
```
#### Python Memory
```python
# tracemalloc (built-in)
import tracemalloc
tracemalloc.start()
# ... code ...
snapshot = tracemalloc.take_snapshot()
top = snapshot.statistics('lineno')
for stat in top[:10]:
    print(stat)
# objgraph (pip install objgraph)
import objgraph
objgraph.show_most_common_types(limit=20)
objgraph.show_growth(limit=10) # Call twice to see what's growing
```
### Flame Graph Interpretation
```
Reading a flame graph:
┌──────────┐
│ readline │                                    ← Top = where CPU burns
├──────────┼───────────┐
│ parseCSV │ validate  │                        ← Tall = deep call stack
├──────────┴───────────┼──────────────────────┐
│ processData()        │ renderOutput()       │ ← Width = time spent
├──────────────────────┴──────────────────────┤
│ main()                                      │ ← Entry point (bottom)
└─────────────────────────────────────────────┘
WHAT TO LOOK FOR:
1. Wide plateaus at top → CPU-intensive leaf function (optimize this!)
2. Many thin towers → excessive function calls (batch or reduce)
3. Recursive patterns → potential stack overflow risk
4. Unexpected width → function taking more time than expected
5. GC/runtime frames → memory pressure
ACTION RULES:
- Plateau >20% width → must investigate
- Plateau >40% width → almost certainly the bottleneck
- If top 3 functions = 80% of time → focused optimization will work
- If evenly distributed → architectural change needed
```
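
The action rules above can be applied mechanically once a profiler has reported each function's share of total samples. This is a sketch under that assumption. The `triage` function and its input format are illustrative, and treating "no function above 20%" as "evenly distributed" is one interpretation of the last rule.

```python
def triage(self_share):
    """Apply the action rules to {function_name: fraction_of_total_samples}."""
    ranked = sorted(self_share.items(), key=lambda kv: kv[1], reverse=True)

    findings = []
    for name, share in ranked:
        if share > 0.40:
            findings.append((name, "almost certainly the bottleneck"))
        elif share > 0.20:
            findings.append((name, "must investigate"))

    top3 = sum(share for _, share in ranked[:3])
    if top3 >= 0.80:
        strategy = "focused optimization"
    elif not findings:  # nothing above 20%: time is spread evenly
        strategy = "architectural change"
    else:
        strategy = "investigate flagged plateaus"
    return findings, strategy


if __name__ == "__main__":
    profile = {"parseCSV": 0.45, "validate": 0.25, "renderOutput": 0.15, "other": 0.15}
    print(triage(profile))
```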
## Phase 3: Common Optimization Patterns
### Algorithm & Data Structure Optimizations
| Problem | Bad O() | Fix | Good O() |
|---------|---------|-----|----------|
| Repeated search of unsorted array | O(n) per lookup | Sort once + binary search, or build a Set/Map | O(log n) or O(1) per lookup |
| Nested loop matching | O(n²) | Hash map lookup | O(n) |
| Repeated string concat | O(n²) | StringBuilder/join array | O(n) |
| Sorting already-sorted data | O(n log n) | Check if sorted first | O(n) |
| Finding duplicates | O(n²) | Set-based detection | O(n) |
| Frequent min/max of changing data | O(n) per query | Heap/priority queue | O(log n) |
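
Two rows of the table, nested-loop matching and duplicate detection, are sketched below in Python. The record shapes (`id`, `customer_id`, `name`) are made up for the example, and the hashed version assumes customer ids are unique.

```python
def match_nested(orders, customers):
    """O(n*m): scans every customer for every order."""
    return [
        (o["id"], c["name"])
        for o in orders
        for c in customers
        if o["customer_id"] == c["id"]
    ]


def match_hashed(orders, customers):
    """O(n+m): build an index once, then do an average O(1) lookup per order."""
    by_id = {c["id"]: c for c in customers}
    return [
        (o["id"], by_id[o["customer_id"]]["name"])
        for o in orders
        if o["customer_id"] in by_id
    ]


def find_duplicates(items):
    """O(n): one pass with a set instead of comparing every pair."""
    seen, dupes = set(), set()
    for item in items:
        if item in seen:
            dupes.add(item)
        else:
            seen.add(item)
    return dupes


if __name__ == "__main__":
    customers = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Lin"}]
    orders = [{"id": 10, "customer_id": 2}, {"id": 11, "customer_id": 1}]
    assert match_nested(orders, customers) == match_hashed(orders, customers)
    print(match_hashed(orders, customers), find_duplicates([1, 2, 2, 3, 3, 3]))
```

The hashed versions trade memory for time. For very small inputs the nested loop can still be faster, so measure before switching.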
### Caching Strategy Decision Matrix
```
Should you cache