homelab-cluster
管理家庭实验室的多层 AI 推理集群。健康监测、专家 MoE 路由、 自动节点恢复以及跨 Ollama 和 llama.cpp 节点的模型部署。涵盖GPU 内存规划、大型模型的 Docker 卷策略、顺序启动模式 避免 CUDA 死锁,并通过 LiteLLM 统一 API 网关。
安装 / 下载方式
TotalClaw CLI推荐
totalclaw install totalclaw:totalclaw~mlesnews-homelab-clustercURL直接下载,无需登录
curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~mlesnews-homelab-cluster/file -o mlesnews-homelab-cluster.md## 概述(中文)
管理家庭实验室的多层 AI 推理集群。健康监测、专家 MoE 路由、
自动节点恢复以及跨 Ollama 和 llama.cpp 节点的模型部署。涵盖GPU
内存规划、大型模型的 Docker 卷策略、顺序启动模式
避免 CUDA 死锁,并通过 LiteLLM 统一 API 网关。
## 原文
# Homelab Cluster Management
Manage a compound AI compute cluster spanning multiple tiers of GPU and CPU inference nodes.
Built and battle-tested by [Lumina Homelab](https://luminahomelab.ai).
## When to Use
Use this skill when your agent needs to:
- Monitor health of distributed model endpoints
- Route inference requests to the best available model
- Recover downed nodes automatically
- Plan GPU memory allocation across models
- Deploy models across heterogeneous hardware
## Architecture Pattern
A homelab cluster typically spans 2-3 tiers:
| Tier | Typical Hardware | Runtime | Role |
|------|-----------------|---------|------|
| **Local** | Primary GPU (RTX 4090/5090) | Ollama | Fast inference, embeddings |
| **Remote** | Secondary GPU (RTX 3090/4090) | llama.cpp or Ollama | Distributed inference |
| **NAS/CPU** | Synology, RPi, any CPU node | Ollama | Lightweight models, fallback |
A **LiteLLM proxy** sits in front, providing a unified OpenAI-compatible API across all tiers.
## Health Monitoring
Check all endpoints with configurable per-endpoint timeouts:
```bash
# Define endpoints with tier labels
ENDPOINTS = {
"local/ollama": {"url": "http://localhost:11434/api/tags", "tier": "LOCAL"},
"remote/mark-i": {"url": "http://REMOTE_IP:3009/v1/models", "tier": "REMOTE", "timeout": 8},
"gateway/litellm": {"url": "http://localhost:8080/health/liveliness", "tier": "GATEWAY"},
}
# For each endpoint: GET with timeout, check HTTP 200
# Classify: HEALTHY / DEGRADED / DOWN per tier
# Overall prognosis based on tier health
```
**Key lesson:** Use `/health/liveliness` for LiteLLM, not `/health` — the latter probes all model routes and hangs if any are unreachable.
## Expert MoE Routing
Route requests to the optimal model based on task classification:
```
Task Categories:
code → Coder model (Qwen2.5-Coder-7B or similar)
reason → Reasoning model (DeepSeek-R1-Distill or similar)
chat → General model (Qwen2.5-14B or similar)
vision → Vision model (Qwen2.5-VL or similar)
fast → Smallest available model for quick responses
embed → Embedding model (nomic-embed-text or similar)
Router logic:
1. Classify task from prompt
2. Check health of preferred model
3. Fallback to next-best if unavailable
4. Return model endpoint + metadata
```
## Docker Deployment (llama.cpp on Remote Nodes)
### Critical: Use Docker Volumes, Not Bind Mounts
For models larger than ~1.5GB on Windows Docker hosts:
```bash
# Create a Docker volume for model storage
docker volume create models-vol
# Copy models INTO the volume
docker run --rm -v models-vol:/models -v /host/path:/src alpine cp /src/model.gguf /models/
# Run container FROM volume (not bind mount)
docker run -d --gpus all -v models-vol:/models -p 3009:8000 \
-e MODEL_PATH=/models/model.gguf your-llamacpp-image
```
**Why:** Windows bind mounts use gRPC-FUSE/9P bridge which hangs during GPU tensor loading for large files. Docker volumes use native Linux ext4 and bypass this entirely.
### Sequential Container Startup
Never start multiple GPU containers simultaneously:
```bash
# WRONG — causes CUDA initialization deadlock
docker start mark-i mark-iii mark-iv mark-vi &
# RIGHT — sequential with health check between each
for container in mark-v mark-iii mark-iv mark-vi mark-i; do
docker restart $container
sleep 5
# Verify health before starting next
curl -s http://localhost:PORT/v1/models || echo "Warning: $container slow to start"
done
```
### GPU Memory Planning
Plan your model lineup to fit within VRAM:
```
Example for 24GB GPU:
14B model (Q4_K_M) → 9.0 GB, 28 GPU layers
7B coder → 4.4 GB, full GPU
8B reasoning → 4.6 GB, full GPU
1.5B fast coder → 1.1 GB, full GPU
1.7B fast chat → 1.0 GB, full GPU
─────────────────────────────
Total: 20.1 GB (~84% utilized)
Remaining: CPU-only containers for 32B+ models
```
## Automatic Node Recovery
When a remote node goes down (Docker Desktop crash, reboot, etc.):
```
Recovery sequence:
1. Health check fails for remote tier
2. Check if SSH is responsive (node is up but Docker is down)
3. If SSH works: restart Docker Desktop via SSH
4. If SSH fails: create RDP session to wake the machine
5. Wait for Docker + sequential container restart
6. Re-check health
```
**Important:** Never store recovery credentials in plaintext. Use a vault (Azure Key Vault, HashiCorp Vault, etc.) and pipe secrets through stdin, never as CLI arguments.
## LiteLLM Gateway Configuration
Unified API across all tiers:
```yaml
model_list:
# Local Ollama models
- model_name: local/chat
litellm_params:
model: ollama/qwen2.5:32b
api_base: http://localhost:11434
# Remote llama.cpp models (need openai/ prefix)
- model_name: remote/mark-i
litellm_params:
model: openai/qwen2.5-14b-instruct
api_base: http://REMOTE_IP:3009/v1
api_key: "not-needed"
# NAS Ollama models
- model_name: nas/coder
litellm_params:
model: ollama/qwen2.5-coder:7b
api_base: http://NAS_IP:11434
```
**Key:** llama.cpp endpoints need the `openai/` prefix in model name and `/v1` in api_base for LiteLLM compatibility.
## Links
- **Lumina Homelab:** [luminahomelab.ai](https://luminahomelab.ai)
- **X/Twitter:** [@HK47LUMINA](https://x.com/HK47LUMINA)
- **GitHub:** [mlesnews](https://github.com/mlesnews)