homelab-cluster

TotalClaw 作者 mlesnews v1.0.0

管理家庭实验室的多层 AI 推理集群。健康监测、专家 MoE 路由、自动节点恢复以及跨 Ollama 和 llama.cpp 节点的模型部署。涵盖GPU 内存规划、大型模型的 Docker 卷策略、顺序启动模式避免 CUDA 死锁，并通过 LiteLLM 统一 API 网关。

安装 / 下载方式

TotalClaw CLI推荐

totalclaw install totalclaw:totalclaw~mlesnews-homelab-cluster

cURL直接下载，无需登录

curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~mlesnews-homelab-cluster/file -o mlesnews-homelab-cluster.md

## 概述（中文）

管理家庭实验室的多层 AI 推理集群。健康监测、专家 MoE 路由、
自动节点恢复以及跨 Ollama 和 llama.cpp 节点的模型部署。涵盖GPU
内存规划、大型模型的 Docker 卷策略、顺序启动模式
避免 CUDA 死锁，并通过 LiteLLM 统一 API 网关。

## 原文

# Homelab Cluster Management

Manage a compound AI compute cluster spanning multiple tiers of GPU and CPU inference nodes.
Built and battle-tested by [Lumina Homelab](https://luminahomelab.ai).

## When to Use

Use this skill when your agent needs to:
- Monitor health of distributed model endpoints
- Route inference requests to the best available model
- Recover downed nodes automatically
- Plan GPU memory allocation across models
- Deploy models across heterogeneous hardware

## Architecture Pattern

A homelab cluster typically spans 2-3 tiers:

| Tier | Typical Hardware | Runtime | Role |
|------|-----------------|---------|------|
| **Local** | Primary GPU (RTX 4090/5090) | Ollama | Fast inference, embeddings |
| **Remote** | Secondary GPU (RTX 3090/4090) | llama.cpp or Ollama | Distributed inference |
| **NAS/CPU** | Synology, RPi, any CPU node | Ollama | Lightweight models, fallback |

A **LiteLLM proxy** sits in front, providing a unified OpenAI-compatible API across all tiers.

## Health Monitoring

Check all endpoints with configurable per-endpoint timeouts:

```bash
# Define endpoints with tier labels
ENDPOINTS = {
    "local/ollama": {"url": "http://localhost:11434/api/tags", "tier": "LOCAL"},
    "remote/mark-i": {"url": "http://REMOTE_IP:3009/v1/models", "tier": "REMOTE", "timeout": 8},
    "gateway/litellm": {"url": "http://localhost:8080/health/liveliness", "tier": "GATEWAY"},
}

# For each endpoint: GET with timeout, check HTTP 200
# Classify: HEALTHY / DEGRADED / DOWN per tier
# Overall prognosis based on tier health
```

**Key lesson:** Use `/health/liveliness` for LiteLLM, not `/health` — the latter probes all model routes and hangs if any are unreachable.

## Expert MoE Routing

Route requests to the optimal model based on task classification:

```
Task Categories:
  code     → Coder model (Qwen2.5-Coder-7B or similar)
  reason   → Reasoning model (DeepSeek-R1-Distill or similar)
  chat     → General model (Qwen2.5-14B or similar)
  vision   → Vision model (Qwen2.5-VL or similar)
  fast     → Smallest available model for quick responses
  embed    → Embedding model (nomic-embed-text or similar)

Router logic:
  1. Classify task from prompt
  2. Check health of preferred model
  3. Fallback to next-best if unavailable
  4. Return model endpoint + metadata
```

## Docker Deployment (llama.cpp on Remote Nodes)

### Critical: Use Docker Volumes, Not Bind Mounts

For models larger than ~1.5GB on Windows Docker hosts:

```bash
# Create a Docker volume for model storage
docker volume create models-vol

# Copy models INTO the volume
docker run --rm -v models-vol:/models -v /host/path:/src alpine cp /src/model.gguf /models/

# Run container FROM volume (not bind mount)
docker run -d --gpus all -v models-vol:/models -p 3009:8000 \
  -e MODEL_PATH=/models/model.gguf your-llamacpp-image
```

**Why:** Windows bind mounts use gRPC-FUSE/9P bridge which hangs during GPU tensor loading for large files. Docker volumes use native Linux ext4 and bypass this entirely.

### Sequential Container Startup

Never start multiple GPU containers simultaneously:

```bash
# WRONG — causes CUDA initialization deadlock
docker start mark-i mark-iii mark-iv mark-vi &

# RIGHT — sequential with health check between each
for container in mark-v mark-iii mark-iv mark-vi mark-i; do
  docker restart $container
  sleep 5
  # Verify health before starting next
  curl -s http://localhost:PORT/v1/models || echo "Warning: $container slow to start"
done
```

### GPU Memory Planning

Plan your model lineup to fit within VRAM:

```
Example for 24GB GPU:
  14B model (Q4_K_M)  →  9.0 GB, 28 GPU layers
  7B coder            →  4.4 GB, full GPU
  8B reasoning        →  4.6 GB, full GPU
  1.5B fast coder     →  1.1 GB, full GPU
  1.7B fast chat      →  1.0 GB, full GPU
  ─────────────────────────────
  Total:               20.1 GB (~84% utilized)

  Remaining: CPU-only containers for 32B+ models
```

## Automatic Node Recovery

When a remote node goes down (Docker Desktop crash, reboot, etc.):

```
Recovery sequence:
  1. Health check fails for remote tier
  2. Check if SSH is responsive (node is up but Docker is down)
  3. If SSH works: restart Docker Desktop via SSH
  4. If SSH fails: create RDP session to wake the machine
  5. Wait for Docker + sequential container restart
  6. Re-check health
```

**Important:** Never store recovery credentials in plaintext. Use a vault (Azure Key Vault, HashiCorp Vault, etc.) and pipe secrets through stdin, never as CLI arguments.

## LiteLLM Gateway Configuration

Unified API across all tiers:

```yaml
model_list:
  # Local Ollama models
  - model_name: local/chat
    litellm_params:
      model: ollama/qwen2.5:32b
      api_base: http://localhost:11434

  # Remote llama.cpp models (need openai/ prefix)
  - model_name: remote/mark-i
    litellm_params:
      model: openai/qwen2.5-14b-instruct
      api_base: http://REMOTE_IP:3009/v1
      api_key: "not-needed"

  # NAS Ollama models
  - model_name: nas/coder
    litellm_params:
      model: ollama/qwen2.5-coder:7b
      api_base: http://NAS_IP:11434
```

**Key:** llama.cpp endpoints need the `openai/` prefix in model name and `/v1` in api_base for LiteLLM compatibility.

## Links

- **Lumina Homelab:** [luminahomelab.ai](https://luminahomelab.ai)
- **X/Twitter:** [@HK47LUMINA](https://x.com/HK47LUMINA)
- **GitHub:** [mlesnews](https://github.com/mlesnews)