Context
A 15-container private AI platform produces a lot of telemetry: Claude Code subscription usage, GPU pressure on the inference container, queue depth on the content pipelines, build status across the in-house projects, certificate expiry dates, and network reachability across the two Proxmox nodes. Most of it sits in journalctl logs nobody reads.
The dashboard is what gets opened first thing in the morning. It exists so that the answer to “is anything broken?” is a glance instead of an SSH session.
Brief
- Single page, single screen, no scroll on a 1080p monitor.
- Sub-200ms data refresh: nothing async, nothing waiting on a slow API.
- Live Claude Code usage tracking (read from the local CLI cache).
- Per-container health (CPU, memory, last restart).
- Project queue plans (current pipeline state across FB-Media + ContentForge).
- Auth: single user (me). Basic Auth with a bcrypt-hashed password was the right fit for the threat model (a minimal sketch follows this list).
- Hosted on the platform itself, not a SaaS; the dashboard cannot depend on the thing it monitors.
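For the Basic Auth item above, a minimal sketch of what the check can look like as a Fastify onRequest hook, assuming the bcrypt hash lives in an environment variable. The variable names and the user name are illustrative, not taken from the actual codebase:

```ts
import Fastify from "fastify";
import bcrypt from "bcrypt";

const app = Fastify();

// Single-user Basic Auth: compare the supplied password against a bcrypt hash.
// DASH_USER / DASH_PASS_HASH are hypothetical env var names for this sketch.
app.addHook("onRequest", async (req, reply) => {
  const header = req.headers.authorization ?? "";
  if (!header.startsWith("Basic ")) {
    return reply.code(401).header("www-authenticate", "Basic").send();
  }
  const [user, pass] = Buffer.from(header.slice(6), "base64")
    .toString("utf8")
    .split(":");
  const ok =
    user === process.env.DASH_USER &&
    (await bcrypt.compare(pass ?? "", process.env.DASH_PASS_HASH ?? ""));
  if (!ok) {
    return reply.code(401).header("www-authenticate", "Basic").send();
  }
});

app.get("/api/health", async () => ({ ok: true }));

app.listen({ port: 3000, host: "0.0.0.0" });
```

For a single user on a private LAN this is about as much auth machinery as the threat model justifies.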
Architecture
Vite + React frontend. Fastify backend on Node, with SQLite for the small bit of state (alert thresholds, snoozed warnings). PM2 keeps the Node process alive across reboots; the whole thing runs in CT 208 on the services-tier Proxmox node.
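A minimal sketch of that backend shape, assuming better-sqlite3 for the state store; the library choice, file path, and table/column names are assumptions for illustration, since the write-up only specifies Fastify + SQLite:

```ts
import Fastify from "fastify";
import Database from "better-sqlite3";

// Small, synchronous state store: alert thresholds and snoozed warnings.
const db = new Database("/opt/dashboard/state.db");
db.exec(`
  CREATE TABLE IF NOT EXISTS thresholds (
    metric    TEXT PRIMARY KEY,   -- e.g. 'cpu', 'memory'
    limit_pct REAL NOT NULL,
    snoozed   INTEGER NOT NULL DEFAULT 0
  )
`);

const app = Fastify();

// Read thresholds for the alert cards; better-sqlite3 is synchronous,
// which keeps the request path simple and well under the 200ms budget.
app.get("/api/thresholds", async () => {
  return db.prepare("SELECT metric, limit_pct, snoozed FROM thresholds").all();
});

// Snooze a noisy warning without deleting the underlying threshold.
app.post<{ Params: { metric: string } }>(
  "/api/thresholds/:metric/snooze",
  async (req) => {
    db.prepare("UPDATE thresholds SET snoozed = 1 WHERE metric = ?").run(
      req.params.metric
    );
    return { ok: true };
  }
);

app.listen({ port: 3000, host: "0.0.0.0" });
```

PM2 then just points at the built entry (pm2 start dist/server.js) so the process comes back on its own after a container reboot.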
The cleverest piece is the Gaming PC sidecar. The Claude Code CLI’s usage data lives on a Windows WSL2 host that the LXC containers can’t see directly. A small Node sidecar (the claude-sidecar systemd service, port 11436) on the Gaming PC reads the local CLI cache and serves a JSON endpoint over the LAN; the dashboard polls it every 5 seconds.
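A sketch of what a sidecar like that can look like, assuming the usage data is a JSON file in the CLI’s local cache. The cache path and route name are assumptions; only the port and the 5-second polling interval come from the setup above:

```ts
import { createServer } from "node:http";
import { readFile } from "node:fs/promises";
import { homedir } from "node:os";
import { join } from "node:path";

// Hypothetical location of the Claude Code CLI usage cache inside WSL2.
const CACHE_FILE = join(homedir(), ".claude", "usage.json");

const server = createServer(async (req, res) => {
  if (req.url !== "/usage") {
    res.writeHead(404).end();
    return;
  }
  try {
    // Re-read on every request; the dashboard only asks every 5 seconds.
    const raw = await readFile(CACHE_FILE, "utf8");
    res.writeHead(200, { "content-type": "application/json" });
    res.end(raw);
  } catch {
    res.writeHead(503, { "content-type": "application/json" });
    res.end(JSON.stringify({ error: "cache unavailable" }));
  }
});

// Bind to all interfaces inside WSL2; the Windows portproxy described below
// forwards LAN traffic on 11436 into this address.
server.listen(11436, "0.0.0.0");
```

On the dashboard side this is nothing more than a 5-second polling fetch against the Gaming PC’s LAN address.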
A Windows portproxy bridges the LAN port to the WSL2 internal address. The workaround took an afternoon to land, but it means the dashboard sees real Claude usage in real time, not a guess based on log scraping.
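For reference, one way this kind of bridge is set up is a netsh portproxy rule on the Windows host. The WSL2 address below is a placeholder, and it changes across reboots, so in practice the rule needs refreshing:

```powershell
# Find the current WSL2 address from Windows.
wsl hostname -I

# Forward LAN traffic on 11436 to the sidecar inside WSL2
# (172.20.0.2 is a placeholder for the address printed above).
netsh interface portproxy add v4tov4 `
  listenaddress=0.0.0.0 listenport=11436 `
  connectaddress=172.20.0.2 connectport=11436

# Allow the port through Windows Defender Firewall.
netsh advfirewall firewall add rule name="claude-sidecar" dir=in action=allow protocol=TCP localport=11436
```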
Outcomes
- 15 LXC containers monitored across two Proxmox nodes.
- Sub-200ms refresh for the live cards (Claude usage, GPU, queue depth).
- Single-screen layout, everything visible at 1080p without scrolling.
- Zero downtime since launch; the dashboard ran through every kernel-upgrade reboot cycle.
- Operator overhead: the morning check-in went from 5 minutes of pct list + nvidia-smi + journalctl to a 10-second glance.
Screens
[FILL: replace with screenshots of the live dashboard, main screen + Claude usage card + GPU pressure card + queue plans card. Anonymize any container hostnames if needed.]
What’s next
Two items on the next-iteration list:
- Push the WSL2 sidecar logic into a proper service account on the Gaming PC instead of running it under the developer login. The SSH chain that grants the dashboard read access to the CLI cache is currently a single point of failure if the developer login expires.
- Per-container alert rules: current alerting is threshold-only (CPU > 90%, memory > 95%). A small rules engine (a Prometheus Alertmanager-lite) would let me tag noisy alerts as snoozed without losing the underlying signal; a rough sketch follows below.
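Nothing is built yet, but the shape is small enough to sketch: a rule is a predicate over a container sample plus a snooze flag, so snoozing mutes what the dashboard shows without dropping the evaluation. All names and the sample shape here are invented for illustration:

```ts
// Invented types for the sketch; the real sample shape would come from the
// per-container health cards (CPU, memory, last restart).
interface ContainerSample {
  ct: number;      // container ID, e.g. 208
  cpuPct: number;
  memPct: number;
}

interface AlertRule {
  id: string;
  ct?: number;                            // undefined = applies to all containers
  test: (s: ContainerSample) => boolean;
  snoozedUntil?: number;                  // epoch ms; evaluation still runs while snoozed
}

const rules: AlertRule[] = [
  { id: "cpu-high", test: (s) => s.cpuPct > 90 },
  { id: "mem-high", test: (s) => s.memPct > 95 },
];

// Evaluate everything, keep the signal, but only surface un-snoozed firings.
function evaluate(samples: ContainerSample[], now = Date.now()) {
  const fired = rules.flatMap((rule) =>
    samples
      .filter((s) => (rule.ct === undefined || rule.ct === s.ct) && rule.test(s))
      .map((s) => ({ rule: rule.id, ct: s.ct, snoozed: (rule.snoozedUntil ?? 0) > now }))
  );
  return {
    visible: fired.filter((f) => !f.snoozed), // what the dashboard shows
    all: fired,                               // what gets logged and persisted
  };
}
```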