Show HN: LLMKube – Kubernetes for Local LLMs with GPU Acceleration

Hi HN! I built LLMKube, a Kubernetes operator for deploying GPU-accelerated LLMs in production. One command gets you from zero to inference with full observability.

Why this exists: Regulated industries (healthcare, defense, finance) need air-gapped LLM deployments, but existing tools are either single-node only (Ollama) or lack GPU optimization and SLO enforcement. LLMKube bridges the gap.

What's working:

- ~14x speedup with NVIDIA GPUs (64 tok/s on Llama 3.2 3B vs 4.6 tok/s on CPU)

- One command: llmkube deploy llama-3b --gpu (auto CUDA setup, GPU scheduling, layer offloading; end-to-end sketch after this list)

- Production observability: Prometheus + Grafana + DCGM GPU metrics out of the box (sample query after this list)

- OpenAI-compatible API endpoints

- Terraform configs for GKE GPU clusters with auto-scale to zero
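
A minimal end-to-end sketch of the flow. The service name, port, and model id below are placeholders, not verified defaults; the /v1/chat/completions route is the standard OpenAI one, so existing OpenAI SDKs should also work by pointing their base URL at the service:

    # Deploy a GPU-accelerated model (the command from the list above)
    llmkube deploy llama-3b --gpu

    # Reach the inference service locally (service name/port assumed)
    kubectl port-forward svc/llama-3b 8080:8080

    # In another terminal: hit the OpenAI-compatible endpoint
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "llama-3b",
        "messages": [{"role": "user", "content": "Say hello in five words."}]
      }'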
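
On the observability side, GPU telemetry comes from dcgm-exporter, so checking utilization is one query against the Prometheus HTTP API. DCGM_FI_DEV_GPU_UTIL is a standard dcgm-exporter metric; the Prometheus service name and namespace here are assumptions:

    # Forward the Prometheus API (service name/namespace assumed)
    kubectl -n monitoring port-forward svc/prometheus 9090:9090

    # Current GPU utilization per device, as scraped from dcgm-exporter
    curl -s 'http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL'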

Tech: Kubernetes CRDs, llama.cpp with CUDA, NVIDIA GPU Operator, and cost-optimized spot instances (~$50-150/mo for dev workloads).
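
The CLI sits on top of the CRDs; roughly, a deploy maps to a custom resource shaped like the sketch below. The apiVersion, kind, and field names are illustrative only; the real schema lives in the repo:

    # Illustrative manifest, not the actual LLMKube schema
    kubectl apply -f - <<'EOF'
    apiVersion: llmkube.example/v1alpha1
    kind: Model
    metadata:
      name: llama-3b
    spec:
      model: llama-3.2-3b   # which weights to pull
      gpu:
        enabled: true       # schedule onto a GPU node
        layers: all         # offload all layers via llama.cpp's CUDA backend
    EOF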

Status: v0.2.0 is production-ready for single-GPU deployments on standard K8s clusters. Multi-GPU and multi-node model sharding are on the roadmap.

Apache 2.0 licensed. Would love feedback from anyone running LLMs in production!

Website: https://llmkube.com

GitHub: https://github.com/Defilan/LLMKube