I set up an inference server so I can hit my own open-weight models from my laptop anywhere, with nothing exposed to the public internet. Sharing in case it's useful to others, and to hear how people are doing this differently.
The request path:
Client (laptop on Tailscale) → Tailscale Aperture (AI gateway — auth + routes by model name) → llama-swap → vLLM → GPU
What I like about it:
- Access runs over Tailscale, so it's end-to-end encrypted and gated by OAuth. No open ports and no reverse proxy to babysit.
- llama-swap loads models on demand: if the requested model isn't running, it starts a vLLM child process, and if a model sits idle for ~5 min, it kills it to free VRAM. Useful when juggling models on one box.
- vLLM handles inference (currently Qwen3.6 27B).
I can also just SSH in to work directly on the GPU — adding models, fine-tuning, and so on.
I set up an inference server so I can hit my own open-weight models from my laptop anywhere, with nothing exposed to the public internet. Sharing in case it's useful to others, and to hear how people are doing this differently.
The request path: Client (laptop on Tailscale) → Tailscale Aperture (AI gateway — auth + routes by model name) → llama-swap → vLLM → GPU
What I like about it: - Access runs over Tailscale, so it's end-to-end encrypted and gated by OAuth. No open ports and no reverse proxy to babysit.
- llama-swap loads models on demand: if the requested model isn't running, it starts a vLLM child process, and if a model sits idle for ~5 min, it kills it to free VRAM. Useful when juggling models on one box.
- vLLM handles inference (currently Qwen3.6 27B).
I can also just SSH in to work directly on the GPU — adding models, fine-tuning, and so on.