LLMhop
One port, many models: A tiny, stateless HTTP router for OpenAI-compatible LLM inference backends.
LLMhop peeks at the model field of an incoming OpenAI-compatible request and reverse-proxies it to the matching backend.
It is primarily designed for single-model inference servers like vLLM and sglang, which serve one model per process and need a thin, model-aware gateway in front of them. It also works with any other OpenAI-compatible backend (including multi-model servers and hosted providers) whenever you want to consolidate several upstreams behind a single endpoint.
Features
- OpenAI-compatible reverse proxy, model router and request dispatcher for self-hosted LLM inference.
- Stateless single-binary HTTP service: no database, no cache, no background workers, safe behind any load balancer.
- Zero external dependencies: pure Go, no third-party packages, no CGO.
- Works with any OpenAI API-compatible backend, self-hosted or remote: vLLM, sglang, TabbyAPI, Aphrodite, Ollama, LocalAI, OpenRouter, together.ai, DeepInfra, etc.
- Ships as a static binary, a minimal Docker image and a hardened NixOS module that can optionally spin up llama.cpp, sglang or vLLM workers alongside the router.
How it works
- Client sends a request with a JSON body containing {"model": "..."}.
- LLMhop reads the model field and looks it up in its config.
- The request is forwarded verbatim to the configured backend URL.
- Unknown models return 404.
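The routing steps above can be sketched in a few lines of Go. This is a minimal illustration, not LLMhop's actual source: the names routes, route and handle are hypothetical, the backend URL is taken from the sample configuration, and answering malformed JSON with 400 is an assumption.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// routes maps a model name to a backend base URL (hypothetical values).
var routes = map[string]string{
	"llama-3-8b": "http://localhost:30000",
}

// route peeks at the "model" field and returns the backend URL together
// with the HTTP status the router would answer with if it cannot forward.
func route(body []byte) (string, int) {
	var peek struct {
		Model string `json:"model"`
	}
	if err := json.Unmarshal(body, &peek); err != nil {
		return "", http.StatusBadRequest // assumption: malformed JSON yields 400
	}
	backend, ok := routes[peek.Model]
	if !ok {
		return "", http.StatusNotFound // unknown model
	}
	return backend, http.StatusOK
}

// handle buffers the body, picks the backend and reverse-proxies the request.
func handle(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "cannot read body", http.StatusBadRequest)
		return
	}
	backend, status := route(body)
	if status != http.StatusOK {
		http.Error(w, http.StatusText(status), status)
		return
	}
	target, err := url.Parse(backend)
	if err != nil {
		http.Error(w, "bad backend URL", http.StatusInternalServerError)
		return
	}
	// Restore the consumed body so the backend receives it unchanged.
	r.Body = io.NopCloser(bytes.NewReader(body))
	httputil.NewSingleHostReverseProxy(target).ServeHTTP(w, r)
}

func main() {
	// To serve for real: http.ListenAndServe(":8080", http.HandlerFunc(handle))
	for _, body := range []string{`{"model":"llama-3-8b"}`, `{"model":"nope"}`} {
		backend, status := route([]byte(body))
		fmt.Println(status, backend)
	}
}
```

Because the body has to be read before the backend is known, the sketch buffers it and then hands a fresh reader to the proxy.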
Authentication
LLMhop can optionally gate incoming requests with a list of bearer tokens and inject per-model Authorization (or any other) headers when forwarding to the backend.
Both sides are opt-in: leave authTokens and models.*.headers unset and headers are forwarded verbatim.
When authTokens is set, the router validates the incoming Authorization: Bearer <token> header (constant-time compare) and then strips it before forwarding, so the client-facing token never leaks upstream.
Per-model headers are applied last, so a configured Authorization always wins over whatever the client sent.
Configuration
Create a config.json:
{
"listen": ":8080",
"authTokens": ["${file:client_token}"],
"models": {
"llama-3-8b": {
"url": "http://localhost:30000"
},
"openai-gpt-4o": {
"url": "https://api.openai.com",
"headers": {
"Authorization": "Bearer ${env:OPENAI_KEY}"
}
}
}
}
Secret references
String values inside authTokens and models.*.headers are expanded at startup, so no plaintext secret ever has to live in the config file:
- ${env:NAME}: read from the NAME environment variable.
- ${file:path}: read from a file. Relative paths are resolved against $CREDENTIALS_DIRECTORY when set (e.g. when launched by systemd with LoadCredential=), otherwise against the current working directory. A single trailing newline is trimmed.
- $NAME: shorthand for ${env:NAME}.
Unresolved references are a hard startup error.
Request size limit
LLMhop buffers each request body in memory so it can peek at the model field before forwarding.
To keep a single request from exhausting memory, the body is capped at 100 MiB by default; bodies beyond the cap are rejected with 413 Request Entity Too Large.
Override it when vision or other multimodal payloads need more, for example to raise the cap to 500 MiB (524288000 bytes):
{ "maxBodyBytes": 524288000 }
Running
# native
llmhop --config config.json
# nix
nix run github:mirkolenz/llmhop -- --config config.json
# docker
docker run --rm -p 8080:8080 -v ./config.json:/config.json ghcr.io/mirkolenz/llmhop --config /config.json
NixOS module
A hardened systemd service is provided out of the box. Add LLMhop to your flake inputs and import the module into your system configuration:
{
inputs = {
nixpkgs.url = "github:nixos/nixpkgs/nixos-unstable";
llmhop = {
url = "github:mirkolenz/llmhop";
inputs.nixpkgs.follows = "nixpkgs";
};
};
outputs =
{ nixpkgs, llmhop, ... }:
{
nixosConfigurations.myhost = nixpkgs.lib.nixosSystem {
system = "x86_64-linux";
modules = [
llmhop.nixosModules.default
{
services.llmhop = {
enable = true;
settings = {
listen = ":8080";
models = {
"llama-3-8b".url = "http://localhost:30000";
"qwen-2.5-7b".url = "http://localhost:30001";
};
};
};
}
];
};
};
}
The unit runs under DynamicUser with aggressive sandboxing (ProtectSystem, PrivateTmp, restricted syscalls and address families, no new privileges, …) and restarts on failure.
Inference backends
The module can also run the inference servers themselves, so you don’t have to wire up llama.cpp, sglang or vLLM by hand.
Each backend exposes a models attrset under services.llmhop.<backend> and every entry becomes one isolated worker bound to a loopback port, with the matching route registered automatically with llmhop.
All three backends can be enabled side by side and mixed freely in the same configuration.
llama.cpp runs as a native, hardened systemd system unit under DynamicUser.
sglang and vLLM are launched as rootless Podman containers through quadlet-nix.
Each Quadlet backend gets a dedicated, lingering system user (sglang, vllm) that owns its cache directory, sub-UID range and rootless container store.
The container units are installed under that user’s per-UID search path and therefore run as systemd user units, not system units.
This is a deliberate workaround for NVIDIA/nvidia-container-toolkit#648:
nvidia-cdi-hook runs as an OCI createContainer hook inside the container’s user namespace and fails to read the OCI bundle’s config.json whenever Podman uses a UID-mapped namespace (e.g., --userns auto or --userns nomap), which is the mode you end up in when systemd’s system manager launches a rootless container.
Running each Quadlet unit under a real, lingering system user’s systemd instance keeps Podman in the keep-id-style mapping where the CDI hook can read the bundle and the GPU is correctly exposed.
No worker ever runs as root.
For convenience, the module injects a tiny per-backend helper into environment.systemPackages whenever the backend’s default user is used:
- llama-cpp workers are plain system units, so they are managed with the usual systemctl status llama-cpp-<model> and journalctl -u llama-cpp-<model>.
- sglang-shell and vllm-shell are writeShellApplication wrappers around machinectl shell that drop you into the backend user’s session, where systemctl --user, journalctl --user and podman ps see the worker units directly. Run them with no arguments for an interactive shell, or pass a command to execute it inside the session.
services.llmhop = {
enable = true;
llama-cpp = {
enable = true;
models."qwen3-8b" = {
port = 18001;
settings.hf-repo = "unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL";
};
};
sglang = {
enable = true;
models."qwen3-coder" = {
port = 19001;
model = "Qwen/Qwen3-8B";
settings.reasoning-parser = "qwen3";
};
};
vllm = {
enable = true;
models."llama-3-8b" = {
port = 20001;
model = "meta-llama/Meta-Llama-3-8B-Instruct";
};
};
};
See the options reference for the full list of per-backend options.
Secrets
The generated config file lives in the world-readable Nix store, so secrets should never be placed in services.llmhop.settings directly.
Instead, reference them via ${file:...} and hand the files to the service with systemd’s LoadCredential=.
The right-hand side of each LoadCredential entry is just a file path, so anything that produces a file works: agenix or sops-nix outputs, a manually-managed file under /etc/llmhop/, or a path emitted by your own secret-provisioning tool.
services.llmhop.settings = {
authTokens = [ "\${file:client_token}" ];
models."openai-gpt-4o" = {
url = "https://api.openai.com";
headers.Authorization = "Bearer \${env:OPENAI_KEY}";
};
};
systemd.services.llmhop.serviceConfig = {
LoadCredential = [ "client_token:/etc/llmhop/client-token" ];
EnvironmentFile = [ "/etc/llmhop/openai.env" ];
};
/etc/llmhop/openai.env is a plain KEY=VALUE file:
OPENAI_KEY=sk-...
${file:...} references are resolved against $CREDENTIALS_DIRECTORY, which systemd exposes as a per-unit tmpfs accessible only to this service, compatible with DynamicUser and the rest of the sandbox.
${env:...} picks up anything the unit inherits, typically via EnvironmentFile=.
Pick whichever matches how your secret tooling hands you the data; mixing both in one config is fine.
Core
services.llmhop.enable
Whether to enable the llmhop reverse proxy.
Type: boolean
Default:
false
Example:
true
services.llmhop.package
The llmhop package to use.
Type: package
Default:
pkgs.callPackage ./package.nix { }
services.llmhop.settings
Configuration written to the JSON config file passed to llmhop.
See the upstream Config struct for available fields.
Type: JSON value
Default:
{ }
Example:
{
listen = ":8080";
models = {
gpt-4 = {
url = "https://api.openai.com";
};
};
}
llama-cpp
services.llmhop.llama-cpp.enable
Whether to enable llama.cpp model serving via systemd, fronted by llmhop.
Type: boolean
Default:
false
Example:
true
services.llmhop.llama-cpp.package
The llama-cpp package to use.
Type: package
Default:
pkgs.llama-cpp
services.llmhop.llama-cpp.environment
Environment variables set on every model service.
Merged with services.llmhop.llama-cpp.models.<name>.environment; per-model
entries take precedence.
Type: attribute set of string
Default:
{ }
services.llmhop.llama-cpp.environmentFile
File in KEY=VALUE format forwarded to every service.
Use for secrets managed by sops-nix/agenix, e.g. a file containing
HF_TOKEN=<token> to access gated Hugging Face repositories.
Loaded before services.llmhop.llama-cpp.models.<name>.environmentFile, so
per-model files override these entries.
Type: null or absolute path
Default:
null
Example:
"/etc/llama-cpp/.env"
services.llmhop.llama-cpp.modelSettings
CLI flags forwarded to the model server for every model.
true collapses to --<key>; null and false are dropped (write
the negated key explicitly, e.g. "no-mmap" = true;, when the upstream
CLI registers a --no-<key> form).
Merged with services.llmhop.llama-cpp.models.<name>.settings; per-model
entries take precedence.
Type: attribute set of anything
Default:
{ }
services.llmhop.llama-cpp.models
Models to serve.
Each entry produces one systemd service running llama-server; the
attribute name is the routing key surfaced through llmhop and the OpenAI
model field.
GPU selection is done via build-specific environment variables on
environment (top-level or per-model), since llama.cpp runs as a host
process — no CDI involved. Common variables: CUDA_VISIBLE_DEVICES
(CUDA), HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES (ROCm),
GGML_VK_VISIBLE_DEVICES (Vulkan), ZE_AFFINITY_MASK (SYCL).
Type: attribute set of (submodule)
Default:
{ }
Example:
{
"qwen3-8b" = {
port = 18001;
settings = {
hf-repo = "unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL";
temperature = 1.0;
top-k = 20;
};
# Pin this model to a specific GPU. The right variable depends on
# the llama.cpp build: CUDA_VISIBLE_DEVICES for CUDA,
# HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES for ROCm,
# GGML_VK_VISIBLE_DEVICES for Vulkan, ZE_AFFINITY_MASK for SYCL.
environment.CUDA_VISIBLE_DEVICES = "0";
};
}
services.llmhop.llama-cpp.models.<name>.enable
Whether to enable serving of model ‹name›.
Type: boolean
Default:
true
Example:
true
services.llmhop.llama-cpp.models.<name>.environment
Additional environment variables set on this model’s service.
Merged with services.llmhop.llama-cpp.environment; per-model entries
take precedence.
Type: attribute set of string
Default:
{ }
services.llmhop.llama-cpp.models.<name>.environmentFile
File in KEY=VALUE format forwarded to this model’s service.
Loaded after services.llmhop.llama-cpp.environmentFile, so its entries
override global ones. Must be readable by the user systemd reads it as.
Type: null or absolute path
Default:
null
services.llmhop.llama-cpp.models.<name>.name
Canonical identifier for this model. Used for the unit name
(llama-cpp-<name>) and as the routing key registered with llmhop
(clients select the backend by sending this value in the OpenAI
model field).
Defaults to the attribute key, so the key itself must match the required label format.
Type: string matching the pattern [[:alnum:]][[:alnum:].-]*
Default:
"‹name›"
services.llmhop.llama-cpp.models.<name>.port
Loopback host port that llama-server binds to. Must be unique per
enabled model; the gateway (llmhop) reaches each backend at
http://127.0.0.1:<port>.
Type: 16 bit unsigned integer; between 0 and 65535 (both inclusive)
services.llmhop.llama-cpp.models.<name>.settings
CLI flags forwarded to the model server for this model.
true collapses to --<key>; null and false are dropped (write
the negated key explicitly, e.g. "no-mmap" = true;, when the upstream
CLI registers a --no-<key> form).
Merged with services.llmhop.llama-cpp.modelSettings; per-model entries
take precedence.
Type: attribute set of anything
Default:
{ }
services.llmhop.llama-cpp.openFilesLimit
File descriptor limit (LimitNOFILE) applied to every llama-cpp systemd unit.
Increase if the server logs accept: Too many open files under concurrent load.
Type: positive integer, meaning >0
Default:
1048576
sglang
services.llmhop.sglang.enable
Whether to enable SGLang model serving via Quadlet, optionally fronted by the SGL Model Gateway.
Type: boolean
Default:
false
Example:
true
services.llmhop.sglang.cacheDir
Host directory bind-mounted as the Hugging Face cache for every worker.
Type: absolute path
Default:
"/var/cache/sglang"
services.llmhop.sglang.dataDir
Home directory of services.llmhop.sglang.user.
Used by rootless podman for container storage
(~/.local/share/containers), so it must live on a filesystem that
tolerates overlayfs.
Type: absolute path
Default:
"/var/lib/sglang"
services.llmhop.sglang.devices
Devices exposed to every model container — passed verbatim as Quadlet
AddDevice= lines. Accepts both CDI references (recommended:
nvidia.com/gpu=…, amd.com/gpu=…, intel.com/gpu=…, …) and raw
host device paths (e.g. /dev/dri/renderD128). For CDI, the
corresponding spec must be generated on the host (e.g.
nvidia-ctk cdi generate).
Defaults to [ "nvidia.com/gpu=all" ] when
hardware.nvidia-container-toolkit.enable is set, otherwise empty
(CPU-only). Per-model devices overrides this.
Type: list of string
Default:
if config.hardware.nvidia-container-toolkit.enable then
[ "nvidia.com/gpu=all" ]
else
[ ]
Example:
[
"amd.com/gpu=all"
]
services.llmhop.sglang.environment
Environment variables set on every model service.
Merged with services.llmhop.sglang.models.<name>.environment; per-model
entries take precedence.
Type: attribute set of string
Default:
{ }
services.llmhop.sglang.environmentFile
File in KEY=VALUE format forwarded to every service.
Use for secrets managed by sops-nix/agenix, e.g. a file containing
HF_TOKEN=<token> to access gated Hugging Face repositories.
Loaded before services.llmhop.sglang.models.<name>.environmentFile, so
per-model files override these entries.
Type: null or absolute path
Default:
null
Example:
"/etc/sglang/.env"
services.llmhop.sglang.gateway.enable
Whether to enable the SGL Model Gateway in front of the workers. Disabled by default because llmhop already routes between every backend; the gateway is only needed when you want SGLang’s IGW dispatch features (custom routing, prefix caching across workers, etc.).
Type: boolean
Default:
false
Example:
true
services.llmhop.sglang.gateway.enableMetrics
Whether to enable Prometheus metrics on the gateway.
Type: boolean
Default:
true
Example:
true
services.llmhop.sglang.gateway.bindAddress
Host address the gateway binds its listeners to. Defaults to the loopback address, so external clients must go through Caddy or llmhop.
Type: string
Default:
"127.0.0.1"
services.llmhop.sglang.gateway.digest
Immutable digest of the gateway image. Mutually exclusive with tag.
Type: null or string
Default:
null
services.llmhop.sglang.gateway.environment
Additional environment variables set on the gateway container.
Type: attribute set of string
Default:
{ }
services.llmhop.sglang.gateway.environmentFile
File in KEY=VALUE format forwarded to the gateway via --env-file.
Use for secrets like API keys; the gateway’s --api-key flag may also be passed via
settings if the value is non-secret.
Type: null or absolute path
Default:
null
Example:
"/etc/sglang/gateway.env"
services.llmhop.sglang.gateway.image
Container image used for the gateway.
Type: string
Default:
"docker.io/lmsysorg/sgl-model-gateway"
services.llmhop.sglang.gateway.metricsPort
Host port the gateway exposes Prometheus metrics on.
Ignored when enableMetrics is false.
Type: 16 bit unsigned integer; between 0 and 65535 (both inclusive)
Default:
29000
services.llmhop.sglang.gateway.port
Host port the gateway listens on.
Type: 16 bit unsigned integer; between 0 and 65535 (both inclusive)
services.llmhop.sglang.gateway.settings
Additional CLI flags forwarded to sgl-model-gateway.
true collapses to --<key>; null and false are dropped (write
the negated key explicitly when the upstream CLI registers one).
Type: attribute set of anything
Default:
{ }
Example:
{
api-key = "secret";
tls-cert-path = "/etc/sglang/tls/server.crt";
}
services.llmhop.sglang.gateway.tag
Tag of the gateway image. Mutually exclusive with digest.
Type: null or string
Default:
"latest"
services.llmhop.sglang.gid
Host GID assigned to services.llmhop.sglang.group and used as the
inner-to-outer mapping target in --gidmap. Defaults to uid.
Type: unsigned integer, meaning >=0
Default:
config.services.llmhop.sglang.uid
services.llmhop.sglang.group
Primary group for services.llmhop.sglang.user.
Defaults to the user name (matching the typical 1:1 user/group layout).
Type: string
Default:
config.services.llmhop.sglang.user
services.llmhop.sglang.image
Container image used for every model worker.
Type: string
Default:
"docker.io/lmsysorg/sglang"
services.llmhop.sglang.modelSettings
CLI flags forwarded to the model server for every model.
true collapses to --<key>; null and false are dropped (write
the negated key explicitly, e.g. "no-mmap" = true;, when the upstream
CLI registers a --no-<key> form).
Merged with services.llmhop.sglang.models.<name>.settings; per-model
entries take precedence.
Type: attribute set of anything
Default:
{ }
services.llmhop.sglang.models
Models to serve.
Each entry produces one quadlet container; the attribute name is the routing key
(advertised via --served-model-name and surfaced through both llmhop and the
optional SGL Model Gateway as the OpenAI model field).
Enabled entries are sorted by ascending port.
Type: attribute set of (submodule)
Default:
{ }
Example:
{
"qwen3-8b" = {
model = "Qwen/Qwen3-8B";
port = 19001;
settings = {
reasoning-parser = "qwen3";
tool-call-parser = "qwen3_coder";
mem-fraction-static = 0.6;
cuda-graph-max-bs = 4;
};
};
}
services.llmhop.sglang.models.<name>.enable
Whether to enable serving of model ‹name›.
Type: boolean
Default:
true
Example:
true
services.llmhop.sglang.models.<name>.devices
Devices exposed to this model’s container — passed verbatim as
Quadlet AddDevice= lines. Replaces (does not extend)
services.llmhop.sglang.devices for this model.
Use to pin a model to specific device indices
(e.g. [ "nvidia.com/gpu=0" ]).
Type: list of string
Default:
config.services.llmhop.sglang.devices
Example:
[
"nvidia.com/gpu=0"
]
services.llmhop.sglang.models.<name>.digest
Immutable digest of the container image (e.g. sha256:…).
Mutually exclusive with tag.
Type: null or string
Default:
null
Example:
"sha256:a73fb0b9046fee099f7c1829d2548e6cc1740f4c2776a6855fa659ae5d0deb49"
services.llmhop.sglang.models.<name>.environment
Additional environment variables set on this model’s service.
Merged with services.llmhop.sglang.environment; per-model entries
take precedence.
Type: attribute set of string
Default:
{ }
services.llmhop.sglang.models.<name>.environmentFile
File in KEY=VALUE format forwarded to this model’s service.
Loaded after services.llmhop.sglang.environmentFile, so its entries
override global ones. Must be readable by the user systemd reads it as.
Type: null or absolute path
Default:
null
services.llmhop.sglang.models.<name>.model
Hugging Face repo id (or local path) passed to the model server.
Type: string
Example:
"Qwen/Qwen2.5-7B-Instruct"
services.llmhop.sglang.models.<name>.name
Canonical identifier for this model. Used for the unit name
(sglang-<name>) and as the routing key registered with llmhop
(clients select the backend by sending this value in the OpenAI
model field).
Defaults to the attribute key, so the key itself must match the required label format.
Type: string matching the pattern [[:alnum:]][[:alnum:].-]*
Default:
"‹name›"
services.llmhop.sglang.models.<name>.port
Loopback host port forwarded to the container’s SGLang API.
Must be unique per model and must not collide with gateway.port /
gateway.metricsPort when the gateway is enabled.
Type: 16 bit unsigned integer; between 0 and 65535 (both inclusive)
services.llmhop.sglang.models.<name>.settings
CLI flags forwarded to the model server for this model.
true collapses to --<key>; null and false are dropped (write
the negated key explicitly, e.g. "no-mmap" = true;, when the upstream
CLI registers a --no-<key> form).
Merged with services.llmhop.sglang.modelSettings; per-model entries
take precedence.
Type: attribute set of anything
Default:
{ }
services.llmhop.sglang.models.<name>.shmSize
Size of the container’s private /dev/shm tmpfs.
PyTorch and friends use shared memory for NCCL/tensor-parallel inference;
upstream recommends 32g (or --ipc=host). A private tmpfs is preferred for
isolation: raise the value for larger models or higher tensor-parallel sizes.
Type: string
Default:
"32g"
Example:
"64g"
services.llmhop.sglang.models.<name>.tag
Tag of the container image used for this model.
Mutually exclusive with digest.
Type: null or string
Default:
null
services.llmhop.sglang.openFilesLimit
File descriptor limit (LimitNOFILE) applied to every sglang systemd unit.
Increase if the server logs accept: Too many open files under concurrent load.
Type: positive integer, meaning >0
Default:
1048576
services.llmhop.sglang.startupOrdering
Whether to chain enabled model services by ascending port during startup.
GPU-memory profiling races otherwise: two workers booting on the same device
each see it as fully free and race to claim their share, leading to OOM.
Disable only when each model pins itself to a dedicated device via
its own devices.
Type: boolean
Default:
true
services.llmhop.sglang.subGidCount
Size of the subordinate GID range mapped into every container.
Defaults to subUidCount.
Type: positive integer, meaning >0
Default:
config.services.llmhop.sglang.subUidCount
services.llmhop.sglang.subGidStart
First host GID of the subordinate range mapped into every container.
Defaults to subUidStart — most setups keep the UID and GID ranges aligned.
Type: unsigned integer, meaning >=0
Default:
config.services.llmhop.sglang.subUidStart
services.llmhop.sglang.subUidCount
Size of the subordinate UID range mapped into every container. 65536 covers the full unprivileged ID space inside the namespace.
Type: positive integer, meaning >0
Default:
65536
services.llmhop.sglang.subUidStart
First host UID of the subordinate range mapped into every container.
Container UIDs ≥1 are mapped to subUidCount consecutive host IDs starting here.
Required — pick a value clear of NixOS system users (<1000), regular login
UIDs, and other backends’ subordinate ranges on the same host.
Type: unsigned integer, meaning >=0
Example:
300000
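As a worked example of the ID arithmetic, the following Go sketch translates container UIDs to host UIDs under one reading of the options: container root maps to uid, and container UID n (for n ≥ 1) maps to subUidStart + n - 1. That offset, the type idMap and the method hostUID are assumptions made for illustration; the numbers come from the option examples.

```go
package main

import "fmt"

// idMap holds the three values that define the mapping (hypothetical type).
type idMap struct {
	UID         int // services.llmhop.<backend>.uid
	SubUIDStart int // services.llmhop.<backend>.subUidStart
	SubUIDCount int // services.llmhop.<backend>.subUidCount
}

// hostUID returns the host UID a container UID runs as, or false when the
// container UID falls outside the mapped range.
func (m idMap) hostUID(containerUID int) (int, bool) {
	switch {
	case containerUID == 0:
		return m.UID, true // container root is the dedicated system user
	case containerUID >= 1 && containerUID <= m.SubUIDCount:
		return m.SubUIDStart + containerUID - 1, true
	default:
		return 0, false
	}
}

func main() {
	m := idMap{UID: 503, SubUIDStart: 300000, SubUIDCount: 65536}
	for _, c := range []int{0, 1, 1000, 65536, 65537} {
		host, ok := m.hostUID(c)
		fmt.Println(c, host, ok)
	}
}
```

Under these assumptions the backend occupies host UIDs 300000 through 365535, which is the range to keep clear of other backends’ subordinate ranges.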
services.llmhop.sglang.tag
Default tag of the container image used for models that do not set their own
tag or digest.
Type: string
Example:
"latest"
services.llmhop.sglang.uid
Host UID assigned to services.llmhop.sglang.user and used as the
inner-to-outer mapping target in --uidmap.
Required — pick a value that does not clash with other system users on the
host.
Type: unsigned integer, meaning >=0
Example:
503
services.llmhop.sglang.user
Dedicated system user that owns the sglang cache directory and that
container root is mapped to via --uidmap. Defaults to the backend
name; override to point at a user the deployer manages externally
(in which case the matching users.users.<name> and
users.groups.<name> declarations become the deployer’s
responsibility).
Type: string
Default:
backend
vllm
services.llmhop.vllm.enable
Whether to enable vLLM model serving via Quadlet, fronted by llmhop.
Type: boolean
Default:
false
Example:
true
services.llmhop.vllm.cacheDir
Host directory bind-mounted as the Hugging Face cache for every worker.
Type: absolute path
Default:
"/var/cache/vllm"
services.llmhop.vllm.dataDir
Home directory of services.llmhop.vllm.user.
Used by rootless podman for container storage
(~/.local/share/containers), so it must live on a filesystem that
tolerates overlayfs.
Type: absolute path
Default:
"/var/lib/vllm"
services.llmhop.vllm.devices
Devices exposed to every model container — passed verbatim as Quadlet
AddDevice= lines. Accepts both CDI references (recommended:
nvidia.com/gpu=…, amd.com/gpu=…, intel.com/gpu=…, …) and raw
host device paths (e.g. /dev/dri/renderD128). For CDI, the
corresponding spec must be generated on the host (e.g.
nvidia-ctk cdi generate).
Defaults to [ "nvidia.com/gpu=all" ] when
hardware.nvidia-container-toolkit.enable is set, otherwise empty
(CPU-only). Per-model devices overrides this.
Type: list of string
Default:
if config.hardware.nvidia-container-toolkit.enable then
[ "nvidia.com/gpu=all" ]
else
[ ]
Example:
[
"amd.com/gpu=all"
]
services.llmhop.vllm.environment
Environment variables set on every model service.
Merged with services.llmhop.vllm.models.<name>.environment; per-model
entries take precedence.
Type: attribute set of string
Default:
{ }
services.llmhop.vllm.environmentFile
File in KEY=VALUE format forwarded to every service.
Use for secrets managed by sops-nix/agenix, e.g. a file containing
HF_TOKEN=<token> to access gated Hugging Face repositories.
Loaded before services.llmhop.vllm.models.<name>.environmentFile, so
per-model files override these entries.
Type: null or absolute path
Default:
null
Example:
"/etc/vllm/.env"
services.llmhop.vllm.gid
Host GID assigned to services.llmhop.vllm.group and used as the
inner-to-outer mapping target in --gidmap. Defaults to uid.
Type: unsigned integer, meaning >=0
Default:
config.services.llmhop.vllm.uid
services.llmhop.vllm.group
Primary group for services.llmhop.vllm.user.
Defaults to the user name (matching the typical 1:1 user/group layout).
Type: string
Default:
config.services.llmhop.vllm.user
services.llmhop.vllm.image
Container image used for every model worker.
Type: string
Default:
"docker.io/vllm/vllm-openai"
services.llmhop.vllm.modelSettings
CLI flags forwarded to the model server for every model.
true collapses to --<key>; null and false are dropped (write
the negated key explicitly, e.g. "no-mmap" = true;, when the upstream
CLI registers a --no-<key> form).
Merged with services.llmhop.vllm.models.<name>.settings; per-model
entries take precedence.
Type: attribute set of anything
Default:
{ }
services.llmhop.vllm.models
Models to serve.
Each entry produces one quadlet container; the attribute name is the routing key.
Enabled entries are sorted by ascending port.
Type: attribute set of (submodule)
Default:
{ }
Example:
{
"qwen2-5-7b" = {
model = "Qwen/Qwen2.5-7B-Instruct";
port = 18001;
};
"llama-3-8b" = {
model = "meta-llama/Meta-Llama-3-8B-Instruct";
port = 18002;
settings.max-model-len = 8192;
};
}
services.llmhop.vllm.models.<name>.enable
Whether to enable serving of model ‹name›.
Type: boolean
Default:
true
Example:
true
services.llmhop.vllm.models.<name>.devices
Devices exposed to this model’s container — passed verbatim as
Quadlet AddDevice= lines. Replaces (does not extend)
services.llmhop.vllm.devices for this model.
Use to pin a model to specific device indices
(e.g. [ "nvidia.com/gpu=0" ]).
Type: list of string
Default:
config.services.llmhop.vllm.devices
Example:
[
"nvidia.com/gpu=0"
]
services.llmhop.vllm.models.<name>.digest
Immutable digest of the container image (e.g. sha256:…).
Mutually exclusive with tag.
Type: null or string
Default:
null
Example:
"sha256:a73fb0b9046fee099f7c1829d2548e6cc1740f4c2776a6855fa659ae5d0deb49"
services.llmhop.vllm.models.<name>.environment
Additional environment variables set on this model’s service.
Merged with services.llmhop.vllm.environment; per-model entries
take precedence.
Type: attribute set of string
Default:
{ }
services.llmhop.vllm.models.<name>.environmentFile
File in KEY=VALUE format forwarded to this model’s service.
Loaded after services.llmhop.vllm.environmentFile, so its entries
override global ones. Must be readable by the user systemd reads it as.
Type: null or absolute path
Default:
null
services.llmhop.vllm.models.<name>.model
Hugging Face repo id (or local path) passed to the model server.
Type: string
Example:
"Qwen/Qwen2.5-7B-Instruct"
services.llmhop.vllm.models.<name>.name
Canonical identifier for this model. Used for the unit name
(vllm-<name>) and as the routing key registered with llmhop
(clients select the backend by sending this value in the OpenAI
model field).
Defaults to the attribute key, so the key itself must match the required label format.
Type: string matching the pattern [[:alnum:]][[:alnum:].-]*
Default:
"‹name›"
services.llmhop.vllm.models.<name>.port
Loopback host port forwarded to the container’s vLLM API. Must be unique per model.
Type: 16 bit unsigned integer; between 0 and 65535 (both inclusive)
services.llmhop.vllm.models.<name>.settings
CLI flags forwarded to the model server for this model.
true collapses to --<key>; null and false are dropped (write
the negated key explicitly, e.g. "no-mmap" = true;, when the upstream
CLI registers a --no-<key> form).
Merged with services.llmhop.vllm.modelSettings; per-model entries
take precedence.
Type: attribute set of anything
Default:
{ }
services.llmhop.vllm.models.<name>.shmSize
Size of the container’s private /dev/shm tmpfs.
PyTorch and friends use shared memory for NCCL/tensor-parallel inference;
upstream recommends 32g (or --ipc=host). A private tmpfs is preferred for
isolation: raise the value for larger models or higher tensor-parallel sizes.
Type: string
Default:
"32g"
Example:
"64g"
services.llmhop.vllm.models.<name>.tag
Tag of the container image used for this model.
Mutually exclusive with digest.
Type: null or string
Default:
null
services.llmhop.vllm.openFilesLimit
File descriptor limit (LimitNOFILE) applied to every vllm systemd unit.
Increase if the server logs accept: Too many open files under concurrent load.
Type: positive integer, meaning >0
Default:
1048576
services.llmhop.vllm.startupOrdering
Whether to chain enabled model services by ascending port during startup.
GPU-memory profiling races otherwise: two workers booting on the same device
each see it as fully free and race to claim their share, leading to OOM.
Disable only when each model pins itself to a dedicated device via
its own devices.
Type: boolean
Default:
true
services.llmhop.vllm.subGidCount
Size of the subordinate GID range mapped into every container.
Defaults to subUidCount.
Type: positive integer, meaning >0
Default:
config.services.llmhop.vllm.subUidCount
services.llmhop.vllm.subGidStart
First host GID of the subordinate range mapped into every container.
Defaults to subUidStart — most setups keep the UID and GID ranges aligned.
Type: unsigned integer, meaning >=0
Default:
config.services.llmhop.vllm.subUidStart
services.llmhop.vllm.subUidCount
Size of the subordinate UID range mapped into every container. 65536 covers the full unprivileged ID space inside the namespace.
Type: positive integer, meaning >0
Default:
65536
services.llmhop.vllm.subUidStart
First host UID of the subordinate range mapped into every container.
Container UIDs ≥1 are mapped to subUidCount consecutive host IDs starting here.
Required — pick a value clear of NixOS system users (<1000), regular login
UIDs, and other backends’ subordinate ranges on the same host.
Type: unsigned integer, meaning >=0
Example:
300000
services.llmhop.vllm.tag
Default tag of the container image used for models that do not set their own
tag or digest.
Type: string
Example:
"v0.11.0"
services.llmhop.vllm.uid
Host UID assigned to services.llmhop.vllm.user and used as the
inner-to-outer mapping target in --uidmap.
Required — pick a value that does not clash with other system users on the
host.
Type: unsigned integer, meaning >=0
Example:
503
services.llmhop.vllm.user
Dedicated system user that owns the vllm cache directory and that
container root is mapped to via --uidmap. Defaults to the backend
name; override to point at a user the deployer manages externally
(in which case the matching users.users.<name> and
users.groups.<name> declarations become the deployer’s
responsibility).
Type: string
Default:
backend