llama-cpp - llmhop

services.llmhop.llama-cpp.enable

Whether to enable llama.cpp model serving via systemd, fronted by llmhop.

Type: boolean

Default:

false

Example:

true

services.llmhop.llama-cpp.package

The llama-cpp package to use.

Type: package

Default:

pkgs.llama-cpp

services.llmhop.llama-cpp.environment

Environment variables set on every model service. Merged with services.llmhop.llama-cpp.models.<name>.environment; per-model entries take precedence.

Type: attribute set of string

Default:

{ }
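
The merge with per-model entries can be illustrated with a minimal sketch. It assumes a CUDA build of llama.cpp and reuses the qwen3-8b model from the models example; the device indices are illustrative only.

```nix
{
  services.llmhop.llama-cpp = {
    enable = true;

    # Applied to every model service.
    environment.CUDA_VISIBLE_DEVICES = "0";

    models."qwen3-8b" = {
      port = 18001;
      settings.hf-repo = "unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL";
      # The per-model entry takes precedence, so this service sees "1".
      environment.CUDA_VISIBLE_DEVICES = "1";
    };
  };
}
```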

services.llmhop.llama-cpp.environmentFile

File in KEY=VALUE format forwarded to every service. Use for secrets managed by sops-nix/agenix, e.g. a file containing HF_TOKEN=<token> to access gated Hugging Face repositories. Loaded before services.llmhop.llama-cpp.models.<name>.environmentFile, so per-model files override these entries.

Type: null or absolute path

Default:

null

Example:

"/etc/llama-cpp/.env"

services.llmhop.llama-cpp.modelSettings

CLI flags forwarded to the model server for every model. true collapses to --<key>; null and false are dropped (write the negated key explicitly, e.g. "no-mmap" = true;, when the upstream CLI registers a --no-<key> form). Merged with services.llmhop.llama-cpp.models.<name>.settings; per-model entries take precedence.

Type: attribute set of anything

Default:

{ }
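
As a rough sketch of how values map to flags: the boolean and null cases follow the description above, while the rendering of value-carrying keys as --<key> <value> is an assumption. The keys are ordinary llama-server options chosen for illustration.

```nix
{
  services.llmhop.llama-cpp.modelSettings = {
    no-mmap = true;      # true collapses to the bare flag --no-mmap
    mlock = false;       # false is dropped, so no flag is emitted
    threads = null;      # null is dropped as well
    ctx-size = 8192;     # assumed to render as --ctx-size 8192
    n-gpu-layers = 99;   # assumed to render as --n-gpu-layers 99
  };
}
```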

services.llmhop.llama-cpp.models

Models to serve. Each entry produces one systemd service running llama-server; the attribute name is the routing key surfaced through llmhop and the OpenAI model field.

GPU selection is done via build-specific environment variables set through environment (top-level or per-model). Because llama.cpp runs as a host process, no CDI is involved. Common variables: CUDA_VISIBLE_DEVICES (CUDA), HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES (ROCm), GGML_VK_VISIBLE_DEVICES (Vulkan), ZE_AFFINITY_MASK (SYCL).

Type: attribute set of (submodule)

Default:

{ }

Example:

{
  "qwen3-8b" = {
    port = 18001;
    settings = {
      hf-repo = "unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL";
      temperature = 1.0;
      top-k = 20;
    };
    # Pin this model to a specific GPU. The right variable depends on
    # the llama.cpp build: CUDA_VISIBLE_DEVICES for CUDA,
    # HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES for ROCm,
    # GGML_VK_VISIBLE_DEVICES for Vulkan, ZE_AFFINITY_MASK for SYCL.
    environment.CUDA_VISIBLE_DEVICES = "0";
  };
}

services.llmhop.llama-cpp.models.<name>.enable

Whether to enable serving of model ‹name›.

Type: boolean

Default:

true

Example:

true

services.llmhop.llama-cpp.models.<name>.environment

Additional environment variables set on this model’s service. Merged with services.llmhop.llama-cpp.environment; per-model entries take precedence.

Type: attribute set of string

Default:

{ }

services.llmhop.llama-cpp.models.<name>.environmentFile

File in KEY=VALUE format forwarded to this model’s service. Loaded after services.llmhop.llama-cpp.environmentFile, so its entries override global ones. The file must be readable by the user that systemd reads it as.

Type: null or absolute path

Default:

null
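
The load order of the global and per-model files might be used as follows. This is a sketch that assumes agenix is already set up on the host; the secret names and the .age file paths are hypothetical.

```nix
{ config, ... }:
{
  # agenix secret declarations; names and paths are placeholders.
  age.secrets.llama-cpp-env.file = ./secrets/llama-cpp-env.age;
  age.secrets.qwen3-8b-env.file = ./secrets/qwen3-8b-env.age;

  services.llmhop.llama-cpp = {
    # Shared entries such as HF_TOKEN=<token>; loaded first.
    environmentFile = config.age.secrets.llama-cpp-env.path;

    models."qwen3-8b" = {
      port = 18001;
      settings.hf-repo = "unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL";
      # Loaded second, so a key repeated here overrides the global one.
      environmentFile = config.age.secrets.qwen3-8b-env.path;
    };
  };
}
```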

services.llmhop.llama-cpp.models.<name>.name

Canonical identifier for this model. Used for the unit name (llama-cpp-<name>) and as the routing key registered with llmhop (clients select the backend by sending this value in the OpenAI model field).

Defaults to the attribute key, so the key itself must match the required label format.

Type: string matching the pattern [[:alnum:]][[:alnum:].-]*

Default:

"‹name›"

services.llmhop.llama-cpp.models.<name>.port

Loopback host port that llama-server binds to. Must be unique per enabled model; the gateway (llmhop) reaches each backend at http://127.0.0.1:<port>.

Type: 16 bit unsigned integer; between 0 and 65535 (both inclusive)
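
A minimal sketch of two models on distinct loopback ports follows. The second model's name and GGUF path are hypothetical placeholders.

```nix
{
  services.llmhop.llama-cpp.models = {
    # Reached by llmhop at http://127.0.0.1:18001
    "qwen3-8b" = {
      port = 18001;
      settings.hf-repo = "unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL";
    };

    # Reached by llmhop at http://127.0.0.1:18002
    "example-model" = {
      port = 18002;
      settings.model = "/var/lib/models/example.gguf";
    };
  };
}
```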

services.llmhop.llama-cpp.models.<name>.settings

CLI flags forwarded to the model server for this model. true collapses to --<key>; null and false are dropped (write the negated key explicitly, e.g. "no-mmap" = true;, when the upstream CLI registers a --no-<key> form). Merged with services.llmhop.llama-cpp.modelSettings; per-model entries take precedence.

Type: attribute set of anything

Default:

{ }
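
The interaction with modelSettings can be sketched as below. The ctx-size values are illustrative, and the resulting flag set in the comments assumes the merge behaves as described above.

```nix
{
  services.llmhop.llama-cpp = {
    # Defaults shared by every model.
    modelSettings = {
      ctx-size = 8192;
      no-mmap = true;
    };

    models."qwen3-8b" = {
      port = 18001;
      settings = {
        hf-repo = "unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL";
        # Per-model entry wins over modelSettings.ctx-size.
        ctx-size = 32768;
      };
      # Effective settings for this model: hf-repo, ctx-size = 32768,
      # and no-mmap = true inherited from modelSettings.
    };
  };
}
```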

services.llmhop.llama-cpp.openFilesLimit

File descriptor limit (LimitNOFILE) applied to every llama-cpp systemd unit. Increase it if the server logs "accept: Too many open files" under concurrent load.

Type: positive integer, meaning >0

Default:
