local llms in emacs with strix halo
problem
I got an HP ZBook Ultra G1a as a new laptop, the main draw being the strix halo processor it comes with. My particular machine has the AMD Ryzen™ AI Max+ PRO 395 and 64 GB of unified RAM, the idea being to run some relatively large local LLMs at a usable speed.
But while the hardware is powerful on paper, the software stack isn’t quite there yet to take advantage of it. The latest versions of ROCm, for example, don’t support the processor line and aren’t readily available on Fedora 42 (which otherwise works perfectly on the ZBook Ultra).
So the challenge is to find the best way to make the most of the hardware with what’s available. And, as a bonus, to integrate the solution with gptel to use local models in Emacs.
solution
Jeff Geerling explored some interesting solutions using a cluster of Framework mini PC mainboards.
But the silver bullet – especially on Fedora – is kyuz0’s
amd-strix-halo-toolboxes. It offers a bunch of containers that ship with
precompiled versions of llama.cpp for every backend (e.g. vulkan and rocm),
making it as simple as typing toolbox create and toolbox enter to run some
models with impressive performance. Kyuz0 also has a helpful interactive
performance comparison tool to pick the backend that best suits the model(s)
you’re looking to run.
Personally, I use the vulkan-radv backend for its superior performance running
Qwen3 models.
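For illustration, spinning up that toolbox looks roughly like this – a sketch only; the image tag is an assumption, so check the repo’s README for the current list of images:

# a sketch: create and enter a toolbox with the vulkan-radv build of llama.cpp
toolbox create --image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv llama-vulkan-radv
toolbox enter llama-vulkan-radv
# inside the toolbox, llama.cpp is prebuilt, so a model can be served directly
# (the model path here is just a placeholder)
llama-server -m ~/models/your-model.gguf -ngl 999 --host 127.0.0.1 --port 8999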
Just note that the host configuration suggested on the GitHub page apparently
contains an inaccuracy: you need to set ttm.pages_limit to
33554431 instead of the listed 33554432 (which is what I did).
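For reference, one way to apply that on Fedora is as a kernel boot argument via grubby – again a sketch; follow the toolboxes README for the full set of host tweaks:

# a sketch: set the ttm page limit as a kernel argument on Fedora
sudo grubby --update-kernel=ALL --args="ttm.pages_limit=33554431"
# reboot, then verify the value took effect
cat /sys/module/ttm/parameters/pages_limit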
running multiple models
While performance is better, one drawback of llama.cpp compared to ollama is that it can’t easily swap between models – unload a running model and load a new one on demand.
That’s where llama-swap comes in. With a single yaml file you can manage multiple models behind a proxy. The best part? It exposes OpenAI-compatible endpoints, which comes in handy when getting other tools and software to interact with it. In this case, Emacs.
The actual config is really straightforward too:
models:
  "qwen-coder-32B":
    proxy: "http://127.0.0.1:8999"
    cmd: >
      /usr/bin/llama-server
      --host 127.0.0.1 --port 8999 --flash-attn --slots
      --no-mmap
      -ngl 999
      -m /home/j/models/qwen3-coder-30B-A3B/Qwen3-30B-A3B-Instruct-2507-Q8_0.gguf
  "qwen-instruct-32B":
    proxy: "http://127.0.0.1:9251"
    cmd: >
      /usr/bin/llama-server
      --host 127.0.0.1 --port 9251 --flash-attn --slots
      --no-mmap
      -ngl 999
      -m /home/j/models/qwen3-coder-30B-A3B/Qwen3-30B-A3B-Instruct-2507-Q8_0.gguf
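With that in place, requesting either model name through the proxy makes llama-swap spin up the matching llama-server and route the request to it. A quick smoke test with curl – assuming llama-swap is listening on port 8080, as in the setup below – could look like:

# a sketch: ask the proxy for a completion; llama-swap loads the model on demand
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-instruct-32B", "messages": [{"role": "user", "content": "Hello!"}]}'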
models inside emacs
gptel is probably the most fleshed-out LLM client for Emacs. Setting it up to work
with llama-swap is actually straightforward, if a bit counterintuitive at
first. The trick is to set up an OpenAI-compatible backend (as you would for OpenWebUI) instead of a llama.cpp one,
targeting llama-swap’s /v1/chat/completions endpoint.
In practice, assuming llama-swap is running on port 8080, this looks something like the following in your config:
(gptel-make-openai "llama-swap"
  :host "localhost:8080"
  :protocol "http"
  :endpoint "/v1/chat/completions"
  :stream t
  :models '(qwen-coder-32B qwen-instruct-32B))

(setq gptel-backend (gptel-get-backend "llama-swap")
      gptel-model 'qwen-instruct-32B)
all together
The beauty of running llama.cpp in a container is that you can just start llama-swap with a single command without entering the container. So the same command should work across different containers and backends as they evolve over time:
toolbox run --container llama-vulkan-radv ~/builds/llama-swap --config ~/.config/llama-swap/config.yaml --listen localhost:8080
That’s it!
All that’s left is maybe automating the container startup so that llama-swap is always available when Emacs launches.
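One way to do that – a sketch, assuming a systemd user service plays nicely with toolbox and podman on your setup – is something like:

# a sketch: a systemd user service that keeps llama-swap running in the toolbox
mkdir -p ~/.config/systemd/user
cat > ~/.config/systemd/user/llama-swap.service <<'EOF'
[Unit]
Description=llama-swap proxy inside the llama-vulkan-radv toolbox

[Service]
# %h expands to the home directory; paths mirror the command above
ExecStart=/usr/bin/toolbox run --container llama-vulkan-radv %h/builds/llama-swap --config %h/.config/llama-swap/config.yaml --listen localhost:8080
Restart=on-failure

[Install]
WantedBy=default.target
EOF
systemctl --user daemon-reload
systemctl --user enable --now llama-swap.service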