LLMs accelerated with eGPU on a Raspberry Pi 5

After a long journey getting AMD graphics cards working on the Raspberry Pi 5, we finally have a stable patch for the amdgpu Linux kernel driver, and it works on AMD RX 400, 500, 6000, and (current-generation) 7000-series GPUs.

With that, we also have stable Vulkan graphics and compute API support.

When I wrote about getting a Radeon Pro W7700 running on the Pi, I also mentioned AMD is not planning on supporting Arm with their ROCm GPU acceleration framework. At least not anytime soon.

Luckily, the Vulkan SDK can be used in its place, and in some cases even outperforms ROCm—especially on consumer cards where ROCm isn't even supported on x86!

Installing llama.cpp with Vulkan support on the Pi 5

Assuming you already have an AMD graphics card (I tested with an RX 6700 XT), and you built a custom kernel using our amdgpu patch (instructions here), you can compile llama.cpp on the Pi 5 with Vulkan support:

# Install dependencies: Vulkan SDK, glslc, and cmake
sudo apt install -y libvulkan-dev glslc cmake

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with Vulkan support
cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release

Now you can download a model (e.g. from Hugging Face) and test that llama.cpp is using the GPU to accelerate inference:

# Download llama3.2:3b
cd models && wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf

# Run it.
cd ../
./build/bin/llama-cli -m "models/Llama-3.2-3B-Instruct-Q4_K_M.gguf" -p "Why is the blue sky blue?" -n 50 -e -ngl 33 -t 4

# In the output, you should see that ggml_vulkan detected your GPU. For example:
# ggml_vulkan: Found 1 Vulkan devices:
# ggml_vulkan: 0 = AMD Radeon RX 6700 XT (RADV NAVI22) (radv) | uma: 0 | fp16: 1 | warp size: 64
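
If you'd rather script that check than eyeball it, here's a minimal sketch. It assumes the log line format shown above (`ggml_vulkan: Found N Vulkan devices`); `vulkan_device_count` is a hypothetical helper name, not something llama.cpp provides.

```shell
# Hypothetical helper (not part of llama.cpp): read a llama.cpp log on stdin
# and print N from the first "ggml_vulkan: Found N Vulkan devices" line.
# Prints nothing if no such line is present.
vulkan_device_count() {
  sed -n 's/^ggml_vulkan: Found \([0-9][0-9]*\) Vulkan devices.*/\1/p' | head -n 1
}

# Example using the output shown above:
printf '%s\n' \
  'ggml_vulkan: Found 1 Vulkan devices:' \
  'ggml_vulkan: 0 = AMD Radeon RX 6700 XT (RADV NAVI22) (radv) | uma: 0 | fp16: 1 | warp size: 64' \
  | vulkan_device_count
```

With a real run, you could save the output to a file (`run.log` is just an example name) and feed it in with `vulkan_device_count < run.log`. An empty result means the detection line wasn't found in the log.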

You can also monitor the GPU statistics with tools like nvtop (sudo apt install -y nvtop) or amdgpu_top (build instructions).

Note: I'd like to thank GitHub user @0cc4m especially for help getting this working, along with others who've contributed to the issues over on my Pi PCIe project!

On my RX 6700 XT, I can confirm the model gets loaded into the VRAM and the GPU is used for inference:

nvtop RX 6700 XT showing VRAM usage on Pi 5
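
For a scriptable version of the same check, the amdgpu driver exposes VRAM counters in sysfs. The sketch below assumes a `mem_info_vram_used` file (a byte count) is present for your card. The card index differs between systems, so it loops over every DRM card instead of hard-coding one. `vram_used_mib` is a hypothetical helper name.

```shell
# Hypothetical helper: print VRAM in use, in MiB, given the path to an
# amdgpu sysfs counter file that holds a byte count.
vram_used_mib() {
  awk '{ printf "%d\n", $1 / 1048576 }' "$1"
}

# The card index is an assumption that varies per system, so check every
# DRM card that exposes the counter. Prints nothing if none do.
for f in /sys/class/drm/card*/device/mem_info_vram_used; do
  if [ -r "$f" ]; then
    echo "$f: $(vram_used_mib "$f") MiB"
  fi
done
```

Running it before and after loading a model should show the jump in VRAM use that nvtop displays.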

Performance

I went to Micro Center and bought a couple more consumer graphics cards for testing, and matched that up with the cards I already own, as well as my M1 Max Mac Studio, which has 64 GB of shared RAM and 24 GPU cores:

llama.cpp inference speeds on Pi 5 and M1 Max Mac Studio

I tested a variety of models, including some not pictured here, like Mistral Small Instruct (a 22-billion-parameter model) and Qwen2.5 (a 14-billion-parameter model). Some models had to be split between the Pi's pokey CPU and the GPU, while others fit entirely on the GPU.

The amdgpu driver patch translates memory access inefficiently in many cases, and I think that's what kills performance with larger models.

But for smaller models (ones targeted at client devices and consumer GPUs), the Pi and Vulkan don't seem to be much of a bottleneck!
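
If you want to compare runs yourself, llama-cli prints timing lines at the end of a run (there's a sample in the comments below). Here's a minimal sketch for pulling the generation speed out of that output, assuming the line format stays the same; `eval_tokens_per_second` is a hypothetical helper name.

```shell
# Hypothetical helper: read llama.cpp timing output on stdin and print the
# generation rate (tokens per second) from the "eval time" line.
eval_tokens_per_second() {
  # Skip the "prompt eval time" line, which also contains "eval time =".
  awk '/eval time =/ && !/prompt eval time/ { print $(NF-3) }'
}

# Example using the two timing lines from a WX 3100 run:
printf '%s\n' \
  'llama_perf_context_print: prompt eval time =     251.03 ms /     8 tokens (   31.38 ms per token,    31.87 tokens per second)' \
  'llama_perf_context_print:        eval time =    2899.79 ms /    49 runs   (   59.18 ms per token,    16.90 tokens per second)' \
  | eval_tokens_per_second
```

That prints 16.90, which matches the raw numbers on the line: 49 tokens in about 2.9 seconds.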

And as pointed out on Reddit, the main virtue of this system as opposed to any old PC with a graphics card is idle power efficiency:

llama.cpp system Pi 5 idle power draw

The Pi only consumes 3W of power at idle, and if you pair it with an efficient graphics card and PSU, the entire setup only uses 10-12W of power when it's not actively running a model!

I see plenty of AMD and Intel systems that burn that much power just in the CPU, not accounting for the rest of the system.
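
To put those idle figures in perspective, here's the arithmetic for a full year of idling, using only the wattages quoted above. Electricity prices vary, so the sketch stops at kWh. `watts_to_kwh_per_year` is a hypothetical helper name.

```shell
# Hypothetical helper: convert a constant power draw in watts to
# kilowatt-hours per year (24 hours x 365 days).
watts_to_kwh_per_year() {
  awk -v w="$1" 'BEGIN { printf "%.1f\n", w * 24 * 365 / 1000 }'
}

watts_to_kwh_per_year 3    # Pi 5 alone at idle: 26.3
watts_to_kwh_per_year 10   # whole setup, low end of the idle range: 87.6
watts_to_kwh_per_year 12   # whole setup, high end of the idle range: 105.1
```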

Goals

I am a bit of an 'AI skeptic'. I still prefer we call it machine learning and LLMs, instead of 'AI chatbots' and stuff like that—those are marketing words. I'm also concerned the AI bubble is still inflating, and the higher it goes, the worse the fallout will be.

However, I do see some great use cases—ones made easier when you can build a tiny, compact, power-sipping LLM runner. Future CM5 + GPU dock, anyone?

For me, the three things I can see one of these builds doing are:

  • Faster, local text-to-speech and speech-to-text transcoding (for Home Assistant Voice Control)
  • Useful AI 'rubber duck' sessions (I can bounce an idea off an AI model—kind of like a tiny local Google search index without the first page of results all being ads)
  • Reducing the inexorably-large footprint of LLMs running everywhere all the time. If you're running a homelab on a Dell R720, not only are you likely going deaf over time, it's eating up a lot of power... a small, quiet setup for LLMs is good, IMO.

Pi 5 llama.cpp RX 6700 XT setup

The Pi 5 setup I have is about $700 new, and could be down to $300-400 if you use a used graphics card or one you already own. Here's my exact setup (some links are affiliate links):

If Raspberry Pi built a Pi 5 with 16 GB of RAM, some larger models may be more feasible. We can also still optimize the amdgpu driver patch further; follow my Pi PCIe project for more on that.

All my test data and benchmarks are in this issue on GitHub.

Comments

Hi Jeff,

I am more interested in running LLMs on the GPU than in running games. I like what you folks have contributed and would like to try this, just out of curiosity.
I was planning to buy the RX 6500 XT. In the performance chart above, how much graphics RAM did your RX 6500 XT have? Also, does the brand matter, like MSI or Gigabyte?
I assumed it was 4 GB, given that the model you were running is about 2.2 GB.
I am a noob with GPUs and would like to learn. You might know me as "smart home circle" from Twitter :)

I bought an ASRock 8 GB model. Brands don't matter too much for the most part, but some have better or worse coolers.

Thanks for this Jeff.

So when you run the model, it gets loaded into VRAM. If the model is around 2 GB, can memory consumption go above twice that? I am curious because I want to decide whether I can buy the 4 GB version, which is widely available on eBay and cheaper than the 8 GB version.

Dear Jeff, this is so great. Thank you very much for everything you’re doing. And now this! You got me thinking NOT to buy this bloated gaming PC, but expand my RPi 5 the way you did. One question: you’re now using the eGPU for text-to-text LLM generation. Could the same set-up also be used for text-to-image generation on the eGPU, e.g., using Stable Diffusion?

Nice! Would you say that your whole Raspberry Pi gaming setup is more expensive than a full PlayStation or Xbox setup would be? Is the speed and FPS similar? What about that in comparison to a gaming PC?

Hi Jeff,

Same guy as in your other post (issues with RX6700XT, WX3100 working)

Wanted to give a quick heads-up in case anyone else tries to make this work: at the time of this writing, there is an issue with llama.cpp and it won't build. Here's the link to the GitHub issue. There's a link in the thread to someone's working branch, and I just built it successfully.

I ran the same test as you and the WX3100 isn't that bad!

rpi5:~/llama.cpp $ ./build/bin/llama-cli -m "models/Llama-3.2-3B-Instruct-Q4_K_M.gguf" -p "Why is the blue sky blue?" -n 50 -e -t 4 -ngl 33
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Pro WX 3100 (RADV POLARIS12) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
...

Why is the blue sky blue? The answer is not that simple. The color of the sky appears blue because of a phenomenon called Rayleigh scattering. This is named after the British physicist Lord Rayleigh, who in 1871 explained the scattering of light by small particles. Rayleigh

llama_perf_sampler_print:    sampling time =      12.46 ms /    58 runs   (    0.21 ms per token,  4654.52 tokens per second)
llama_perf_context_print:        load time =    4034.41 ms
llama_perf_context_print: prompt eval time =     251.03 ms /     8 tokens (   31.38 ms per token,    31.87 tokens per second)
llama_perf_context_print:        eval time =    2899.79 ms /    49 runs   (   59.18 ms per token,    16.90 tokens per second)
llama_perf_context_print:       total time =    3275.01 ms /    57 tokens

thanks for your work :)