Recently an Igalia engineer posted a NUMA Emulation patch for the Pi 5 to the Linux Kernel mailing list. He said it could improve Geekbench 6 scores by up to 6% for single-core and 18% for multi-core.
My testing didn't quite match those numbers, but I did see a significant and consistent performance increase across both Geekbench 6:
And High Performance Linpack:
If you want to see all the gory details of my test process and setup (and how to replicate the results), check out the issue I posted to my top500 repository: Benchmark Raspberry Pi 5 Linux kernel NUMA patch.
Update August 2024: Until this is in Pi OS proper, you can install the patch by running sudo rpi-update pulls/6273 (see this issue). You can still follow the steps below if you like, but using rpi-update means you don't have to recompile the kernel :)
Evaluating the patch is a little involved (especially if you're not familiar with compiling the Linux kernel):
- Download the .mbox file for the kernel patch thread.
- Apply it to your raspberrypi/linux checkout with git am [filename.mbox]
- Rebuild the Linux kernel, ensuring NUMA Emulation is enabled in the kernel config.
- Add numa=fake=4 to /boot/firmware/cmdline.txt before the rootwait option, and reboot.
- Prefix any commands you want to test with numactl, e.g.: numactl --interleave=all ./geekbench6 (install numactl with sudo apt install -y numactl).
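The cmdline.txt edit from the steps above can be sketched as a small shell helper. This is a minimal sketch, not an official tool: add_numa_fake is a hypothetical function name, and it assumes the file holds a single kernel command line that contains a rootwait token. It takes the file path as an argument so you can try it on a copy first.

```shell
#!/usr/bin/env bash
# Sketch only: add_numa_fake is a hypothetical helper, not part of Pi OS.
# It inserts numa=fake=4 immediately before the first rootwait token.
add_numa_fake() {
  local file="$1"

  # Leave the file alone if a numa=fake= option is already present.
  if grep -q 'numa=fake=' "$file"; then
    return 0
  fi

  # Match rootwait as a whole token (line start or space before it,
  # space or line end after it) and put the new option in front of it.
  sed -E 's/(^| )rootwait( |$)/\1numa=fake=4 rootwait\2/' "$file" > "$file.tmp" \
    && mv "$file.tmp" "$file"
}

# Example on the Pi itself (run as root, after making a backup):
#   cp /boot/firmware/cmdline.txt /boot/firmware/cmdline.txt.bak
#   add_numa_fake /boot/firmware/cmdline.txt
#   reboot
# After rebooting, numactl --hardware lists the nodes the kernel sees.
```

If the file has no rootwait token, the helper leaves it unchanged, so it is worth checking the result with cat before rebooting.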
It remains to be seen whether the patch will make it in—similar NUMA emulation exists for x86 already, so there is precedent. Otherwise Raspberry Pi could maintain the code in their own Linux fork or pull some of the memory layout changes into firmware, maybe.
Pi 1, 3+ Efficiency gains via s2idle
Separately, Stefan Wahren posted a patch for the Raspberry Pi 1 B, 3 A+, and 3 B+, implementing support for S2Idle on those models.
Suspend-to-idle is a lightweight sleep state a computer can employ to save a little juice while it's not doing much.
In the Pi's case, at least on the Pi 1 B, this results in roughly a 20% power savings while idle:
- running but CPU idle = 1.67 W
- suspend to idle = 1.33 W
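As a quick check of the percentage, here is the arithmetic for the two readings above, computed relative to the 1.67 W running-idle baseline. The helper name savings_pct is made up for this sketch; a different baseline or rounding would give a slightly different figure.

```shell
#!/usr/bin/env bash
# Sketch only: savings_pct is a hypothetical helper for this example.
# Prints the percentage reduction going from $1 watts down to $2 watts.
savings_pct() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

# Readings quoted above for the Pi 1 B.
savings_pct 1.67 1.33   # prints 20.4
```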
The patch doesn't yet reduce the USB bus power draw (due to this issue), but if that can be solved, there may be even more savings in the future.
No word on whether this patch will make it in, but it's being actively reviewed at the time of this writing.
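If you do end up running a kernel with suspend-to-idle support, one way to check for it is to look for freeze in /sys/power/state, which is the name the kernel uses for suspend-to-idle in that file. The sketch below assumes that standard sysfs layout; supports_s2idle is a hypothetical helper name, and it accepts an alternate file path so it can be tried without touching the real sysfs.

```shell
#!/usr/bin/env bash
# Sketch only: supports_s2idle is a hypothetical helper for this example.
# Succeeds if the given file (default: /sys/power/state) lists "freeze".
supports_s2idle() {
  local state_file="${1:-/sys/power/state}"
  [ -r "$state_file" ] && grep -qw 'freeze' "$state_file"
}

if supports_s2idle; then
  echo "suspend-to-idle is available"
else
  echo "suspend-to-idle is not listed"
fi

# To actually enter the state (as root):
#   echo freeze > /sys/power/state
# The system stays suspended until a configured wakeup source fires.
```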
A2 microSD card Command Queueing support (for 2-3x faster random access)
One thing that's actually implemented on the Pi 5 now—no need for a kernel patch review—is A2 microSD card Command Queueing.
To enable it on your Pi 5, make sure you're on the latest update, add dtparam=sd_cqe to /boot/firmware/config.txt, and reboot.
If it's working, and you have an A2 card (most older cards are either A1 or not rated at all), then you should see something like the following in the dmesg logs:
mmc0: Command Queue Engine enabled, 31 tags
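The enable-and-verify steps above could be scripted roughly like this. It is a sketch under a few assumptions: enable_sd_cqe and cqe_enabled are hypothetical helper names, and appending to the end of config.txt assumes the last section filter in the file (typically [all]) applies to your board.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical, not part of Pi OS.

# Append dtparam=sd_cqe to the given config file unless it is already there.
enable_sd_cqe() {
  local file="$1"
  if grep -qx 'dtparam=sd_cqe' "$file"; then
    return 0
  fi
  # Add a newline first if the file does not end with one.
  if [ -n "$(tail -c1 "$file")" ]; then
    echo >> "$file"
  fi
  printf 'dtparam=sd_cqe\n' >> "$file"
}

# Read kernel log text on stdin; succeed if the CQE message is present.
cqe_enabled() {
  grep -q 'Command Queue Engine enabled'
}

# Example on the Pi itself (run as root):
#   enable_sd_cqe /boot/firmware/config.txt
#   reboot
#   dmesg | cqe_enabled && echo "CQE active"
```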
Check my full test results here, but here's a summary of my testing with both the Raspberry Pi Diagnostics tool:
...and my own disk-benchmark.sh tool using iozone:
I have a full video on my YouTube channel going over everything in more detail, with a little more explanation, including why I haven't been able to test the NUMA emulation (which aims to be generic for all Arm devices) on Rockchip RK3588 boards:
Comments
Hello there. I tried to apply the .mbox patch, and I also looked into the setup details you put in the GitHub issue you linked here. When applying it, however, git gave me "empty patch" errors. I tried enabling NUMA emulation and compiling the kernel anyway, and added the lines to /boot/firmware/config.txt and /boot/firmware/cmdline.txt, but numactl still tells me that there is no NUMA emulation available. Did I miss some steps?
Try running `sudo rpi-update pulls/6273` to install firmware with the NUMA feature. Here's a link to the GitHub issue: https://github.com/raspberrypi/firmware/issues/1854#issuecomment-226572…
Thanks! I'll add a note in the blog post.
Let's talk about the Geekbench 6 figure.
As you can see, the single-core test scores about 800, but the multi-core score is only about 2 times higher, where we would expect about 4 times with four cores. I suspect the root cause is low memory bandwidth: this SoC has only single-channel RAM, and the RAM has only a 32-bit data bus. With a 64-bit data bus, all 4 CPU cores could be kept fully loaded. Of course, it also depends on the L3 speed and the speed of the internal CPU bus.
If a Pi 6 had an 8-core CPU, it would need a 128-bit RAM data bus and a slightly higher RAM clock.
I think this NUMA patch has nothing to do with the Pi as such, because the board has only one CPU socket, and NUMA functionality is meant for multi-socket boards. That means the improvement here comes from a side effect.
The real improvement would have to come from dual-channel RAM with a 64-bit or 128-bit data bus, unlike the current 32-bit data bus.