55 TOPS Raspberry Pi AI PC - 4 TPUs, 2 NPUs

I'm in full-on procrastination mode: Open Sauce is coming up in 10 days and I haven't started my project for it, so I decided to try building a stable AI PC with all the AI accelerator chips I own:

  • Hailo-8 (26 TOPS)
  • Hailo-8L (13 TOPS)
  • 2x Coral Dual Edge TPU (8+8 = 16 TOPS)
  • 2x Coral Edge TPU (4+4 = 8 TOPS)

After my first faltering attempt while testing Raspberry Pi's new AI Kit, I decided to try building it again, this time with a more 'proper' PCIe setup that supplies external 12V power to the PCIe devices, courtesy of an uPCIty Lite PCIe HAT for the Pi 5.

Raspberry Pi 55 TOPS AI Board

I'm... not sure it's that much less janky, but at least I had one board with a bunch of M.2 cards instead of many precariously stacked on top of each other!

Hardware-wise, I have 63 potential TOPS of neural compute. But only 55 are usable, since the Alftel 12x PCIe M.2 adapter card I'm using only supports one lane per slot. Each Dual Edge TPU needs two lanes wired up on its A+E key (a slightly non-standard M.2 configuration), so only one of the two TPUs on each of those cards is reachable.
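The 63 versus 55 figures can be checked with a bit of arithmetic. Here is a minimal sketch, assuming each chip contributes its vendor-rated TOPS and that a single-lane slot exposes only one chip per module; the `Module` type and its field names are my own, purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class Module:
    name: str
    count: int            # how many of these M.2 modules are installed
    chips: int            # accelerator chips per module
    tops_per_chip: int    # vendor-rated TOPS per chip

modules = [
    Module("Hailo-8", 1, 1, 26),
    Module("Hailo-8L", 1, 1, 13),
    Module("Coral Dual Edge TPU", 2, 2, 4),
    Module("Coral Edge TPU", 2, 1, 4),
]

# Everything on the bench, if every chip were reachable.
potential_tops = sum(m.count * m.chips * m.tops_per_chip for m in modules)

# Assumption: with one PCIe lane per slot, only one chip per module is usable.
lanes_per_slot = 1
usable_tops = sum(
    m.count * min(m.chips, lanes_per_slot) * m.tops_per_chip for m in modules
)

print(f"potential: {potential_tops} TOPS, usable: {usable_tops} TOPS")
```

The 8 TOPS gap is exactly the two second-chip TPUs on the Dual Edge cards.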

None of that's helpful at all if I can't load drivers and access all these NPUs and TPUs. Luckily, I can! Following this guide from MidnightLink, I was able to compile the Coral's apex driver on Pi OS 12, and use it with CodeProject.AI.

Once I made the necessary changes for the Coral TPUs, the Hailo accelerators also worked behind the PCIe switch, something that doesn't work out of the box right now due to some PCIe quirks on the Pi 5. Luckily, fixing that only requires adding an overlay inside /boot/firmware/config.txt:

# Required for the Coral TPUs.
kernel=kernel8.img
dtoverlay=pineboards-hat-ai

# Required for the Hailo, unless using the above overlays for Coral compatibility.
dtoverlay=pciex1-compat-pi5,no-mip

If you're not using a PCIe switch like I am, you don't need to add any of that, except maybe the kernel change for Coral TPUs. And if you're not using a switch (or you have one rated for PCIe Gen 3), you should also try dtparam=pciex1_gen=3 to almost double your bandwidth.
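To see where "almost double" comes from, here is a rough calculation of theoretical single-lane throughput. It only accounts for the line encoding (8b/10b at Gen 2, 128b/130b at Gen 3) and ignores packet and protocol overhead, so real-world numbers will be lower; the function name is just for this sketch:

```python
def lane_bandwidth_mb_s(gigatransfers_per_s, payload_bits, encoded_bits):
    """Theoretical one-direction bandwidth of a single PCIe lane, in MB/s."""
    bits_per_s = gigatransfers_per_s * 1e9 * payload_bits / encoded_bits
    return bits_per_s / 8 / 1e6

gen2 = lane_bandwidth_mb_s(5.0, 8, 10)      # Gen 2: 5 GT/s, 8b/10b encoding
gen3 = lane_bandwidth_mb_s(8.0, 128, 130)   # Gen 3: 8 GT/s, 128b/130b encoding
ratio = gen3 / gen2

print(f"Gen 2 x1: {gen2:.0f} MB/s")
print(f"Gen 3 x1: {gen3:.0f} MB/s")
print(f"Speedup:  {ratio:.2f}x")
```

The raw transfer rate only goes up 1.6x, but the more efficient encoding at Gen 3 pushes the usable gain close to 2x.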

Anyway, once I did all that, I could also use the Hailo for inference, though examples of how to use multiple Hailos are not easy to find yet. I know the topology of the Hailo-8 Century is very similar to what I've built... just a little less janky. It would be interesting to see full support for multi-NPU setups like this from more software.

EBV Elektronik even demoed running 4x Raspberry Pi cameras through 4 separate Hailo-8 NPUs on a Seaberry board a couple years ago.

With the Hailo-8L available at a more attractive price ($70 in the AI Kit, at least), it's not unreasonable to expect people to hack together systems with multiple NPUs like this. Maybe not 12, though.

I have a video where I go into more detail on my 2nd channel, Level2Jeff:

There are still many caveats, which mean I can't just say "this setup is faster than a Copilot+ PC that has 40 TOPS":

  • Software support for uniting multiple NPUs is a bit lacking. Some things can support it, but it's not as easy as one big accelerator.
  • Hailo hasn't stated exactly how much RAM they have on-chip—but it's probably not that much, limiting the use to smaller models.
  • The Pi's PCIe Gen 2 bus can be uprated to Gen 3 (and in my experience works great at this speed)... but most PCIe switches that aren't extremely expensive are still Gen 2, so you are a bit bandwidth-constrained with this setup.

Comments

Over in a comment on that LinkedIn post I mentioned above, Gianluca Filippini pointed out that using multiple Hailo NPUs isn't that bad:

HailoRT (and Tappas) provides a scheduler (RoundRobin) to handle multiple data pipelines using the same device. The example from github TAPPAS/multistream-multidevice shows how to use the scheduler on multiple hailo to handle different flows. As you can see this is all "data parallel" mode. AFAIK a "model split" across multiple device is still something more manual to be implemented via HailoRT API.
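To make the "data parallel" idea concrete, here is a generic sketch of round-robin assignment: each input stream is pinned to one device, and every device runs the whole model. This is not the HailoRT or TAPPAS API; the function and the stream and device names are hypothetical, just to illustrate the scheduling pattern:

```python
from itertools import cycle

def assign_streams_round_robin(streams, devices):
    """Map each stream to a device, cycling through the devices in order."""
    if not devices:
        raise ValueError("need at least one device")
    device_cycle = cycle(devices)
    return {stream: next(device_cycle) for stream in streams}

# Hypothetical names: four camera feeds shared across two accelerators.
streams = ["camera0", "camera1", "camera2", "camera3"]
devices = ["npu0", "npu1"]

assignment = assign_streams_round_robin(streams, devices)
for stream, device in assignment.items():
    print(f"{stream} -> {device}")
```

A "model split" would be a different design: one stream's data would pass through several devices in sequence, each running part of the network, which is the more manual case the comment describes.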

Jeff,
Keep doing the ridiculous things. It's interesting to watch, and also learn.
I want to be able to use a TPU, or an NPU, and be able to use an SSD; single-use PCI appears to be the only flaw of the Raspberry PCI HAT. Running Linux from a micro SSD even with the additional data lane, or three, as the system disk is ridiculous.

Talking about ridiculous stuff, Destin Sandlin of Smarter Every Day did something that was absolutely ridiculous, but also entertaining: firing two rifles at each other to get the bullets to hit each other in midair, which they succeeded at. It was also entertaining to see the cicadas make their buzzing sound in slo-mo.

I'm keenly waiting for a video and/or article about hybrid clusters...

Hey Jeff, any idea if you'll be working to see if these NPUs work with Windows Copilot? It would be cool to add NPU capabilities to custom-built PCs instead of just laptops, which is how it currently is, to my understanding.

I don't know yet—right now I'm trying to get my hands on one of Qualcomm's boxes so I can have the 'premium' Windows on Arm device to test with. Then I may fire up my Ampere workstation with Windows 11 again and see what I can do with some PCIe devices.