Running darktable on RISC-V
A while ago I got hold of a cheap Sipeed Lichee RV RISC-V development board. After finally getting it up and running, I wondered if, and how well, darktable would work on RISC-V. The answer is: surprisingly well, if the hardware is fast enough…
The Sipeed Lichee RV board
This is basically the slowest and cheapest Linux-capable RISC-V board you can currently get. The base board has an Allwinner D1 SoC with a single-core XuanTie C906 64-bit RISC-V processor core clocked at 1.0 GHz, 512 MB or 1 GB of DDR3 RAM, a 4K-capable GPU, a microSD card slot for storage and a USB-C port. The single core is supposed to be a little bit faster than the ARM core in the original Raspberry Pi Zero. CPU identification doesn’t tell us much:
sipeed@sipeed:~$ lscpu
Architecture: riscv64
Byte Order: Little Endian
CPU(s): 1
On-line CPU(s) list: 0
sipeed@sipeed:~$ cat /proc/cpuinfo
processor : 0
hart : 0
isa : rv64imafdc
mmu : sv39
uarch : thead,c906
In contrast to pretty much all other Single-Board Computers (SBCs), the base board doesn’t have any connectivity options besides USB-C. No HDMI, no WiFi, nothing. Theoretically it should be possible to get everything up and running with the base board alone, by connecting a USB Ethernet dongle to the USB-C port and supplying power via two separate pins. But that’s something for experts, so in most cases you’ll want the additional dock. It adds a WiFi chip, an additional USB-A port, a full-size HDMI port, a pin header row and some other ports that matter more for embedded devices.
The Sipeed Lichee RV base board sells for about 21 € on AliExpress; with the additional dock and shipping it’s about 36 €. That is a lot more than a Raspberry Pi Zero W, but still much less than the two other RISC-V development boards currently available (Allwinner Nezha, >130 €, and StarFive VisionFive, about 200 €). Better having a board than not having one at all. There are several additional development boards coming, like the StarFive VisionFive 2 (Kickstarter offers starting at about 60 € for a four-core StarFive JH7110 SoC with 4 GB of RAM, including taxes and shipping) and the Pine64 Star64 (same StarFive JH7110 SoC, even has a PCIe port, price expected to be around 60-80 €), but these won’t be shipped before the end of 2022.
I cannot recommend getting the Sipeed Lichee RV board. Performance is very bad in all regards: I get about 10 MByte/s reading from my fastest microSD card and 800 kByte/s transferring data via SFTP. The WiFi chip on the dock doesn’t have a proper antenna and only picks up a signal if the hotspot is very close, so I had to attach a USB Ethernet dongle. The price was okay-ish when there were no better options, but much better ones will be available soon. Also, the software and the community are not very well-developed. I had to make my own operating system image because the official image was outdated and the kernels of most alternative images didn’t support USB Ethernet dongles. I also don’t hear much good about the Nezha and VisionFive boards; apparently they have electrical issues and don’t reliably boot from SD cards.
If you are looking into RISC-V, wait for the Pine64 Star64. At least community support will definitely be much better than anything Sipeed can offer, and the board will have a PCIe slot, which can be used to attach an NVMe SSD or other goodies.
Building darktable on RISC-V
darktable is my favourite converter software for Raw files. It has a lot of optimizations for various CPU architectures and CPU features, and it also supports GPUs via OpenCL. This also means that it doesn’t just let you compile the source code on everything you have and then wait for the compiler errors. Compiling it on RISC-V fails immediately due to the strict CPU support macros.
Luckily this is rather easy to fix. The following patch works against the 4.0.0 stable release and all git commits up to at least ab7e374330a9e50abad0f2784bda4b319e770239 (Fri Aug 19 09:58:25 2022 +0200):
diff --git a/src/is_supported_platform.h b/src/is_supported_platform.h
index 165f071a5..b7afc8b0c 100644
--- a/src/is_supported_platform.h
+++ b/src/is_supported_platform.h
@@ -42,14 +42,21 @@
 #define DT_SUPPORTED_PPC64 0
 #endif
 
+#if (defined(__riscv) || defined(__riscv__)) && (__riscv_xlen==64)
+#define DT_SUPPORTED_RISCV64 1
+#else
+#define DT_SUPPORTED_RISCV64 0
+#endif
+
 #if DT_SUPPORTED_X86 && DT_SUPPORTED_ARMv8A
 #error "Looks like hardware platform detection macros are broken?"
 #endif
 
-#if !DT_SUPPORTED_X86 && !DT_SUPPORTED_ARMv8A && !DT_SUPPORTED_PPC64
-#error "Unfortunately we only work on amd64, ARMv8-A and PPC64 (64-bit little-endian only)."
+#if !DT_SUPPORTED_X86 && !DT_SUPPORTED_ARMv8A && !DT_SUPPORTED_PPC64 && !DT_SUPPORTED_RISCV64
+#error "Unfortunately we only work on amd64, ARMv8-A, PPC64 (64-bit little-endian only) and RISC-V (64-bit only)"
 #endif
 
+#undef DT_SUPPORTED_RISCV64
 #undef DT_SUPPORTED_PPC64
 #undef DT_SUPPORTED_ARMv8A
 #undef DT_SUPPORTED_X86
After this the source builds with the standard gcc 12.1.0 that comes with the current Debian Sid most images are based on. The only difference to the normal process is that we have to manually disable all the OpenCL stuff. I also found 3 to be a good level of concurrency on the Lichee RV, and GCC 12 actually even supports a tuning option specifically for the XuanTie C906 CPU:
$ CFLAGS="-mtune=thead-c906" CXXFLAGS="${CFLAGS}" cmake -DHAVE_OPENCL=Off -DTESTBUILD_OPENCL_PROGRAMS=Off ..
$ make -j3
Compiling took 297 minutes, 50.544 seconds. So pretty much five hours.
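The platform check added by the patch above boils down to a condition on two compiler-predefined macros. As a rough illustration only, the following Python sketch evaluates the same condition against a macro dump such as the one printed by `gcc -dM -E - < /dev/null`. The helper names `parse_macro_dump` and `is_riscv64` are made up for this example and are not part of darktable.

```python
import re


def parse_macro_dump(text):
    """Parse object-like `#define NAME VALUE` lines into a dict.

    Function-like macros (NAME followed directly by a parenthesis) are skipped.
    """
    macros = {}
    for line in text.splitlines():
        m = re.match(r"#define\s+(\w+)(?:\s+(.*))?$", line.strip())
        if m:
            macros[m.group(1)] = (m.group(2) or "").strip()
    return macros


def is_riscv64(macros):
    """Mirror the patch: (__riscv or __riscv__ defined) and __riscv_xlen == 64."""
    has_riscv = "__riscv" in macros or "__riscv__" in macros
    # An undefined macro evaluates to 0 in a C #if, so a missing value is "not 64".
    return has_riscv and macros.get("__riscv_xlen") == "64"


# Hand-written sample dumps, not captured from a real compiler run.
riscv_dump = "#define __riscv 1\n#define __riscv_xlen 64\n"
x86_dump = "#define __x86_64__ 1\n"

print(is_riscv64(parse_macro_dump(riscv_dump)))
print(is_riscv64(parse_macro_dump(x86_dump)))
```

Feeding it the real macro dump from the board's compiler is a quick way to see why the unpatched header rejects the platform and the patched one accepts it.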
Running darktable on RISC-V
I wanted darktable to have access to the full resources of the board, so I disabled the running LXDE desktop and used X11 Forwarding to my workstation for graphical output.
At startup, darktable emits a number of errors and warnings pertaining to the unknown CPU architecture. These are not critical; they just mean that all the optimized codepaths are disabled and the generic (slow) ones are used instead.
[dt_detect_cpu_features] Not implemented for this architecture.
[dt_detect_cpu_features] Please contribute a patch.
[dt_init] SSE2 instruction set is unavailable.
[dt_init] expect a LOT of functionality to be broken. you have been warned.
[dt_detect_cpu_features] Not implemented for this architecture.
[dt_detect_cpu_features] Please contribute a patch.
[dt_codepaths_init] will be using experimental plain OpenMP SIMD codepath.
I had a look at the dt_detect_cpu_features function to check what would be missing to add RISC-V support. There currently isn’t anything to do here, since there would be no code that would use the result of a CPU feature detection on RISC-V.
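For illustration, here is what a feature detection could start from if there ever were RISC-V-specific codepaths: the isa line from /proc/cpuinfo shown earlier. This is a minimal Python sketch, not darktable code. The function name `parse_isa` is hypothetical, and it only handles the simple form "rv64imafdc", optionally followed by underscore-separated multi-letter extensions. Strings with embedded version numbers are rejected.

```python
import re


def parse_isa(isa):
    """Split an ISA string such as 'rv64imafdc' into (xlen, set of extensions)."""
    m = re.fullmatch(r"rv(32|64|128)([a-z]+)((?:_[a-z0-9]+)*)", isa.strip().lower())
    if not m:
        raise ValueError(f"unrecognised ISA string: {isa!r}")
    xlen = int(m.group(1))
    exts = set(m.group(2))  # single-letter extensions: i, m, a, f, d, c, ...
    exts.update(e for e in m.group(3).split("_") if e)  # multi-letter ones
    return xlen, exts


xlen, exts = parse_isa("rv64imafdc")  # the value reported by the Lichee RV
print(xlen, sorted(exts))
print("vector extension reported:", "v" in exts)
```

For the Lichee RV's string, the set contains no "v", so the kernel reports no standard vector extension that an optimized codepath could rely on.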
Apart from the abysmal performance, darktable works exactly as expected on RISC-V. During the first couple of tries it would often crash with the following error message, but this hasn’t happened for a while now, so I guess it’s something that has been fixed in glibc/gcc/etc.
Inconsistency detected by ld.so: dl-runtime.c: 77: _dl_fixup: Assertion `ELFW(R_TYPE)(reloc->r_info) == ELF_MACHINE_JMP_SLOT' failed!
Performance comparison
Speaking of performance: It really is abysmal. Raw converters are never the fastest image editing tools, since they process everything with 32- or 64-bit floating point numbers internally. darktable puts particular emphasis on precision. The following measurements were generated with the exact same darktable profile on all devices, using the same 45.7 megapixel 14-bit Raw file taken with my Nikon D850 (the picture visible in the feature image of this post) and by running darktable-cli to remove the overhead of the GUI (where possible). The edits to this picture use a rather standard set of processing modules.
My workstation has a Ryzen 5900X CPU and most of the processing is offloaded to a Radeon RX 6600 XT GPU using OpenCL. Generating the preview thumbnail takes about 0.116 seconds. Exporting it at 7 MP (3240×2160 pixels) resolution, something I use all the time for full-size previews on my 4K screens, takes about 1.5 seconds. Exporting the image at full 45.7 MP resolution (8288×5520 pixels) takes a couple of seconds.
On the Lichee RV, generating the thumbnail already takes 69.107 seconds, so about 600 times as long as on the Ryzen/Radeon system. Exporting the picture at full resolution takes 578 minutes and 35.126 seconds, more than nine and a half hours…
Okay, this comparison was maybe extreme, so let’s make it more realistic and use my AMD Ryzen 5 4700U laptop instead. 6 cores, no OpenCL. Generating the thumbnail is in the same 0.1-second ballpark as on the Ryzen 5900X, exporting at 7 MP takes 4.74 seconds and exporting at 45.7 MP takes 30.616 seconds.
So the Lichee RV is about 4000 times slower than the Radeon RX 6600 XT and about 1200 times slower than the Ryzen 5 4700U. But it works 🙂
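As a sanity check, the ratios can be re-derived from the timings quoted above. The Python sketch below uses only numbers from this post. The workstation's full-resolution export time is given only as "a couple of seconds", so the factor relative to the Radeon cannot be recomputed here.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes + seconds timing to seconds."""
    return minutes * 60 + seconds


# Timings as quoted in the text, in seconds.
lichee_thumb = 69.107
workstation_thumb = 0.116
lichee_full = to_seconds(578, 35.126)
laptop_full = 30.616

thumb_ratio = lichee_thumb / workstation_thumb  # thumbnail: Lichee RV vs. workstation
export_ratio = lichee_full / laptop_full        # full export: Lichee RV vs. laptop
lichee_full_hours = lichee_full / 3600

print(f"thumbnail ratio:   {thumb_ratio:.0f}x")
print(f"full-export ratio: {export_ratio:.0f}x")
print(f"Lichee RV export:  {lichee_full_hours:.2f} h")
```

With these inputs the thumbnail ratio comes out a little under 600 and the full-export ratio against the laptop a little over 1100, the same order of magnitude as the rounded figures in the text.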
Looking at the Ryzen 5 4700U we’ve got 8 cores at 4+ GHz, Out-of-Order dataflow engine with 4 wide decode (up to 8 wide execute)
Vs 1 core at 1 GHz, in-order single decode / single execute.
1200/8/4/4 = 9.375
And then we’re left with I assume AVX vs scalar computation in the integer or floating point registers. I don’t know what precision it is using, but if it’s 16 bit integers in 256 bit SIMD vs 1 at a time … that’s a factor of 16.
So it’s no surprise at all, just looking at the architectures.
You should try it on a Pi Zero. That would be a more interesting comparison.
The 4700U has 6 cores and darktable uses FP32, but of course.
I think I have a Raspberry Pi 3 somewhere.