My #IWOCL 2025 Keynote presentation is online!
Scaling up #FluidX3D #CFD beyond 100 Billion cells on a single computer - a story about the true cross-compatibility of #OpenCL
https://www.youtube.com/watch?v=Sb3ibfoOi0c&list=PLA-vfTt7YHI2HEFrpzPhhQ8PhiztKhHU8&index=1
Slides: https://www.iwocl.org/wp-content/uploads/iwocl-2025-moritz-lehmann-keynote.pdf
I just uploaded the 5000th #OpenCL hardware report to @sascha's gpuinfo.org database! And guess what #GPU I reserved the spot for: #Intel Arc B580 #Battlemage
https://opencl.gpuinfo.org/displayreport.php?id=5000
I have contributed 4.2% (211) of all entries.
If you're using darktable 5.0.1, the latest update in the Fedora Linux repo (Flatpak) may have just added OpenCL support for Radeon GPUs (at least it did for my RX 6600).
The Flathub version doesn't seem to add OpenCL (currently), so it may be a Fedora thing.
I have not installed the RPM version so far, so I'm not sure about that package.
I’m thinking of #compiling #darktable from source so that it’s better optimized for my processor.
Does anybody have experience with its potential? #question #followerpower
I’m generally OK with how fast the Flatpak runs on my i7-1255 laptop. However, with such an iterative workflow, I feel there is much to gain from even slight improvements via #opencl and AVX.
What an honor to start the #IWOCL conference with my keynote talk! Nowhere else do you get to talk to so many #OpenCL and #SYCL experts in one room! I shared some updates on my #FluidX3D #CFD solver: how I optimized it at the level of a single grid cell to scale it up on the largest #Intel #Xeon6 #HPC systems, which provide more memory capacity than any #GPU server.
I'm liking the class this year. Students are attentive and participating, and the discussion is always productive.
We were discussing rounding up the launch grid in #OpenCL to avoid the catastrophic performance drops that come from not being able to divide the “actual” work size by anything smaller than the maximum device local work size, and how to compute the “rounded-up” work size.
The idea is this: given the work size N and the local size L, we have to round N up to the smallest multiple of L that is not smaller than N. This effectively means computing D = ceil(N/L) and then using D*L.
There are several ways to compute D, but on the computer, working only with integers and knowing that integer division always rounds down, what is the “best” way?
D = N/L + 1 works well if N is not a multiple of L, but gives us 1 more than the intended result if N *is* a multiple of L. So we want to add the extra 1 only if N is not a multiple. This can be achieved for example with
D = N/L + !!(N % L)
which leverages the fact that !! (double logical negation) turns any non-zero value into 1, leaving zero as zero. So we round *down* (which is what the integer division does) and then add 1 if (and only if) there is a remainder to the division.
This is ugly not so much because of the !!, but because the modulus operation % is slow.
1/n
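For reference, a minimal sketch in plain C of both variants (the names are mine; the modulus-free (N + L - 1)/L form is the standard alternative, assuming unsigned values, L > 0, and no overflow in N + L - 1):

#include <stdio.h>

// round N up to the next multiple of L using division + modulus, as discussed above
unsigned round_up_mod(unsigned N, unsigned L) {
    return (N / L + !!(N % L)) * L;
}

// the classic modulus-free trick: a single integer division, assuming N + L - 1 doesn't overflow
unsigned round_up_add(unsigned N, unsigned L) {
    return ((N + L - 1) / L) * L;
}

int main(void) {
    printf("%u %u\n", round_up_mod(1000, 64), round_up_add(1000, 64)); // both print 1024
    printf("%u %u\n", round_up_mod(1024, 64), round_up_add(1024, 64)); // both print 1024
    return 0;
}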
AMD Radeon GPU Analyzer (RGA) is our performance analysis tool for #DirectX, #Vulkan, SPIR-V, #OpenGL, & #OpenCL.
As well as updates for AMD RDNA 4, there are enhancements to the ISA view UI, which now uses the same updated UI as RGP.
More detail: https://gpuopen.com/learn/rdna-cdna-architecture-disassembly-radeon-gpu-analyzer-2-12/?utm_source=mastodon&utm_medium=social&utm_campaign=rdts
(5/7)
Here's my #OpenCL implementation: https://github.com/ProjectPhysX/FluidX3D/blob/master/src/kernel.cpp#L1924-L1993
#FluidX3D #CFD v3.2 is out! I've implemented the much-requested #GPU summation for object force/torque; it's ~20x faster than #CPU #multithreading.
Horizontal sum in #OpenCL was a nice exercise - first local memory reduction and then hardware-supported atomic floating-point add in VRAM, in a single-stage kernel. Hammering atomics isn't too bad as each of the ~10-340 workgroups dispatched at a time does only a single atomic add.
Also improved volumetric #raytracing!
https://github.com/ProjectPhysX/FluidX3D/releases/tag/v3.2
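For anyone curious about the pattern, here's a minimal single-stage reduction sketch (not the actual FluidX3D kernel, which is linked above; it assumes a power-of-two local size and fp32 global atomic add via the cl_ext_float_atomics extension):

#pragma OPENCL EXTENSION cl_ext_float_atomics : enable

// local-memory tree reduction, then one FP atomic add per workgroup into VRAM
kernel void sum_reduce(const global float* data, const uint n, volatile global atomic_float* result, local float* cache) {
    const uint gid = get_global_id(0), lid = get_local_id(0);
    cache[lid] = gid < n ? data[gid] : 0.0f; // load into local memory, pad the tail with zeros
    barrier(CLK_LOCAL_MEM_FENCE);
    for (uint s = get_local_size(0) / 2u; s > 0u; s /= 2u) { // tree reduction, needs power-of-2 local size
        if (lid < s) cache[lid] += cache[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0u) atomic_fetch_add(result, cache[0]); // single atomic add per workgroup
}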
My OpenCL-Benchmark now uses the dp4a instruction on supported hardware (#Nvidia Pascal, #Intel #Arc, #AMD RDNA, or newer) to benchmark INT8 throughput.
dp4a is not exposed in #OpenCL C, but can still be used via inline PTX assembly and compiler pattern recognition. Even Nvidia's compiler will turn the emulation implementation into dp4a, but in some cases it does so with a bunch of unnecessary shifts/permutations on the inputs, so it's better to use inline PTX directly.
https://github.com/ProjectPhysX/OpenCL-Benchmark/releases/tag/v1.8
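A quick sketch of the two routes (the helper names are mine, not necessarily how OpenCL-Benchmark implements it):

// portable emulation; pattern-matching compilers may lower this to a single dp4a instruction
inline int dp4a_emulated(const int a, const int b, const int c) {
    const char4 a4 = as_char4(a), b4 = as_char4(b); // reinterpret each 32-bit int as 4x int8
    return c + a4.x*b4.x + a4.y*b4.y + a4.z*b4.z + a4.w*b4.w;
}

// direct dp4a via inline PTX assembly; only works with Nvidia's OpenCL compiler
inline int dp4a_ptx(const int a, const int b, const int c) {
    int d;
    asm volatile("dp4a.s32.s32 %0, %1, %2, %3;" : "=r"(d) : "r"(a), "r"(b), "r"(c));
    return d;
}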
Other things I have tested with FreeBSD: OpenCL on a discrete GPU via the PyOpenCL lib
aaah, nothing can beat the feel of a beefed-up FreeBSD with a working dGPU.
1. OpenCL ✓
2. OBS RenderD129 ✓
Thanks to @vermaden for pointing out my mistake.
#NVIDIA #GeForce #RTX5090 #Linux #GPU Compute Performance #Benchmarks
When taking the geo mean across 60+ benchmarks of #CUDA / #OptiX / #OpenCL / #Vulkan Compute, the GeForce RTX 5090 was delivering 1.42x the performance of the GeForce #RTX4090. On performance-per-Watt, the GeForce RTX 5090 tended to deliver similar power efficiency to the RTX 4080/4090 graphics cards.
GeForce RTX 5090 Founders Edition was running cooler than many of the other Founders Edition graphics cards tested.
https://www.phoronix.com/review/nvidia-geforce-rtx5090-linux
@BenjaminHCCarr another article on #GPU code portability where people put their heads in the sand and pretend very hard that #OpenCL doesn't exist...
OpenCL solved #GPGPU cross-compatibility 16 years ago already, and today it is in better shape than ever.
A comparison of HPC-based quantum computing simulators using Quantum Volume
Finally found the downtime to finish reading this fantastic survey of managed runtimes (e.g. the JVM) and heterogeneous hardware (e.g. CPUs and GPUs or FPGAs) by @snatverk@mastodon.online, @thanos_str@mastodon.sdf.org, and @kotselidis@mastodon.online.
Required reading for those who want a look at the future of software development.
#TornadoVM #JOCL #OpenCL #CUDA
(comment on "Programming Heterogeneous Hardware via Managed Runtime Systems")