Mastodon.ie @mastodonie

1 post1 participant0 posts today

**Dr. Moritz Lehmann** @ProjectPhysX@mast.hpc.social · 2d *

Dr. Moritz Lehmann @ProjectPhysX@mast.hpc.social

I just uploaded the 5000th #OpenCL hardware report to @sascha's gpuinfo.org database! And guess what #GPU I reserved the spot for: #Intel Arc B580 #Battlemage
https://opencl.gpuinfo.org/displayreport.php?id=5000
I have contributed 4.2% (211) of all entries.

opencl.gpuinfo.orgIntel(R) Arc(TM) B580 Graphics - OpenCL Hardware Database by Sascha Willems

**Lukas Weidinger** @lukasweidinger@gruene.social · Apr 12

Apr 12

Lukas Weidinger @lukasweidinger@gruene.social

I’m thinking of #compiling #darktable from source so that it’s better optimized for my processor.
Anybody experience with its potential? #question #followerpower

I’m generally ok with how fast the flatpak runs on my i7-1255 laptop. However, with such an iterative workflow, I feel that one has much to gain with slight improvements via #opencl and AVX.

**Dr. Moritz Lehmann** @ProjectPhysX@mast.hpc.social · Apr 10

Apr 10

Dr. Moritz Lehmann @ProjectPhysX@mast.hpc.social

What an honor to start the #IWOCL conference with my keynote talk! Nowhere else you get to talk to so many #OpenCL and #SYCL experts in one room! I shared some updates on my #FluidX3D #CFD solver, how I optimized it at the smallest level of a single grid cell, to scale it up on the largest #Intel #Xeon6 #HPC systems that provide more memory capacity than any #GPU server.

**Dr. Moritz Lehmann** @ProjectPhysX@mast.hpc.social · Apr 8

Apr 8

Dr. Moritz Lehmann @ProjectPhysX@mast.hpc.social

Just arrived in wonderful Heidelberg, looking forward to present the keynote talk at #IWOCL tomorrow!! See you there!
https://www.iwocl.org/ #OpenCL #SYCL #FluidX3D #GPU

me in Heidelberg with the Neckar river in the background

**Giuseppe Bilotta** @giuseppebilotta@fediscience.org · Mar 25 *

Mar 25 *

Giuseppe Bilotta @giuseppebilotta@fediscience.org

I'm liking the class this year. Students are attentive and participating, and the discussion is always productive.

We were discussing the rounding up of the launch grid in #OpenCL to avoid the catastrophic performance drops that come from the inability to divide the “actual” work size by anything smaller than the maximum device local work size, and were discussing on how to compute the “rounded up” work size.

The idea is this: given the worksize N and the local size L, we have to round N to the smallest multiple of L that is not smaller than N. This effectively means computing D = ceili(N/L) and then using D*L.

There are several ways to compute D, but on the computer, working only with integers and knowing that integer division always rounded down, what is the “best way”?

D = N/L + 1 works well if N is not a multiple of L, but gives us 1 more than the intended result if N *is* a multiple of L. So we want to add the extra 1 only if N is not a multiple. This can be achieved for example with

D = N/L + !!(N % L)

which leverages the fact that !! (double logical negation) turns any non-zero value into 1, leaving zero as zero. So we round *down* (which is what the integer division does) and then add 1 if (and only if) there is a reminder to the division.

This is ugly not so much because of the !!, but because the modulus operation % is slow.

1/n

**mirror::box::milo** @mirrorboxmilo@mstdn.social · Mar 24

Mar 24

mirror::box::milo @mirrorboxmilo@mstdn.social

GIF

#generativeart #artificiallife #reactiondiffusion

**Dr. Moritz Lehmann** @ProjectPhysX@mast.hpc.social · Mar 22 *

Mar 22 *

Dr. Moritz Lehmann @ProjectPhysX@mast.hpc.social

I got access to @LRZ_DE's new coma-cluster for #OpenCL benchmarking and experimentation
I've added a ton of new #FluidX3D #CFD #GPU/#CPU benchmarks:
https://github.com/ProjectPhysX/FluidX3D?tab=readme-ov-file#single-gpucpu-benchmarks

Notable hardware configurations include:
- 4x H100 NVL 94GB
- 2x Nvidia L40S 48GB
- 2x Nvidia A2 15GB datacenter toaster
- 2x Intel Arc A770 16GB
- AMD+Nvidia SLI abomination consisting of 3x Instinct MI50 32GB + 1x A100 40GB
- AMD Radeon 8060S (chonky Ryzen AI Max+ 395 iGPU with quad-channel RAM) thanks to @cheese

The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs and CPUs via OpenCL. Free for non-commercial use. - ProjectPhysX/FluidX3D

GitHubGitHub - ProjectPhysX/FluidX3D: The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs and CPUs via OpenCL. Free for non-commercial use.The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs and CPUs via OpenCL. Free for non-commercial use. - ProjectPhysX/FluidX3D

**Microfractal** @Microfractal@mathstodon.xyz · Mar 14

Mar 14

Microfractal @Microfractal@mathstodon.xyz

My first good #OpenCL #Mandelbrot #fractal using #perturbation. (Separated from the fragment shader, which does the coloring of the computed iterations.)

Next step is a formula parser, which generates opencl-code, which can be compiled at runtime.

$A screenshot of my fractal generator, which shows some GUI-Elements and a square image of the Mandelbrot set.$

#fractalfriday

Continued thread

**GPUOpen** @gpuopen@mastodon.gamedev.place · Mar 11

Mar 11

GPUOpen @gpuopen@mastodon.gamedev.place

AMD Radeon GPU Analyzer (RGA) is our performance analysis tool for #DirectX, #Vulkan, SPIR-V, #OpenGL, & #OpenCL.

As well as updates for AMD RDNA 4, there's enhancements to the ISA view UI, using the same updated UI as RGP

More detail: https://gpuopen.com/learn/rdna-cdna-architecture-disassembly-radeon-gpu-analyzer-2-12/?utm_source=mastodon&utm_medium=social&utm_campaign=rdts
(5/7)

AMD GPUOpenReading AMD RDNA™ and CDNA™ Architecture Disassembly is a Breeze with AMD Radeon™ GPU Analyzer v2.12Explore Radeon GPU Analyzer v2.12's enhanced ISA disassembly view, making shader and kernel analysis easier with new tooltips and automatic highlighting. Optimize your GPU workflows effortlessly!

Continued thread

**Dr. Moritz Lehmann** @ProjectPhysX@mast.hpc.social · Mar 9 *

Mar 9 *

Dr. Moritz Lehmann @ProjectPhysX@mast.hpc.social

Here's my #OpenCL implementation: https://github.com/ProjectPhysX/FluidX3D/blob/master/src/kernel.cpp#L1924-L1993

**Dr. Moritz Lehmann** @ProjectPhysX@mast.hpc.social · Mar 9

Mar 9

Dr. Moritz Lehmann @ProjectPhysX@mast.hpc.social

#FluidX3D #CFD v3.2 is out! I've implemented the much requested #GPU summation for object force/torque; it's ~20x faster than #CPU #multithreading.
Horizontal sum in #OpenCL was a nice exercise - first local memory reduction and then hardware-supported atomic floating-point add in VRAM, in a single-stage kernel. Hammering atomics isn't too bad as each of the ~10-340 workgroups dispatched at a time does only a single atomic add.
Also improved volumetric #raytracing!
https://github.com/ProjectPhysX/FluidX3D/releases/tag/v3.2

FluidX3D simulation of the X-wing with velocity raytracing visualization

FluidX3D simulation of the X-wing with density raytracing visualization

Replied in thread

**Dr. Moritz Lehmann** @ProjectPhysX@mast.hpc.social · Mar 6

Mar 6

Dr. Moritz Lehmann @ProjectPhysX@mast.hpc.social

@hpcnotes the very first actually CPU multithreading in #Java.
Then #OpenCL in Java through aparapi.
And finally OpenCL with C++ and stuck with that ever since.

**Dr. Moritz Lehmann** @ProjectPhysX@mast.hpc.social · Mar 3 *

Mar 3 *

Dr. Moritz Lehmann @ProjectPhysX@mast.hpc.social

Hot Aisle's 8x AMD #MI300X server is the fastest computer I've ever tested in #FluidX3D #CFD, achieving a peak #LBM performance of 205 GLUPs/s, and a combined VRAM bandwidth of 23 TB/s.
The #RTX 5090 looks like a toy in comparison.

MI300X beats even Nvidia's GH200 94GB. This marks a very fascinating inflection point in #GPGPU: #CUDA is not the performance leader anymore.
You need a cross-vendor language like #OpenCL to leverage its power.

FluidX3D on #GitHub: https://github.com/ProjectPhysX/FluidX3D

FluidX3D benchmarks: the 8x AMD MI300X system leaves every other benchmarked computer behind in the dust.

**HGPU group** @hgpu@mast.hpc.social · Mar 3

Mar 3

HGPU group @hgpu@mast.hpc.social

pyATF: Constraint-Based Auto-Tuning in Python

#OpenCL #CUDA #Performance #AutoTuning #Compilers #Python #Package

https://hgpu.org/?p=29798

hgpu.org · Mar 3pyATF: Constraint-Based Auto-Tuning in PythonWe introduce pyATF – a new, language-independent, open-source auto-tuning tool that fully automatically determines optimized values of performance-critical program parameters. A major feature…

**Dr. Moritz Lehmann** @ProjectPhysX@mast.hpc.social · Feb 22

Feb 22

Dr. Moritz Lehmann @ProjectPhysX@mast.hpc.social

My OpenCL-Benchmark now uses the dp4a instruction on supported hardware (#Nvidia Pascal, #Intel #Arc, #AMD RDNA, or newer) to benchmark INT8 tghroughput.
dp4a is not exposed in #OpenCL C, but can still be used via inline PTX assembly and compiler pattern recognition. Even Nvidia's compiler will turn the emulation implementation into dp4a, but in some cases does so with a bunch of unnecessary shifts/permutations on inputs, so better use inline PTX directly.
https://github.com/ProjectPhysX/OpenCL-Benchmark/releases/tag/v1.8

dp4a implementation in OpenCL, using either inline PTX assembly on Nvidia GPUs with at least compute capability 6.1, or fallback emulatuion which compilers may turn into dp4a via pattern recognition.

INT8 benchmark on Nvidia H100 SXM5 80GB HBM3. dp4a ~quadruples INT8 throughput over char4 multiplication/addition.

**HGPU group** @hgpu@mast.hpc.social · Feb 16

Feb 16

HGPU group @hgpu@mast.hpc.social

Teaching An Old Dog New Tricks: Porting Legacy Code to Heterogeneous Compute Architectures With Automated Code Translation

#OpenCL #CUDA #Fortran #MPI #CodeGeneration

https://hgpu.org/?p=29744

hgpu.org · Feb 16Teaching An Old Dog New Tricks: Porting Legacy Code to Heterogeneous Compute Architectures With Automated Code TranslationLegacy codes are in ubiquitous use in scientific simulations; they are well-tested and there is significant time investment in their use. However, one challenge is the adoption of new, sometimes in…

**Collabora** @collabora@floss.social · Feb 13

Feb 13

Collabora @collabora@floss.social

Arm Mali #Panfrost Driver Lands #OpenCL C Support In Mesa 25.1

https://www.phoronix.com/news/Panfrost-Lands-OpenCL-C #OpenSource

www.phoronix.comArm Mali Panfrost Driver Lands OpenCL C Support In Mesa 25.1

**Alauddin Maulana Hirzan** @maulanahirzan@bsd.cafe · Feb 12

Feb 12

Alauddin Maulana Hirzan @maulanahirzan@bsd.cafe

Other things I have tested with FreeBSD: OpenCL with Discrete GPU via PyOpenCL lib

#FreeBSD #OpenCL #PyOpenCL

**Dr. Moritz Lehmann** @ProjectPhysX@mast.hpc.social · Feb 8

Feb 8

Dr. Moritz Lehmann @ProjectPhysX@mast.hpc.social

#FluidX3D #CFD v3.1 is out! I have updated the #OpenCL headers for better device specs detection via device ID and Nvidia compute capability, fixed broken voxelization on some #GPUs and added a workaround for a CPU compiler bug that corrupted rendering. Also AMD GPUs will now show up with their correct name (no idea why they can't report it as CL_DEVICE_NAME like every other sane vendor and instead need CL_DEVICE_BOARD_NAME_AMD extension...)
Have fun!
https://github.com/ProjectPhysX/FluidX3D/releases/tag/v3.1

**Alauddin Maulana Hirzan** @maulanahirzan@bsd.cafe · Feb 2 *

Feb 2 *

Alauddin Maulana Hirzan @maulanahirzan@bsd.cafe

aaah, nothing can beat the feel of beefed up FreeBSD with working dGPU.
1. OpenCL ✓
2. OBS RenderD129 ✓
Thanks to @vermaden for pointing my fault.

#FreeBSD #amdgpu #opencl

Recent searches

Search options

Administered by:

Server stats:

#opencl