# DevTalk.net

A blog on programming and quantitative finance

## A brief note on GPU programming

A couple of years ago I ‘came up’ with the idea of performing calculations faster on a graphics card. Of course, as soon as I sat down at the computer, it turned out that the field of GPGPU (general-purpose computation on GPUs) already existed and I hadn’t really discovered anything. Still, it was a curious revelation.

At the time, in order to get a speed-up from a graphics card, you basically had to ‘deceive’ it by presenting your data as a texture and letting a pixel shader – a programmable GPU component capable of determining a pixel’s color – perform mathematical computations on your behalf.

Luckily, things have improved since then. GPU manufacturers realized that their processors were far more efficient than general-purpose CPUs, albeit for a certain category of tasks. And since this category covers any kind of math where large quantities of data can be processed in parallel, that near-universal applicability has given rise to things such as CUDA. CUDA is NVidia’s unified API for programming the graphics card – not just for graphics but for any kind of computation where a few hundred processor cores are of benefit.

### Complexity abounds

GPU computation does come with a set of limitations. It’s predominantly geared towards float data types, with double support only now appearing. Some of its numeric behavior differs from the equivalent standards on the CPU – and it differs in obscure, hard-to-understand ways. Most important of all, designing an algorithm for a GPU is entirely different from designing one for a CPU.
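To give a feel for the kind of numeric behavior in question, here is a small CPU-side C# sketch contrasting single- and double-precision accumulation; the exact discrepancies you would see on a GPU depend on the hardware and compiler settings, so this is an illustration rather than a GPU result.

```csharp
using System;

class PrecisionDemo
{
    static void Main()
    {
        // Accumulate the same small increment ten million times,
        // once in single and once in double precision.
        float  singleSum = 0f;
        double doubleSum = 0.0;

        for (int i = 0; i < 10000000; ++i)
        {
            singleSum += 0.1f;
            doubleSum += 0.1;
        }

        // The float total drifts far from the expected 1,000,000 because of
        // accumulated rounding error; the double stays close to it.
        Console.WriteLine("float:  {0}", singleSum);
        Console.WriteLine("double: {0}", doubleSum);
    }
}
```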

In a way, GPUs exemplify the ‘multicore crisis’ – the problem that the majority of developers are simply not smart enough to ‘grok’ the new paradigm. This is precisely why Microsoft brought out its TPL (Task Parallel Library) API: TPL radically simplifies both task-level and data-level parallelism. Suddenly, if we want to run several methods in parallel, we use Parallel.Invoke; if we want to process a loop in parallel, we use Parallel.For/ForEach; and should we need to perform any kind of heavy data processing, we simply inject an AsParallel() call into our LINQ method chain and that’s it – our algorithm is supposedly made parallel, almost by magic.
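For reference, here is roughly what those three forms look like in C#. The data and lambdas are made up for illustration; Parallel.Invoke, Parallel.For and AsParallel() are the actual TPL/PLINQ entry points mentioned above.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

class TplDemo
{
    static void Main()
    {
        // Task-level parallelism: run several independent actions at once.
        Parallel.Invoke(
            () => Console.WriteLine("first task"),
            () => Console.WriteLine("second task"));

        // Data-level parallelism: process loop iterations in parallel.
        var roots = new double[1000];
        Parallel.For(0, roots.Length, i => roots[i] = Math.Sqrt(i));

        // PLINQ: injecting AsParallel() into a LINQ chain parallelizes it.
        var total = Enumerable.Range(1, 1000000)
                              .AsParallel()
                              .Select(x => Math.Sin(x))
                              .Sum();

        Console.WriteLine(total);
    }
}
```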

With GPUs, none of these tricks work. The developer doesn’t have the option of not knowing the architecture of the GPUs they code for. Also, the combined complexity of blocks, threads and streams – together with the problems of marshalling data to and from the GPU – makes the task of writing GPU-specific algorithms that much harder.

### Simplifying things (somewhat)

Now, just to be clear, there is no antidote to the complexity of parallel algorithms. We have some ‘hacks’ for the CPU which let us not worry about frameworks such as OpenMP or PPL/Intel TBB, but it is unlikely that a ‘classical’ (CPU-specific) implementation of an algorithm can suddenly be translated to the GPU automatically. There’s no escaping the GPU specifics.

On the other hand, there are frameworks such as Microsoft Accelerator or GPU.NET which make things more manageable by hiding the really gory bits from us. For example, GPU.NET lets you define kernels right in .NET code, which removes the need to marshal data to and from the GPU by hand. Also, being a managed solution, it automatically benefits from the support of tools such as ReSharper.
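To give an idea of what ‘kernels right in .NET code’ means, here is a rough sketch of the shape such code takes. The [Kernel] attribute and the index types below are placeholders of my own, not GPU.NET’s actual API – they are there purely to convey the idea.

```csharp
using System;

// Placeholder declarations standing in for whatever the framework provides;
// they exist only so that this sketch compiles and shows the overall shape.
[AttributeUsage(AttributeTargets.Method)]
class KernelAttribute : Attribute { }

static class ThreadIndex { public static int X; }
static class BlockIndex  { public static int X; }
static class BlockSize   { public static int X; }

static class VectorKernels
{
    [Kernel] // marks the method as one to be compiled for the GPU
    public static void Add(float[] a, float[] b, float[] result)
    {
        // Conceptually, each GPU thread computes one element of the result;
        // on the device these indices come from the launch configuration.
        int i = BlockIndex.X * BlockSize.X + ThreadIndex.X;
        if (i < result.Length)
            result[i] = a[i] + b[i];
    }
}
```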

Personally, I see a persistent trend here of transcoding certain languages (or their compiled form) into another representation. And if IL can be transcoded into a stream of NVidia-specific instructions, who’s to say it cannot be transcoded into, e.g., a netlist for synthesis on an FPGA?

Written by Dmitri Nesteruk

July 22nd, 2011 at 9:21 pm

Posted in DotNet