DevTalk.net

A blog on programming and quantitative finance

Performance of String Histogram Building in C++ AMP

without comments

We all know that GPUs are excellent number crunchers. Given a large amount of data, GPUs can process numerics up to 2 orders of magnitude more efficiently than today’s multi-core CPUs. But to date, the use of GPUs has been mainly restricted exactly to this scientific/mathematical domain, without much concern to alternative uses.

In this post, I want to take a look at other ways of exploiting GPUs – specifically, how individual strings as well as string arrays can be processed on the GPU and what kind of performance benefit we would be able to get. For the purposes of my investigations, I will be using an ATI 6800 series GPU with an ordinary Core 4 Quad. I will use ordinary C++ and C++ AMP as the technologies being compared. I will use primary Latin characters (in a range that can be displayed in a console), but in Unicode (i.e., wchar_t-based) strings as befits a modern framework.

Please note that the examples use minimum optimization of C++ code, i.e., common STL algorithms and approaches are used and no attempt is made to improve or replace basic C++ constructs. This, I beleive, constitutes a fairer test than attempting to fine-tune C++ for peak performance.

Character histogram

We’ll begin with a simple case: given a rather lengthy text, how long would it take to build a histogram of all the characters present in the string? We’ll use a special function to randomly generate strings of different sizes:

// uses chars 32-127
void fill_string(wchar_t* chars, int count)
{
  for (int i = 0; i < count; ++i)
  {
    chars[i] = 32 + rand() % 95;
  }
}

We’ll perform the experiment on strings with length from 2 to 224 to ascertain how the algorithm performs under different conditions. First, we begin with a C++ implementation – without any constraints on the architecture, we’ll use a simple map-based approach:

typedef concurrent_unordered_map<wchar_t,int> histogram;
unique_ptr<histogram> cpu_histogram(wchar_t* str, size_t count)
{
  unique_ptr<histogram> result(new histogram);
  parallel_for((size_t)0, count, [&](int i)
  {
    (*result)[str[i]]++;
  });
      
  return result;
}

The GPU function is quite a bit more complicated. First of all, there is no real way of treating elements as wchar_t types since C++ AMP does not recognize elements smaller than an uint32_t. Thus, any view over a wchar_t array has to use the uint32_t data type.

In order to test the histogram, I used an existing implementation written by Daniel Moth. I won’t replicate the code here. All I changed in Daniel’s implementation is the definition of the data type, which I set to wchar_t, as well as an implementation of an alternative constructor that took actual data rather than the data size.

From then on, what I did is created strings of sizes 2n and filled them in with random printable characters:

int size = sizes[i];
wchar_t* target = new wchar_t[size+1];
ZeroMemory(target, (size+1) * sizeof(wchar_t));
fill_string(target, size);

Then, I simply timed the ‘cost’ of the CPU and GPU calls and averaged out results:

aggregate = 0.0;
for (int s = 0; s < 20; ++s)
{
  Timer gpuTimer;
  gpuTimer.Start();
  auto gh = gpu_histogram(target, size);
  gpuTimer.Stop();
  aggregate += gpuTimer.Elapsed();
}
wcout << (aggregate / 20.0) << endl;

Measuring Performance

For performance measurement I found yet another useful blog post describing the steps necessary to get the measurement just right. I also used a high-res timer from this article (isn’t the internet wonderful?).

The end result is the following performance measurements from the CPU and GPU. This is from a Release build with all optimizations enabled. Each iteration was repeated 20 times with the results averaged.

Figure 1. Comparison of GPU and CPU performance for building a string histogram. The X axis corresponds to a string of length 2x, the Y axis represents the elapsed time in milliseconds, and is binary-log-scaled.

It’s clear that, up to some point, the CPU has an advantage, as the GPU is consistently paying a small but annoying start-up cost. As far as measuring the benefits, the parallel lines on the right hand side of the chart suggest a constant 8× performance improvement when using the GPU. Performance benefits of the GPU can only be appreciated when dealing with strings with length greater than 216 (65536) characters.

Conclusion

This experiment is just a small performance investigation, probably full of various methodological errors and inaccuracies, but the above has certainly been benefitial in terms of figuring out that:

  • The GPU kernel appears to have a fixed start-up cost associated with it. However, the start-up cost seems to be around 2ms, which is fairly insignificant, except maybe for things like high-frequency trading.

  • The GPU appears to have a linear advantage over the CPU (about 3 orders of binary magnitude), which is surprising, because I would expect non-linear divergence in the results.

  • The current approach seems to be entirely unfit for calculating histograms with unlimited character sets. The reason is that currently, a histogram array matches the size of all possible characters, and then there are equally many numbers of such arrays. Essentially, to make lots of arrays of size 2sizeof(wchar_t) is impossible – the GPU just doesn’t have thas much memory.

I’ll certainly be playing more with C++ AMP and string processing specifically. Meanwhile, if you’d like to get the source code for this article, you can find it here. Who knows, maybe your performance measurements will be entirely different? Or maybe the algorithm won’t even run on your machine? At any rate, let me know. ■

Written by Dmitri Nesteruk

August 24th, 2012 at 8:09 pm

Posted in CppAmp

Tagged with , ,