»
S
I
D
E
B
A
R
«
Playing with the FFI and OpenCL
September 22nd, 2009 by Axman6

A while ago I found Björn Edström’s quick intro to the Haskell FFI , and I decided to follow along and find out what it was to interface Haskell with C. I found it quite interesting, noticed the C was faster than the haskell version I wrote, and then sort of forgot about it all.

Then Snow Leopard came out. I read the Ars Technica review of it, and realised that Grand Central Dispatch was frigging awesome, because it makes doing things concurrently in C damn easy. I went off and implemented a few programs using it, and had a lot of fun. It was especially interesting because I knew we were coming up to learning about concurrency in C in our Concurrent and Distributed Systems course. I’d heard about pthreads, and could see they were not going to be pleasant to use, at least not without some higher level interface. Blocks along with GCD are obviously an improvement, though certainly not an alternative.

It was about then that I realised two things: Firstly, I could parallelise the dctiv function from from Björn’s article using GCD, and secondly, that I was taking a Signals Processing course, and I actually understood what the code was doing. So off I went, turning this code:


void dctiv(int points, float *in, float *out){
    int k, n;
    float sum, cosConst, kk;
    for (k = 0;k < points; k++ ){
        sum = 0;
        kk = (k + 0.5f);
        cosConst = (PI/points) * kk;
        for (n = 0; n < points; n++){
            sum += in[n] * cos(cosConst * (n+0.5f));
        }
        out[k] = sum;
    }
}

into this:

#include <dispatch/dispatch.h>
...
void dctivGCD(int points, float *in, float *out){
    dispatch_apply(points, dispatch_get_global_queue(0,0), ^(size_t k){ 
        float sum, cosConst, kk;
        int n;
        sum = 0;
        kk = (k + 0.5f);
        cosConst = (PI/points) * kk;
        for (n = 0; n < points; n++){
            sum += in[n] * cos(cosConst * (n+0.5f));
        }
        out[k] = sum;
    });
}

Notice how similar that code is to the original? It also has the added bonus of being slightly less error prone, since you just specify the number of times you wish the block you provide be executed, and GCD places n work units on the provided queue for you for, and passes each one their own id, which is used in this case to tell the block which place in the output the result should be placed in.

So I went off and benchmarked this new code, and found that on my Core 2 Duo, it was about 40% faster. Hoorah! (for some reason, I can now not get this code to execute on more than one thread, so I’m no longer getting any speedups :( But that’s another story). That was all well and good, but where to go next?

While I think GCD is by far the most important feature of Snow Leopard, it also has another related feature that has stirred a lot of attention in the in the lead up to Snow Leopard’s release: OpenCL, the Open Computing Language. It is supposed to provide a common interface allowing a programmer to take advantage of parallelism on their CPU, GPU, Cell or what have you. I’d gotten somewhat interested by this, and decided to take a look at the Apple OpenCL examples. I’d thought that OpenCL would be hard to use, and I was right. Luckily, the OpenCL Hello World example was an example of something that was very close to what I was doing in the dctiv function. I decided to give OpenCL a whirl, and copied and pasted the most of the hello.c file into a new function, modified it… and it didn’t work. I’m not going to go into why it didn’t work, because it was mainly my stupidity, but I will say that if you if you’re working with floats in C/OpenCL, make sure your constants are in the form of 0.5f.

After some work, I finally got my kernel to compile:


const char KernelSource = "\n" \
"__kernel void DCT(                                                     \n" \
"   __global const float input,                                        \n" \
"   __global float* output,                                             \n" \
"   const unsigned int count)                                           \n" \
"{                                                                      \n" \
"   int n, i = get_global_id(0);                                        \n" \
"   float sum, cosConst, kk;                                            \n" \
"                                                                       \n" \
"   sum = 0;                                                            \n" \
"   kk = (i + 0.5f);                                                    \n" \
"   cosConst = (3.141592653589793f/count) * kk;                         \n" \
"   for(n = 0; n < count; ++n)                                          \n" \
"   {                                                                   \n" \
"       sum += input[n] * cos(cosConst * (n+0.5f));                     \n" \
"   }                                                                   \n" \
"   output[i] = sum;                                                    \n" \
"}                                                                      \n" \
"\n";

I had also made a FFI binding to the function that ran this (left out, because it’s 149 lines long). I started benchmarking, and couldn’t quite believe the speed: on 215 elements, the C takes 44.43s, the OpenCL version on the same input takes 7.11s. I’m a little disappointed with my Haskell versions, running about half as fast as the C version. For some reason I can’t seem to get them to actually run in parallel, it’s rather frustrating. For those interested, here are the two haskell versions:

dctivHask :: Int -> [Float] -> [Float]
dctivHask points ps = parMap rnf forK [0..fromIntegral (points-1)]
    where
        !piOn = myPi / fromIntegral points
        forK k = foldl' (f $ (k+0.5)  piOn) 0 $ zipWith V [0.0..] ps
        f !kp !sum' (V !n !p) = sum' + p  cos (kp * (n+0.5))

Which takes about 71.22s, and using unboxed arrays:

dctivHaskArr :: Int -> [Float] -> [Float]
dctivHaskArr points ps = pmap f . range . bounds $ input
    where input = listArray (0,points-1) ps :: UArray Int Float
          f :: Int -> Float
          f !k = g 0.0 0
            where cosConst = (myPi / fromIntegral points)  (0.5 + fromIntegral k) :: Float
                  g :: Float -> Int -> Float
                  g !num !n = {-# SCC "dctivHaskArr-worker" #-}
                      let nn = 0.5 + fromIntegral n :: Float
                        in if n < points
                            then g (num + ((input ! n)  cos (cosConst * nn))) (n+1)
                            else num 

Which takes about 66.96s (these are just one off test numbers because I get bored waiting for the haskell versions to finish).

I should probably follow up this post with… more stuff… so stay tuned (and please leave comments, because i don’t know what that stuff should be yet!)

– Axman


2 Responses  
  • Jeff Heard writes:
    September 23rd, 2009 at 12:22 AM

    I released a whole binding to OpenCL about a month ago on Hackage. OpenCLRaw.

  • Axman6 writes:
    September 23rd, 2009 at 3:41 AM

    Hi Jeff, I’ve had a look at your bindings, and I couldn’t figure out how to use them (not that I tried very hard). I was more interested in playing with OpenCL using C, and playing with the FFI. I’ll have to try implementing this using your bindings when I have the semester break next week.


Leave a Reply

»  Substance: WordPress   »  Style: Ahren Ahimsa
© Alex Mason (Axman6) 2009