OpenCL From Haskell – Hello World!
December 17th, 2011 by axman6

It’s been a very long time since I’ve even looked at this blog, so I thought I should do something about that. For the past two days, I’ve been working on making the OpenCLWrappers nee OpenCLRaw package more usable, while fixing some bugs while I’m at it.

The main change I wanted to make was to move from everything returning IO (Either ErrorCode a) or IO (Maybe ErrorCode) to a more useable OpenCL monad. The obvious way to do this is to use ErrorT:

> type OpenCL a = ErrorT ErrorCode IO a

(Be sure to comment out the previous line if you decide to use this is a literate haskell file.)

This involved first converting all the IO (Maybe ErrorCode) functions to IO (Either ErrorCode ()) first, and then implementing the OpenCL monad wrapper on top of that. This has resulted in a new set of modules under System.OpenCL.Monad.

To demonstrate how to make use of this initial work, I’ll use a slightly modified version of the canonical CUDA/OpenCL example which takes two vectors of floats, an adds them. My slight modification is to make the kernel compute the hypotenuse between the two vectors. First let’s start with the OpenCL kernel, which should make more clear what we’re trying to do:

__kernel void vectorHypot(
    __global const float * a,
    __global const float * b,
    __global       float * c)
    int nIndex = get_global_id(0);
    c[nIndex] = sqrt(a[nIndex] * a[nIndex] + b[nIndex] * b[nIndex]);

Next comes the Haskell code. To make use of this code, you’ll need my latest version of OpenCLWrappers from github.

We start, as with any decent literate haskell document, with various imports to break the flow of the document (note to self, investigate using anansi in the future to see if it makes this easier).

> {-# LANGUAGE BangPatterns #-}
> module Main where
> import System.OpenCL.Monad
> import System.OpenCL.Wrappers.Types
> import System.Random (randoms, mkStdGen)
> import Foreign.Marshal.Array (newArray, peekArray)
> import Foreign.Marshal.Alloc (free)
> import Foreign.Ptr (castPtr, nullPtr, Ptr)
> import Control.Monad (forM, forM_)
> import Data.Bits ((.|.))
> import Data.Time (getCurrentTime, diffUTCTime)

Next, we have a function to time execution times. I’m pretty sure it doesn’t work, so I’d love some suggestions for a better way to do this!

> time :: IO a -> IO a
> time x = do
>     !before <- getCurrentTime
>     !a <- x
>     !after <- getCurrentTime
>     print $ diffUTCTime after before
>     return a
And finally on to the guts of the program.We start by reading in the source for the file. Then we create two lists of len‘ random Float values. I’m sure there are better ways to do this too, but I was after a quick (ha!) and dirty result.

The lists are then written to arrays, which are cast to pointers of () (equivalent to void *), so that it matches the types of required by clCreateBuffer later. Then we run the computation (via runHypot), the arrays are read and freed, and we check to see whether the results differ by much, compared to what we expect.

> len = 2^22 :: Int
> main = do
>     str <- readFile "kernel.cl"
>     let a = take len $ randoms (mkStdGen 1) :: [Float]
>         b = take len $ randoms (mkStdGen 2) :: [Float]
>     pa' <- newArray a
>     pb' <- newArray b
>     pc' <- newArray (replicate len (0.0 :: Float))
>     psize' <- newArray [len]
>     let pa = castPtr pa' :: Ptr ()
>         pb = castPtr pb' :: Ptr ()
>         pc = castPtr pc' :: Ptr ()
>         psize = castPtr psize' :: Ptr ()
>     time $ runHypot str pa pb pc
>     cres <- peekArray len pc'
>     free pa'
>     free pb'
>     free pc'
>     time $ print
>          $ take 100
>          $ map (\(a,b) -> a-b)
>          $ dropWhile (\(a,b) -> abs (a-b) < 10e-7)
>          $ zipWith3 (\a b c -> (sqrt (aa + bb), c)) a b cres

Now we get to the uh… fun part. It turns out that OpenCL is amazingly tedious for such a simple task. The process of running a kernel is as follows:

  1. Find out about the platforms available
  2. Find out about all the devices you have access to. In my case, on my MacBook Pro I have access to one CPU, and one GPU. This gets printed on the following line.
  3. Select a device to run the computation on. I chose the GPU, mainly because choosing the CPU didn’t work for some reason. I may investigate this in the future
  4. Create an OpenCL context, which is used for all sorts of stuff…
  5. Create a command queue for the device. Each action you wish to perform on the device will be queued here, which including moving data to the device’s memory, running the kernels themselves, and moving data back to the host’s memory
  6. Next the program is created from the course passed in (originally from kernel.cl remember?)
  7. Next we compile the program. You can see I’ve had to jump through some hoops to make this work. I technically could have just run clBuildProgram, but the way I’ve done it allows me to get some info about what went wrong with the compilation. Here I print out the compile/error log returned from the compiler if something goes wrong.
  8. Buffers are created, which will have the contents of the host pointers we allocated and passed as arguments copied into them. This step is what moved the data onto the device.
  9. Finally we get to running the kernel. You may be wondering why I’m using the magic number maxWISize `div` 4 here… I’m using it because it worked. I was hoping that just setting the work item size to maxWISize would work, but for some reason it doesn. I might investigate this later…
  10. Now all that’s left is to read the data back from the device, and then free the memory used on the device also. Once this is done, the pointer pc should contain our results.

> runHypot :: String -> Ptr () -> Ptr () -> Ptr () -> IO (Either ErrorCode ())
> runHypot str pa pb pc = runOpenCL $ do > pids <- clGetPlatformIDs -- 1 > dids <- fmap concat $ forM pids $ \pid -> > clGetDeviceIDs pid clDeviceTypeAll -- 2 > infos <- forM dids $ \did -> > clGetDeviceInfo did clDeviceType > liftIO $ print infos > let devid = dids !! 1 -- 3 > ctx <- clCreateContext [] [devid] Nothing nullPtr -- 4 > queue <- clCreateCommandQueue ctx (dids !! 1) [] -- 5 > > prog <- clCreateProgramWithSource ctx str -- 6 > err <- liftIO $ runOpenCL $ clBuildProgram prog [devid] "" Nothing nullPtr > case err of -- ^ 7 > Left err -> do > x <- clGetProgramBuildInfo prog devid clProgramBuildLog > liftIO $ print x > Right x -> return x > kern <- clCreateKernel prog "vectorHypot" -- 8 > > let bytes = fromIntegral len * 4 -- 9 > pad' <- clCreateBuffer ctx (clMemReadOnly .|. clMemCopyHostPtr) bytes pa > pbd' <- clCreateBuffer ctx (clMemReadOnly .|. clMemCopyHostPtr) bytes pb > pcd' <- clCreateBuffer ctx clMemWriteOnly bytes nullPtr > pad <- liftIO $ newArray [pad'] > pbd <- liftIO $ newArray [pbd'] > pcd <- liftIO $ newArray [pcd'] > clSetKernelArg kern 0 8 $ castPtr pad > clSetKernelArg kern 1 8 $ castPtr pbd > clSetKernelArg kern 2 8 $ castPtr pcd > > (DeviceInfoRetvalCLsizeiList (n:_)) <- > clGetDeviceInfo devid clDeviceMaxWorkItemSizes -- ^ 10 > let maxWISize = fromIntegral n > liftIO $ print maxWISize > eventRun <- > clEnqueueNDRangeKernel queue kern -- 11 > [fromIntegral len] > [fromIntegral maxWISize div 4] [] > > eventRead <- clEnqueueReadBuffer pcd' True 0 bytes -- 12 > pc queue [eventRun] > > clEnqueueWaitForEvents queue [eventRun, eventRead] -- 13 > clReleaseMemObject pad' > clReleaseMemObject pbd' > clReleaseMemObject pcd'

To compile, make sure you call ghc with -lopencl or -framework OpenCL on OS X: ghc -framework OpenCL main.lhs

As you can see, this is a hell of a lot of work to go through for such a simple task, and this is why I hope to make a higher level set of wrappers in the nearish future. I would love to be able to do everything using either Vectors or Repa arrays (the latter would be more ideal). It would also be nice to create a DSL for creating OpenCL kernels, but that’s a long way away at the moment.

I think I’ll focus first on making a cleaner interface to things like attaining a context, and allocating data.

Anyway, that’s it for now, let me know if you have any questions, or is anything doesn’t make sense.

2 Responses  
  • Martin Dybdal writes:
    December 18th, 2011 at 12:23 AM

    You probably want to insert a call to clFinish as the first and last thing in your time function. In this way, you will make sure that allocation of the arrays are finished before starting the timer and that the enqueued operation is completely finished before stopping the timer again.

  • Ivan Miljenovic writes:
    December 18th, 2011 at 9:28 AM

    I have a suspicion that your time function needs to use deepseq rather than just a bang for the value you’re evaluating…

Leave a Reply

»  Substance: WordPress   »  Style: Ahren Ahimsa
© Alex Mason (Axman6) 2009