»
S
I
D
E
B
A
R
«
If you need speed, don’t talk to main!
Mar 4th, 2009 by axman6

Yesterday, I submitted my first program modification to the computer language shootout game, which is a change to the thread-ring program. If you’ve followed the shootout at all, you will probably know that haskell dominates at this benchmark, with the next contender (Mozart/OZ) being 50% slower than haskell already.

While i was having a look at the code, I noticed that the haskell entry was forkIO’ing a lot of threads, which is exactly what we want. But we weren’t quite forking enough; one of the 503 threads was the main thread. Bad idea!

I learnt a few weeks ago, while working on my AVar package, that communicating between bound and non bound threads (forkIO’d vs. main) when using things like MVars can be really slow. And this is exactly what was happening:

main = do
a <- newMVar . read . head =<< getArgs
z <- foldM new a [2..ring]
thread 1 z a

as you can see, the last/first thread was being run in main, so once per loop around the ring, we were talking to main via an MVar, and then talking to another forkIO’d thread via another MVar. So here’s what I did:

main = do
a <- newMVar . read . head =<< getArgs
z <- foldM new a [2..ring]
ret <- newEmptyMVar
forkIO (thread 1 z a >>= putMVar ret)
takeMVar ret

Two and a half lines of code. What were the results? Well, when the program was run with n = 50,000,000, the original ran in ~14.2s, with -N{1,2,3,4}, my modified one took on average ~9.3s, which is about a 33% improvement. It should be noted that when both programs are compiled without -threaded, there is almost no difference in speed at all (and they both run considerably faster than the non threaded version). Need to talk to the shootout peeps and find out if it’s against the rules to compile without -threaded, and get an even more massive speed boost.

– Axman

Other shootout news
Feb 6th, 2009 by axman6

I’ve been talking to Stephen Blackheath on the #haskell-in-depth channel, and he’s just submitted another new program to the shootout. This entry is for the regex-dna problem.

For this benchmark, each language is limited by the speed of the regex library it’s using.
Right… let me see. Here’s what I did;
I am not certain, but it seems that forcing the evaluation of s1, s2 and s3 (the blobs of input data) before parallelizing the processing
gave better CPU utilisation.
I also changed the part at the end where it does large numbers of pattern substitutions, so that it updated everything in-place in a mutable buffer.
I found that splitting the pattern substitutions into chunks sped things up by a factor of three. It looks like there’s some sort of overhead incurred on large buffers when the regex ‘match’ function creates its AllMatches array.
The reason why I needed memmove instead of memcpy is that memmove can work on overlapping buffers.
These changes improved the performance from 1:00 user / 0:42.2 real to 0:40.1 user / 0:25.12 real on my dual core, compared with 23.3 sec user for the slower non-parallel one of the two C++ implementations. (The parallel one wouldn’t build for me.)

You can see the details of the submission here, and the actual code. He also adds:

I took my unsafePerformIO’s out too. I have a reputation to maintain!

It’s good to see that someone else is working on making haskell look good (and doing a much better job than me thankfully!). Maybe soon we’ll be topping the list, or at least beating Java to move up to 4th place.

More n-bodies speedups
Feb 5th, 2009 by axman6

So I’ve been taking full advantage of the new #haskell-in-depth channel, and with the thanks of blackh and others, I’ve made a few more changes, and I’ve shaved another whole 30 seconds of my already not too shabby 1m17s. I’m now down to 47.215s, making a saving of 30s, or a saving of 39%.

If you’re wondering what I’m on about, take a look at this post for the story of my n-bodies program so far, and this port for what my code looked like before.

So how did I do it this time? Well, I decided STRefs weren’t necessary, and could easily be replaces with accumulating parameters in basically all of my code that used them. I did this in the hope they would be kept in registers, but even if they’re not, it’s still faster. I also saved about 3 seconds by inlining all my read* and write* functions with {-# INLINE #-} pragmas that would write or fetch the various parts of the Body’s info. So here’s the code, first the original one:

advance dt ps = do
forM_ index1 $ \i -> do
-- (B p v m) <- readBody ps i
p <- readPos ps i
v <- readVel ps i
m <- readMass ps i
vvar <- newSTRef v
forM_ (index2 i) $ \j -> do
-- (B p' v' m') <- readBody ps j
p' <- readPos ps j
v' <- readVel ps j
m' <- readMass ps j
let dp = p .-. p'
dst = mag dp
magnit = dt / (dst  dst  dst)
change = magnit . dp
(sca1,sca2) = (m' . change, m *. change)
writeVel ps j (v' .+. sca2)
modifySTRef vvar (.-. sca1)
vvar'@(V nx ny nz) <- readSTRef vvar
-- unsafeIOToST (print vvar')
writeVel' ps i nx ny nz
writePos ps i $ move' dt (B p vvar' m)

And the new one, with an inner loop:

advance dt ps =
forM_ index1 $ \i -> do
-- (B p v m) <- readBody ps i
p <- readPos ps i
v <- readVel ps i
m <- readMass ps i
let loop2 !j !v@(V !x !y !z)
| j > numPs = return v
| otherwise  = do
p' <- readPos ps j
v' <- readVel ps j
m' <- readMass ps j
let dp = p .-. p'
dst = mag dp
magnit = dt / (dst  dst  dst)
change = magnit . dp
(sca1,sca2) = (m' . change, m *. change)
writeVel ps j (v' .+. sca2)
loop2 (j+1) (v .-. sca1)
v' <- loop2 (i+1) v
-- unsafeIOToST (print v')
writeVel ps i v'
writePos ps i $ move' dt (B p v' m)

»  Substance: WordPress   »  Style: Ahren Ahimsa
© Alex Mason (Axman6) 2009