»
S
I
D
E
B
A
R
«
N-bodies evolution
Feb 4th, 2009 by axman6

So my last post got submitted to reddit by dons, and someone asked if I translated the code from C, so I thought I’d talk about my evolution of the problem, and how big the difference is runtime from start to finish is so far. First I should explain the problem. What needs to happen is I take 5 planets and the sun, each with an initial position and velocity, and a mass. Then I set about modifying each body’s position and velocity according to the gravitational pull of all the other bodies. For the purposes of this problem, in the end we’re only interested in the total energy in the system, which changes as time goes along. For the actual shootout, the bodies must be iterated 50,000,000 times, with a delta time of 0.1 (Not actually sure what that’s 0.1 of, possibly days). The reason I got interested in this problem was because I decided to take a look at the current haskell submission. I was a little horrified when I say it as all pointer nonsense and back magic. Full of stuff haskell can do fine (And possibly better and more safely that other languages), but should really be avoided where possible. I wanted to see if you could write a fast version, using more accessible techniques.

Lists

So my initial try used lists. I thought about how I could make is faster, and cool at the same time, and failed by using Control.Parallel, and I think all I have to say about this version that I am not proud of it, and when I just tried to run it with n=50000000, I decided to kill it after it reached 3.3GB of RAM usage in under a minute. Yeah, not great at all.

Parallel Arrays

So my next iteration of the problem basically continued on from the list version, but this using parallel arrays. As I’d basically followed in the same steps I’d used for the list version, its performance was on par with that of the list version; it was also killed after using 3GB of RAM in about 30 seconds, and again, I was not proud. It should also be mentioned that neither the list of parallel array versions produced the correct answers with n=1000, so they were never going to be contenders

STArrays

It was about this time I discovered the ST monad, and I fell in love (Sort of). The ST monad allows you to write imperative programs, using mutable variables and arrays, in a totally pure way. It’s a little like the State monad, but it has a generic state of ‘forall s’. I’m not going to go into the details of ST, but if you’re interested, check out SPJ’s paper on the subject of Lazy Functional State Threads (.ps.Z). This is where I finally started to get some good results. The program runs in about 1.4MB RAM (:O), which means I could actually time its runtime. So I did that, and well, at least it finished. The runtime was 14m46.957s according to time, which is still considerably faster than both Pythons, JavaScript under TraceMonkey, ruby, perl, php, and ruby again, but it’s slower than Lua, which is 24x slower than the current fastest C++. But I was getting somewhere! I could see this was the way to go, with memory usage being the most important thing so far, ST made keeping it really low easy as can be. Now for some speed…

STUArrays

I can’t remember where I found out about STUArrays, but I could see they were the obvious solution. They’re almost as close to C arrays as you’re going to get in nice haskell, while still remaining pure, and not so ugly. Basically STUArrays are unboxed arrays (U) in the ST monad (ST). This means that when out put a type into them, instead of putting a pointer to the data in the type inside the array, you put the actual data into the array, so it’s all sitting next to each other. This of course restricts the type you can out into an array somewhat to things like Doubles, Ints, Words, and the few other things with single constructors (This rules out Bools and Integers). There is quite a speed advantage to using STUArrays, but writing your own instance for say MArray (STUArray s) Body (ST s) is a little off putting. Luckily I had talked with Dons about what I was working on, and he’d already started working on something similar. So I nicked his MArray instance, modified it, and… Failed. For some reason, I kept getting segfaults when trying to get this STUArray stuff to work with my Body type with n > 22:

data Vec3 = V !Double !Double !Double
data Body = B {
pos ::  {-# UNPACK #-} !Vec3,
vel ::  {-# UNPACK #-} !Vec3,
mass :: !Double
}
I was not happy! And I still have no idea what went wrong. But I moved on.

STUArray s Int Double

I decided I’d avoid the MArray instance completely and use Double STUArrays directly. This turned out to be a good idea indeed, bringing some nice performance benefits. By doing this, I didn’t need to worry about taking out redundant data from the array by fetching a whole Body each time, I could pick out the data I needed, be it the Velocity, Mass or Position, or all three if I needed them all. With this work, I brought the runtime down to around 2m40 (I think), an approx. 80% improvement in runtime! This was massive! I was certainly onto something. I was now sitting behind Erlang at 5.9x and below Scheme PLT at 9.1x.

Performance

Ok, I’d come this far, but I was now stuck for ways to increase the speed. I’d done some algorithm rejiggering to see if I could gain any speed, and I managed to cut another 10 seconds or so off, but that wasn’t going to cut it. So, I took a look at the performance pages on the wiki. The Arrays page was most useful, suggesting that the use of unsafeRead and unsafeWrite could make huge improvments by removing redundant bounds checking. The floating point page also helped a little, with the suggestion of using the -fexcess-precision flag to speed up the floating point calculations. With these two changes, I managed to cut a whole minute 15 off the time. Yes, I just got a 50% speedup! Now I was getting happy. I was now beating Erlang HiPE, but sadly I was nowhere near C#’s 45 seconds, and definitely nowhere near C++’s 21 seconds. Still, how often do you get to produce a 50% speedup in some already relatively fast code?

Conclusion and the future

So what next? Well, I think I can save some more time be using explicit loops with accumulation parameters instead of using STRefs which I have a feeling may carry some overhead. I don’t know what kind of gains I can expect, but I doubt they’ll be that large. I might also consider making my code even more C like, by directly accessing parts of the arrays directly. I may even get rid of the whole ST thing, and working in IO and with more primitive operations, but by doing that, I’ll be moving away from the easier to understand code I have now. So, what can I conclude from all this? Well, Haskell sure can be fast, while still remaining beautiful. But making it really fast starts to get ugly (Introducing unsafe* does not help :\ ). There are obvious way to do things in haskell, and even some obvious fast ways to do them. But really fast code takes effort (Which is true of most, if not all languages). I’ve learnt a lot while writing this program, and I’ve still got a lot to learn. Wish me luck!

»  Substance: WordPress   »  Style: Ahren Ahimsa
© Alex Mason (Axman6) 2009