Cereal, as you may know from my previous posts is a library for parsing binary data from strict ByteStrings. It is very similar to the binary package, but importantly, provides both an Alternative instance, and an
Either String a return type for the decode function, which tells you where the parse failed.
I’ve been playing around with cereal lately in jlouis’ haskell-torrent project, rewriting the various binary parsing and producing parts of the program (the torrent file parser, and the wire protocol parser). I though it would be nice to share some of the code used for these, to demonstrate how easy cereal makes it to do such things.
To begin with, I’ll show you the part that decodes and encodes torrent files (if needed in the future). Torrent files are encoded using a very simple encoding, known as bencoding, which consists of four major primitives: Integral numbers, Strings of bytes, Arrays of bencoded objects, and Dictionaries of String, bencoded object pairs. This is very nicely represented using this datatype:
data BCode = BInt Integer
| BString B.ByteString
| BArray [BCode]
| BDict (M.Map B.ByteString BCode)
the specification for bencoded data goes something like this:
Integers are encoded as the ASCII character for ‘i’ as a byte,
followed by the digits of the integral value, terminated by the
ASCII byte for ‘e’.
Eg: the number ‘42’ would be encoded as “i42e”
Strings are encoded as the digits of their length, followed by a
colon (‘:’), then the bytes of the string. these strings are really
just byte sequences, and probably shouldn’t be treated as having
an encoding (as jlouis and I found out when I tried to test the current
code on GHC 6.12.1, with the BString type using Strings, instead of
ByteStrings, and finding out that the simple test contained byte sequences
that could not be represented as Strings).
Eg: the string “hello” would become “5:hello”, “hello world” would
become “11:hello world”
Arrays are encoded as ASCII ‘l’ (for list I believe), followed by any
number of bencoded objects, terminated by an ASCII ‘e’. (This is where
using binary became difficult, as you had to explicitly check whether
you had reached the terminating ‘e’ using
lookAhead when parsing before attempting
to parse another bencoded object, du the the lack of actual failure handling)
Eg: ["Hello", 123] would become “l5:helloi123ee”. Notice how we’ve used the
previous definitions for integral numbers, and strings.
Dictionaries are encoded as an ASCII ‘d’, followed by the String, object
pairs, followed by an ASCII ‘e’.
Eg: fromList [("test",123),("arr",[1,2,"hello"])] would become
It looks a bit of a mess, but it is quite efficient.
When writing my Serialize instance (Cereal’s version of the Binary class) for the BCode type, I decided it would be much easier to write the put methods first. This turned out to be rather straight forward, once I’d written a few helper functions.
toW8 :: Char -> Word8
toW8 = fromIntegral . ord
fromW8 :: Word8 -> Char
fromW8 = chr . fromIntegral
toBS :: String -> B.ByteString
toBS = B.pack . map toW8
fromBS :: B.ByteString -> String
fromBS = map fromW8 . B.unpack
wrap :: Char -> Char -> Put -> Put
wrap a b m = do
putWord8 (toW8 a)
putWord8 (toW8 b)
putShow :: Show a => a -> Put
putShow x = mapM_ put (show x)
With these in hand, I set to work implementing the put function. The Integer and Array functions were straight forward:
instance Serialize BCode where
put (BInt i) = wrap 'i' 'e' $ putShow i
put (BArray arr) = wrap 'l' 'e' . mapM_ put $ arr
The Dictionary and String implementations weren’t too bad either:
put (BDict mp) = wrap 'd' 'e' dict
where dict = mapM_ encPair . M.toList $ mp
encPair (k, v) = put (BString k) >> put v
put (BString s) = do
putShow (B.length s)
putWord8 (toW8 ':')
As you can see, the code is quite clear, and matches the specification quite well.
Parsing the data was the next step. this proved a little more difficult, but with my recent (shallow) experience with Parsec, I knew what was needed.
I decided to start by writing some useful combinators (this is a lie, I wrote them when needed, but lying makes the post flow better >_>). These included the following:
getCharG :: Get Char
getCharG = fromW8 <$> getWord8
char :: Char -> Get ()
char c = do
x <- getCharG
if x == c
then return ()
else fail $ "Expected char: '" ++ c:"' got: '" ++ [fromW8 x,'\'']
getWrapped :: Char -> Char -> Get a -> Get a
getWrapped a b p = char a > p < char b
many :: Get a -> Get [a]
many p = many1 p
mplus return 
many1 :: Get a -> Get [a]
many1 p = (:) <$> p <*> many p
digit :: Get Char
digit = do
x <- getCharG
if isDigit x
then return x
else fail $ "Expected digit, got: " ++ show x
getDigits :: Get String
getDigits = many1 digit
My favourite two definitions here are
, which nicely show the use of Alternative: they are mutually recursive, with
being the only one of the two to actually do and parsing, while
checks to see if
failed to parse one object using the parser p. It’s really quite beautiful, and makes the code that follows a hell of a lot nicer to write. This is where the love mentioned in the title comes in by the way.
With these in hand, I could now go ahead and write the actual parsers for various BCode types. Parsing BInts and BArrays is dead simple now:
getBInt :: Get BCode
getBInt = BInt . read <$> getWrapped 'i' 'e' getDigits
getBArray :: Get BCode
getBArray = BArray <$> getWrapped 'l' 'e' (many get)
As as side note, I’ve now come to see just what the folks on #haskell were on about when they said Applicative is nice. I think I’ve fallen in love (yet again!).
BStrings were a little more difficult, but not hard, given what I’ve just written:
getBString :: Get BCode
getBString = do
count <- getDigits
BString <$> ( char ':' *> getByteString (read count))
Here we get as many digits as we can, followed by a colon, and then take the number of bytes the digits specified. Finally, we have the BDict definition, which also is quite nice, if slightly annoying with its use of pattern matching (don’t get me wrong, i love pattern matching, but it’s the only place it’s used in the parser
getBDict :: Get BCode
getBDict = BDict . M.fromList <$> getWrapped 'd' 'e' (many getPairs)
where getPairs = do
(BString s) <- getBString
x <- get
Putting it all together, we finally have a definition for the get function in the Serialize class.
get = getBInt <|> getBArray <|> getBDict <|> getBString
A rather clean, elegant, and hopefully correct serialiser and deserialiser for the bencoded format used in torrent files. I’m considering releasing this code as a separate package on hackage, but I’m still not sure how widely it might be used. I have a string feeling that that would not be very wide at all, but that a library of more advanced combinators for cereal would make life a lot easier for others like me who have some strange binary formats that need to be parsed in an efficient manner.
Please, I implore you, do let me know what you think of this all, I’m always interested in seeing what others think of my code, and ways to improve it.
Until next time,