Uhh, hello. Welcome to my first blog post ever – and thanks Axman6 for letting me be a “guest blogger”.
It’s rather unfashionable on #haskell, but I like XML. So, 18 months ago, I took over the hexpat package from Evan Martin. It was going to be a small project – a simple XML parser binding to Expat. The fastest Haskell XML parser alive. Or so I thought.
It’s become a passion, a way of life. It’s XML parsing in Haskell the way I think it should be done. The best as well as the fastest. (I like to think big.)
I’ve finally finished adding all the features that I and a number of contributors wanted, and I would now like to announce that hexpat is going beta. I want to make this package really, really good, so please help by testing and critiquing. I want to stabilize hexpat, but hexpat-iteratee will be unstable for a while yet.
The future is chunky
The cherry on top of the hexpat galaxy is the still experimental hexpat-iteratee based on Oleg Kiselyov’s iteratee, which is a bit of a hot ticket these days. It provides lazy XML parsing without the practical issues and philosophical dodginess inherent in Haskell’s lazy I/O through functions like hGetContents.
hexpat-iteratee allows for effectful XML processing done in a functional way, and the magic behind this is Yair Chuchem’s humbly named List package. It is “merely” a generalization of lists, and I think it deserves to be a common piece of infrastructure.
The example project is a chunked XML-over-TCP movie database lookup server. Every home should have one. So, let’s start like all good blogs do, with imports:
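import Control.Concurrent (forkIO)
import Control.Monad (forever, mzero)
import qualified Data.ByteString as B
import qualified Data.ByteString.Unsafe as B (unsafeUseAsCStringLen)
import Data.List.Class as List
import Data.Maybe (fromMaybe)
import Data.Text (Text)
import qualified Data.Text as T
import Foreign.Ptr (castPtr, plusPtr)
import Network (PortID(PortNumber), accept, listenOn)
import System.Posix.IO (handleToFd, fdWriteBuf, closeFd)
import System.Posix.Types (Fd)
import qualified Text.XML.Expat.Chunked as Tree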
The first thing we want to do is listen on a socket. I could use handles, sockets, or file descriptors. With handles, this code does not work interactively: disabling the buffering does not seem to work at all in GHC 6.10 or 6.12. Sockets would be ideal, but to save me writing an iteratee driver, I’m left with file descriptors, which unfortunately means this code only works on GHC 6.12 on a POSIX system. fdPutStrBS is the only glue I need then: it writes a ByteString to an Fd. Here’s the code:
main :: IO ()
main = do
    let port = 6333
    putStrLn $ "listening on port "++show port
    ls <- listenOn $ PortNumber port
    forever $ do
        (h, _, _) <- accept ls
        -- One thread per connection, working directly on the file descriptor.
        forkIO $ handleToFd h >>= \fd -> do
            iter <- parse defaultParserOptions (session (fdPutStrBS fd))
            result <- enumFd fd iter >>= run
            -- result is not inspected here; just release the descriptor
            closeFd fd

fdPutStrBS :: Fd -> B.ByteString -> IO ()
fdPutStrBS fd bs = B.unsafeUseAsCStringLen bs $ \(buf, len) ->
    writeFully (castPtr buf) (fromIntegral len)
  where
    -- Keep writing until the whole buffer has gone out.
    writeFully _ len | len == 0 = return ()
    writeFully buf len = do
        written <- fdWriteBuf fd buf len
        if written < 0
            then fail "write failed"
            else writeFully (buf `plusPtr` fromIntegral written) (len - written)
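The loop in writeFully is there because a single write on a socket may accept only part of the buffer. Here is a minimal sketch of that retry pattern using only base. shortWrite and writeAll are hypothetical names of mine, and a String sink stands in for fdWriteBuf and a real Fd, so treat it as an illustration rather than the code the server uses.

```haskell
import Data.IORef

-- Hypothetical stand-in for fdWriteBuf: accepts at most 'limit' characters
-- per call, appends them to the sink, and reports how many it took.
shortWrite :: Int -> IORef String -> String -> IO Int
shortWrite limit sink s = do
    let chunk = take limit s
    modifyIORef sink (++ chunk)
    return (length chunk)

-- Keep calling the writer until nothing is left, skipping past whatever
-- the previous call managed to write.
writeAll :: (String -> IO Int) -> String -> IO ()
writeAll _ "" = return ()
writeAll write s = do
    n <- write s
    if n <= 0
        then ioError (userError "write failed")
        else writeAll write (drop n s)

main :: IO ()
main = do
    sink <- newIORef ""
    -- The writer only takes 3 characters at a time, so several calls are needed.
    writeAll (shortWrite 3 sink) "<server/>"
    readIORef sink >>= putStrLn
```

Without the loop, anything past the first short write would silently go missing.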
Once we’ve accepted the connection, we get parse to make us an iteratee. The second argument, “session (fdPutStrBS fd)”, is the handler for processing the document. We then pass this iteratee to enumFd, whose job it is to pull the input data out of the Fd and feed it into the parser. parse is monadic in order that it can start the handler before it receives the first data block through the iteratee. This is necessary in case the handler wants to generate output before it gets any input, which we want to do here.
The handler is a co-routine. When it runs out of input data, it gets suspended, and control returns to enumFd.
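To make the co-routine idea concrete, here is a toy model in plain Haskell. It is not the iteratee package's actual types: Iter, feed, finish and countChars are all names I made up for this sketch. The consumer is either done or parked waiting for the next chunk, and the feeder gets control back each time the consumer runs out of input.

```haskell
-- A toy iteratee: either finished with a result, or suspended until
-- the next chunk arrives (Nothing signals end of input).
data Iter a = Done a | Cont (Maybe String -> Iter a)

-- A consumer that counts characters until end of input.
countChars :: Iter Int
countChars = go 0
  where
    go n = Cont $ \chunk -> case chunk of
        Nothing -> Done n
        Just s  -> go (n + length s)

-- A toy enumerator: hands over chunks one at a time. When the chunks run
-- out, the still-suspended consumer is returned to the caller.
feed :: [String] -> Iter a -> Iter a
feed (c:cs) (Cont k) = feed cs (k (Just c))
feed _      it       = it

-- Signal end of input and extract the result, if the consumer finishes.
finish :: Iter a -> Maybe a
finish (Done a) = Just a
finish (Cont k) = case k Nothing of
    Done a -> Just a
    Cont _ -> Nothing

main :: IO ()
main = do
    let suspended = feed ["<a>", "te"] countChars    -- out of input: consumer is parked
        resumed   = feed ["xt", "</a>"] suspended    -- more data arrives later
    print (finish resumed)
```

The consumer never asks where its chunks come from, which is what lets the same handler sit behind a file descriptor here and a file later on.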
session :: (B.ByteString -> IO ())
        -> ListOf (UNode IO Text)
        -> XMLT IO ()
session writeOut inputXML = do
    let outputXML = formatG $ indent 2 $ Element "server" [] (processRoot inputXML)
    execute $ liftIO . writeOut =<< outputXML
formatG is a hexpat function that takes a tree node and formats it as XML, returning one of Yair’s Lists of ByteStrings. indent is a filter that adds pretty indenting. The Element is the top level tag of our output XML tree, and its third argument “processRoot inputXML” evaluates the child nodes of the output document. The entire processing of the document is in a functional style.
execute here makes all the IO actually happen. It iterates over a List of monadic actions and sequences them. This translates into a sequence of writes of data blocks to the socket. The elements in the list are monadic, so execute also must execute those in order to extract each output ByteString.
In this way, even though processRoot is pure at the top level, it can contain effectful computations.
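If the idea of a list whose cells are monadic actions is unfamiliar, the sketch below may help. It is a toy reimplementation, not the List package's real types: ListM, Step, fromActions, mapListM and executeM are hypothetical names of mine. The point is that walking the list is what runs the effects, in order.

```haskell
-- A toy monadic list: you must run an action to find out whether there is
-- another cell, and what it holds.
newtype ListM m a = ListM { runListM :: m (Step m a) }
data Step m a = Nil | Cons a (ListM m a)

-- Build a monadic list whose items are produced by running actions.
fromActions :: Monad m => [m a] -> ListM m a
fromActions []     = ListM (return Nil)
fromActions (x:xs) = ListM $ do
    a <- x
    return (Cons a (fromActions xs))

-- Apply an effectful function to every item, lazily, cell by cell.
mapListM :: Monad m => (a -> m b) -> ListM m a -> ListM m b
mapListM f l = ListM $ do
    step <- runListM l
    case step of
        Nil       -> return Nil
        Cons a xs -> do
            b <- f a
            return (Cons b (mapListM f xs))

-- Walk the whole list, running each cell's effects and discarding the items.
executeM :: Monad m => ListM m a -> m ()
executeM l = do
    step <- runListM l
    case step of
        Nil       -> return ()
        Cons _ xs -> executeM xs

main :: IO ()
main = executeM (mapListM putStrLn (fromActions [return "<server>", return "</server>"]))
```

Nothing is printed when the list is built; the writes only happen as executeM pulls each cell.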
processRoot :: ListOf (UNode IO Text) -> ListOf (UNode IO Text)
processRoot root = do
    Element _ _ children <- genericTake 1 root
    child <- children
    extractElements child

extractElements :: UNode IO Text -> ListOf (UNode IO Text)
extractElements elt | isElement elt = processCommand elt `cons` mzero
extractElements _ = mzero
ListOf is a type function that conceals a long-winded type name. processRoot maps the input document to a list of output nodes.
The root of the input document is actually given as a List containing one item – the top-level XML tag. The reason why we do this is so that we have to ask for it to be pulled. If it were just passed as a UNode IO Text type, we would have to calculate it before the handler was called, and the handler wouldn’t get a chance to do output before it requests input.
The function is implemented using List’s Monad instance, which behaves exactly like a list monad. The reason for genericTake 1 root is so we stop processing after the root node and don’t wait for a node that will never come. I need to fix this in hexpat-iteratee.
`cons` is the generalized list cons operator, like :, and `mzero` corresponds to [].
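Since List's Monad instance behaves like the list monad, the same shape can be tried out with ordinary lists. This is a sketch under that assumption only: Node is a simplified stand-in for hexpat's node type (no attributes), and processRoot' and keepElements are hypothetical names, with : and take playing the roles of cons and genericTake.

```haskell
import Control.Monad (mzero)

-- A simplified stand-in for an XML node: tag plus children, or a text node.
data Node = Element String [Node] | Text String
    deriving (Show, Eq)

isElement :: Node -> Bool
isElement (Element _ _) = True
isElement _             = False

-- Take the single root, walk its children, keep only the elements.
-- A failed pattern match in the list monad just yields no results.
processRoot' :: [Node] -> [Node]
processRoot' root = do
    Element _ children <- take 1 root
    child <- children
    keepElements child

-- Either a one-item list or an empty one, built with (:) and mzero.
keepElements :: Node -> [Node]
keepElements elt | isElement elt = elt : mzero
keepElements _   = mzero

main :: IO ()
main = print (processRoot' [Element "x" [Text "\n", Element "title" [Text "foo"], Text "\n"]])
```

The whitespace text nodes between tags fall away because keepElements returns mzero for them.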
processCommand :: UNode IO Text -> UNode IO Text
processCommand elt@(Element "title" _ _) = Element "title" [] $ joinL $ do
    txt <- textContentM elt
    return $ search txt
processCommand (Element cmd _ _) = Element "unknown" [("command", cmd)] mzero
Here is our command processor. We have one command, <title>foo</title>, that finds all movies whose titles contain foo.
joinL is a bit of List magic that lets you drop down into the underlying monad, which in this case is XMLT IO a. joinL’s type is :: ItemM l (l a) -> l a where ItemM l is a type function giving the list’s monad. So, the stuff after joinL resolves to a type of :: XMLT IO (ListOf (UNode IO Text)).
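Here is a rough model of what joinL buys you, again with a toy monadic list rather than the List package itself. ListM, joinM, fromRef and the helpers are hypothetical names of mine. The action that decides the list's contents is tucked inside the list, so it only runs when the list is consumed.

```haskell
import Data.IORef

-- A toy monadic list: running the action reveals the next cell.
newtype ListM m a = ListM { runListM :: m (Step m a) }
data Step m a = Nil | Cons a (ListM m a)

fromList :: Monad m => [a] -> ListM m a
fromList = foldr (\a rest -> ListM (return (Cons a rest))) (ListM (return Nil))

toList :: Monad m => ListM m a -> m [a]
toList l = do
    step <- runListM l
    case step of
        Nil       -> return []
        Cons a xs -> fmap (a :) (toList xs)

-- The toy counterpart of joinL: an action that yields a list becomes a list
-- whose first step is to run that action.
joinM :: Monad m => m (ListM m a) -> ListM m a
joinM act = ListM (act >>= runListM)

-- A list whose contents are read from a mutable reference on demand.
fromRef :: IORef [a] -> ListM IO a
fromRef ref = joinM $ do
    xs <- readIORef ref
    return (fromList xs)

main :: IO ()
main = do
    ref <- newIORef ["a"]
    let l = fromRef ref           -- nothing has been read yet
    writeIORef ref ["b", "c"]
    toList l >>= print            -- the read happens here, so it sees the update
```

That deferral is why processCommand can stay a plain function returning a node while still reading text content and files when the output is actually written.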
search :: Text -> ListOf (UNode IO Text)
search key = joinL $ do
    iter <- liftIO $ parse defaultParserOptions $ \root -> do
        let l = do
                elt@(Element _ _ children) <- genericTake 1 root
                movie <- List.filter isElement children
    eMovies <- liftIO $ fileDriver iter "movies.xml"
    case eMovies of
        Left err -> fail $ "failed to read 'movies.xml': "++show err
        Right movies -> return $ List.filter matches movies
  where
    matches elt = key `T.isInfixOf` fromMaybe "" (getAttribute elt "title")
Here’s where our handler does some real I/O. We read our database from a flat file using the same method of parsing. Passing possibly unexecuted nodes outside the XMLT monad is a bit wrong, and needs to be addressed in the design, but here it works as long as I execute them. Alternatively a pure XML parse would work. hexpat has functions to convert between pure and monadic node types.
So, I build and run the server, and here is the result, using Unix’s nc command as my client. I typed this:
The output is:
<?xml version="1.0" encoding="UTF-8"?>
<movie id="dvzrwfvryd" disc="41" title="War of the Worlds (2005)"
director="Steven Spielberg" genre="Sci Fi Thriller" rating="6"
description="Tom Cruise alert" imdbID="tt0407304"/>
<movie id="xxvjgxpokp" disc="44" title="Shaun of the Dead"
director="Edgar Wright" genre="Comedy Horror" rating="8"
description="British send-up zombie movie" imdbID="tt0365748"/>
<movie id="duvcjsygqi" disc="104" title="March of the Penguins (La Marche de l'empereur)"
director="Luc Jacquet" genre="Documentary" description="" imdbID="tt0428803"/>
<movie id="dawcezoiro" disc="109" title="Pirates of the Caribbean: Dead Man's Chest"
director="Gore Verbinski" genre="Action/Comedy" rating="7" description="" imdbID="tt0383574"/>
(New lines added for readability)
And the session can process more commands interactively.
I should also mention my related hexpat-pickle package which is a shameless rip-off of the picklers from Uwe Schmidt’s excellent hxt package. I find it a very practical and quick way to bang out XML picklers. (It doesn’t work with hexpat-iteratee yet.)
Here’s the code in downloadable form. Make sure you use the monads-fd and transformers packages instead of mtl. Also hexpat-iteratee and text.
I hope you found this interesting. I hope the XML haters of #haskell will be miraculously transformed into XML tolerators, and I hope you’ll help me improve hexpat. – Stephen Blackheath, Manawatu, New Zealand