Software Development with Linux

Parsing XHTML in C

THU, 01 OCT 2009

Recently, I had to build an (X)HTML parser in C.  Instead of re-inventing the wheel, I looked at what already existed.  There seem to be only three possibilities: libwww, HTML Tidy, and the HTMLparser module of libxml2.

After a quick look around, it seems as if libwww hasn't been updated since 2002.  As for HTML Tidy, it doesn't look like the parser could be easily re-used.  The HTMLparser of libxml2 was more recent and packaged as a nice library, so I went with it for my first tests.

I was pleasantly surprised by libxml2's HTMLparser.  I didn't find a single web page that it wasn't able to parse.  Sure, it reported plenty of errors on many web pages, but that didn't stop the parser from giving me the information I was looking for.

The only thing lacking is a good tutorial; I didn't find any.  So, I'll polish the one I wrote while trying out the library and post it here as soon as possible.  Hopefully, it will help others use it.

Who said you couldn't easily parse (X)HTML in C?