6.4. Example of Parsing HTML Web Pages with Html5lib

6.4.1. Background

Writing a program to parse complex web pages can be fairly tricky. The textbook discusses parsing web pages containing weather forecasts from forecast.weather.gov and www.wunderground.com. Both of these sites return fairly complex HTML code. Last summer, I developed programs that parsed both of these pages. The HTMLParser module, which is discussed in the book, was not doing a good job with the complex web pages, so I turned to another module called html5lib, which is based on the same standard that web browsers like Firefox and Internet Explorer use. Both of my programs just summarized the forecasts and printed plain text.

My original thought last summer was that I would develop one parser and the students would do the other one. After the long hours I spent on the first one, I knew that it was not realistic to expect students who are still pretty new to Python, and to programming in general, to develop a full parser, even with examples from the textbook and my code. Each page is a little different and requires different logic and string processing tricks to complete the job. So, after developing the second parser, I just removed some of the code and asked the students to fill in the missing code, which is not my preferred way to come up with a programming assignment. Both of the complete parsers from last year are available for download: weatherForecast.py and wuWeatherForecast.py.

The required module, html5lib, may be installed using easy_install from the setuptools package, as was done with dnspython (see Installation of Python Packages):

C:\> easy_install html5lib

Also, note that depending on which version of html5lib is installed, you may see the following message when you run any of these programs: DeprecationWarning: object.__init__() takes no parameters. This is just a deprecation warning and may be ignored. It is fixed in version 0.12, which at this time has not been made available as a stable release on PyPI.

This year, I was looking to make a change to this assignment, and I found that wxPython [wxPython] has a graphics widget that is sort of a mini web browser for displaying simple HTML-formatted data. So I thought it would be fun to come up with a graphical version of the weather forecast program.


You may wonder what the point is of parsing HTML and then turning around and generating HTML output. Well, if we want to filter or reformat the content, we must first parse it. We can use HTML parsing to generate plain text from the HTML, but that is not really the point. The point is to use the known formatting to determine which parts of the data have specific meaning, so that just the meaningful data can be extracted or so that decisions can be made from the data, as is the goal with parsing the zip code data from the US Postal Service web page.

The program that displays the graphical weather forecast is wxWeather.py (see Links to Useful Files). Another useful program is wxHTML.py, which can be used to load and display an HTML file. The wxWeather.py program also tries to generate a better quick view of the weather than last year’s programs did. The previous programs display a detailed forecast, which wxWeather.py ignores. However, wxWeather.py adds a summary of the current conditions, which required some additional parsing code to obtain.


wxPython requires wxWidgets, which is a C++ library of platform-independent graphics widgets that use the native OS graphics facilities. So a program running on Windows will look like any Windows program, and the same program run on Linux (GNOME or KDE) or Mac OS X will look like a program developed for those platforms. Since wxWidgets is also required, this is a case where you should download the appropriate installer for your platform from wxPython.org. If you are using Linux, you might need to compile it from the source code. I have some notes on how to do this, if you need them. [wxPython]

So after I spent many hours developing wxWeather.py, I was again faced with the same problem as a year ago: what part of the problem to ask you to write. Then I realized that my program has a problem, which you can help me fix. My program starts by giving the forecast for Salina. (Feel free to change that if you don’t live in Salina.) There is a button to change the zip code for the weather lookup. But when an invalid zip code is entered, the program just displays: Error: Weather Underground, which comes from the title of the returned page. The rest of the returned HTML code is totally different from what my program is looking for. Argh! Probably the simplest approach would be to look for the Error string in the title and then branch to a different parser. But that got me thinking about doing a query on a zip code database, and a quick Google search led to the Postal Service web page. Reading the HTML and JavaScript gave clues to the correct URL to automate the query. See zipcode.py, where I did some testing to validate the URL.
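For reference, the "look for the Error string in the title" idea could be sketched roughly as follows. This is not code from wxWeather.py. It uses the call-back style parser from the Python 3 standard library (html.parser, the successor of the HTMLParser module), and the names TitleParser and is_error_page are made up for this illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False   # True while we are inside the title element
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def is_error_page(html_text):
    """Return True if the page title starts with the word Error.

    Hypothetical helper: the caller would branch to a different
    parser (or show an error dialog) when this returns True.
    """
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip().startswith("Error")
```

Only the title is inspected, so the word Error appearing somewhere in the body of a normal forecast page does not trigger the error branch.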

So writing your own parser for a fairly simple web page, just to make a True/False logical decision, seems like a better programming assignment than trying to fit your code into my more complex parser. A zip code checking function might also be useful for other programming situations you attempt in the years to come. If you look at my code, in the WeatherPanel.load_weather method, you can see where I test that the zip code is five characters long and contains only digits. This is the appropriate place to call your zip code validity function. If you want to, you can create a wx.MessageDialog error window similar to the one that is already there. (You will probably find that wxPython is pretty fun to play with.)
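To make the two-stage check concrete, here is a small sketch of the idea. It is not the code from WeatherPanel.load_weather. The names has_zip_format and check_zip are invented for this example, and lookup stands in for whatever zip code validity function you write for the assignment.

```python
def has_zip_format(zipcode):
    """Return True if zipcode is exactly five ASCII digits."""
    return len(zipcode) == 5 and all(ch in "0123456789" for ch in zipcode)


def check_zip(zipcode, lookup):
    """Run the cheap format test first, then the real lookup.

    lookup is a placeholder for your own function, which is assumed
    to take the zip code string and return True or False.
    """
    if not has_zip_format(zipcode):
        return False
    return bool(lookup(zipcode))
```

Doing the format test first means the web query is only made for strings that could possibly be zip codes.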

6.4.3. HTML5lib

class html5lib.HTMLParser

The parser object from html5lib.


Parse the HTML code returning a tree data structure.

HTML5lib is an HTML parser/tokenizer based on the WHATWG HTML5 specification for maximum compatibility with major desktop web browsers. See the html5lib project page [HT5LIB] for examples of how it may be used. For the weather parsing programs, I just used it to generate a serial stream of tokens. This may not be the best way to use it, but it is probably the simplest technique. The serial stream approach is similar to the call-back technique used with HTMLParser. The only real advantage of using html5lib in this case is that the parser is more robust for handling complex HTML.
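To show what working from a serial stream of tokens looks like, here is a rough sketch that uses only the Python 3 standard library. It is not html5lib's API and it is not the code from the weather programs. The (kind, value) tuples and the names TokenCollector, token_stream and text_inside are assumptions made for this illustration. html5lib has its own token format, so see its documentation for the real thing.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Record each parser call-back as a (kind, value) tuple."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        self.tokens.append(("text", data))


def token_stream(html_text):
    """Return the page as a flat list of tokens, in document order."""
    collector = TokenCollector()
    collector.feed(html_text)
    collector.close()
    return collector.tokens


def text_inside(html_text, wanted_tag):
    """Walk the stream once, keeping the text found inside wanted_tag."""
    inside = False
    pieces = []
    for kind, value in token_stream(html_text):
        if kind == "start" and value == wanted_tag:
            inside = True
        elif kind == "end" and value == wanted_tag:
            inside = False
        elif kind == "text" and inside:
            pieces.append(value)
    return "".join(pieces).strip()
```

With a stream, the state (here the inside flag) lives in an ordinary loop rather than being spread across several call-back methods, which is the main practical difference between the two styles.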