6.4. Example of Parsing HTML Web Pages with Html5lib
Writing a program to parse complex web pages can be fairly tricky. The textbook
discusses parsing web pages containing weather forecasts from
forecast.weather.gov and www.wunderground.com. Both of these sites return fairly complex
HTML code. Last summer, I developed programs that parsed both of these pages.
The HTMLParser module, which is discussed in the book, was not doing a good
job with the complex web pages, so I turned to another module called
html5lib, which is based on the same standard that web browsers like
Firefox and Internet Explorer use. Both of my programs just summarized the
forecasts and printed plain text. My original thought last summer was that I
would develop one parser and the students would do the other one. After the
long hours I spent on the first one, I knew that it was not realistic to expect
students, who are still pretty new to Python and programming in general,
to develop a full parser, even with examples from the textbook and my code.
Each page is a little different and requires different logic and string
processing tricks to complete the job. So, after developing the second parser,
I just removed some of the code and asked the students to fill in the missing
code, which is not my preferred way to come up with a programming assignment.
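Both of the complete parsers from last year are available for download.
html5lib is not part of the Python standard library, so it must be installed
before these programs will run, for example with easy_install:
C:\> easy_install html5lib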
Also, note that depending on which version of html5lib is
installed, when you run any of these programs you may see the message
DeprecationWarning: object.__init__() takes no parameters.
This is only a deprecation warning, not an error, and may be ignored.
It is fixed in version 0.12, which at this time has not been
made available as a stable release on PyPI.
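If the warning is distracting, one way to hide it is with the standard warnings module. This is only a sketch: call_quietly is a hypothetical helper name, not something from html5lib or from my programs, and it hides every DeprecationWarning raised during the call, not just this one.

```python
import warnings


def call_quietly(func, *args, **kwargs):
    """Call func while ignoring DeprecationWarning (hypothetical helper)."""
    with warnings.catch_warnings():
        # The filter only applies inside this with block; the previous
        # warning settings are restored when the block exits.
        warnings.simplefilter("ignore", DeprecationWarning)
        return func(*args, **kwargs)
```

You would wrap the call that triggers the warning, for example call_quietly(parse_page, html), where parse_page stands for whatever function of yours uses html5lib.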
This year, I was looking to make a change to this assignment, and I found that wxPython [wxPython] has a graphics widget that is a sort of mini web browser for displaying simple HTML-formatted data. So I thought it would be fun to come up with a graphical version of the weather forecast program.
You may wonder what the point is of parsing HTML and then turning around and generating HTML output. Well, if we want to filter or reformat the content, we must first parse it. We can use HTML parsing to generate plain text from the HTML, but that is not really the point. The point is to use the known formatting to determine which parts of the data have specific meaning, so that just the meaningful data can be extracted or decisions can be made from the data, as is the goal with parsing the zip code data from the US Postal Service web page.
The program that displays the graphical weather forecast is
wxWeather.py. (See Links to Useful Files) Another useful program is
wxHTML.py, which can be used to load and display an HTML file. The
wxWeather.py program also tries to generate a better quick view of
the weather than last year’s programs did. The previous programs display a
detailed forecast, which is ignored by wxWeather.py.
wxWeather.py adds a summary of the current conditions, which required
some additional parsing code to obtain.
wxPython requires wxWidgets, which is a C++ library of platform-independent graphics widgets that use the native OS graphics facilities. So a program running on Windows will look like any Windows program, and the same program run on Linux (GNOME or KDE) or Mac OS X will look like a program developed for those platforms. Since wxWidgets is also required, this is a case where you should download the appropriate installer for your platform from wxPython.org. If you are using Linux, you might need to compile it from the source code. I have some notes on how to do this, if you need them. [wxPython]
So after I spent many hours developing wxWeather.py, I was again
faced with the problem of a year ago: what part of the problem to ask
you to write. Then I realized that my program has a problem, which you can
help me fix. My program starts by giving the forecast for Salina. (Feel free
to change that if you don’t live in Salina.) There is a button to change the
zip code for the weather lookup. But when an invalid zip code is entered, the
program just displays Error: Weather Underground, which comes from the
title of the returned page. The rest of the returned HTML code is totally
different from what my program is looking for. Argh! Probably the
simplest approach would be to look for the Error string in the title and then
branch to a different parser. But that got me thinking about doing a query on
a zip code database, and a quick Google search led to the Postal Service web
page. Reading the HTML and JavaScript gave clues to the correct URL to
automate the query. See zipcode.py, where I did some testing to
validate the URL.
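The simplest approach mentioned above, looking for the Error string in the page title, might be sketched as follows. This is not the code in wxWeather.py. It uses html.parser.HTMLParser, the Python 3 name of the HTMLParser module, and the names TitleParser, page_title, and is_error_page are made up for this illustration.

```python
from html.parser import HTMLParser  # Python 3 home of the HTMLParser class


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def page_title(html):
    """Return the stripped title text of an HTML page."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def is_error_page(html):
    """True when the title starts with 'Error', as it does for a bad zip code."""
    return page_title(html).startswith("Error")
```

A program could call is_error_page on the returned HTML first and only hand the page to the forecast parser when the result is False.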
So writing your own parser for a fairly simple web page, just to make a True/False logical decision, seems like a better programming assignment than trying to fit your code into my more complex parser. A zip code checking function might also be useful in other programming situations you attempt in the years to come. If you look at my code, in the WeatherPanel.load_weather method, you can see where I test that the zip code is five characters long and contains only digits. This is the appropriate place to call your zip code validity function. If you want to, you can create a wx.MessageDialog error window similar to the one that is already there. (You will probably find that wxPython is pretty fun to play with.)
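As a rough illustration of how the two checks could fit together, here is a sketch of the format test followed by a call to a validity function. The names is_wellformed_zip and zip_is_usable are hypothetical and are not taken from wxWeather.py; lookup stands for the function you write for the assignment.

```python
import re


def is_wellformed_zip(zipcode):
    """True if zipcode is exactly five ASCII digits."""
    return re.fullmatch(r"[0-9]{5}", zipcode) is not None


def zip_is_usable(zipcode, lookup):
    """Run the cheap format test first, then the (hypothetical) lookup.

    lookup is any callable that takes a zip code string and returns
    True or False, such as a function that queries a zip code web page.
    """
    return is_wellformed_zip(zipcode) and bool(lookup(zipcode))
```

Doing the format test first means the web query is never made for input that could not possibly be a zip code.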
Parse the HTML code returning a tree data structure.
html5lib is an HTML parser/tokenizer based on the WHATWG HTML5 specification for
maximum compatibility with major desktop web browsers. See the html5lib
project page [HT5LIB] for examples of
how it may be used. For the weather parsing programs, I just used it to
generate a serial stream of tokens. This may not be the best way to use it, but it
is probably the simplest technique. The serial stream approach is similar to
the call-back technique used with HTMLParser. The only real advantage
of using html5lib in this case is that the parser is more robust at
handling complex HTML.
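To give a feel for the serial stream idea, here is a minimal sketch that uses the tree walker found in current html5lib releases. That is an assumption on my part about which entry point to show: the weather programs may obtain their tokens differently, and the function names token_stream and text_inside are invented for this example.

```python
def token_stream(html):
    """Parse html with html5lib and return an iterable of token dictionaries."""
    # html5lib is a third-party package. It is imported here so that
    # text_inside below can be tried without it installed.
    import html5lib

    tree = html5lib.parse(html)               # build an ElementTree-style tree
    walker = html5lib.getTreeWalker("etree")  # walker class for that tree type
    return walker(tree)                       # one token dict per tag or text run


def text_inside(tokens, tag):
    """Collect the text found between the start and end tokens of tag."""
    depth = 0
    pieces = []
    for token in tokens:
        kind = token["type"]
        if kind == "StartTag" and token["name"] == tag:
            depth += 1
        elif kind == "EndTag" and token["name"] == tag:
            depth -= 1
        elif kind in ("Characters", "SpaceCharacters") and depth > 0:
            pieces.append(token["data"])
    return "".join(pieces)
```

With html5lib installed, text_inside(token_stream(page), "title") would return the title text of page. The loop over tokens plays the same role as the handle_starttag, handle_endtag, and handle_data call-backs of HTMLParser.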