Parsing HTML with PYTHON

How to parse HTML with PYTHON

HTML is a markup language with a simple structure, so it would be fairly easy to build a parser for it from scratch. In Python, however, several ready-made solutions are already available.

Available Python libraries

  • HTML Parser of The Standard Library
  • Html5lib
  • Html5-parser
  • Lxml
  • AdvancedHTMLParser
  • Beautiful Soup

HTML Parser of The Standard Library

The Python standard library is quite rich and even includes an HTML parser. The bad news is that it works like a simple, traditional parser, so it offers no advanced functionality geared toward handling HTML. It essentially provides a visitor with basic callbacks for handling the start of a tag, the end of a tag, and the data inside tags.

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head><body><h1>Parse me!</h1></body></html>')

It works, but it does not really offer anything better than a parser generated by ANTLR or any other generic parser generator.
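To make the visitor style concrete, here is a minimal sketch that collects the link targets of a document. The class name LinkCollector and the sample markup are made up for illustration; only HTMLParser and its handle_starttag callback come from the standard library. Note that all the bookkeeping has to be done by hand.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers the href of every <a> tag it visits."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed('<p><a href="/one">1</a> and <A HREF="/two">2</A></p>')
collector.close()
print(collector.links)
```

Anything more structured than this, such as "the links inside the second paragraph", means tracking the nesting state yourself in the callbacks.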

Html5lib

html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers. It is considered a good library for parsing HTML5, but a very slow one, partly because it is written in Python and not in C like some of the alternatives. By default, parsing produces an ElementTree tree, but it can be set to create a DOM tree based on xml.dom.minidom. html5lib also provides tree walkers, which simplify traversing the tree, and serializers. The following example shows the parser, walker and serializer in action.

html5lib example with walker and serializer

import html5lib

element = html5lib.parse('<p xml:lang="pl">Witam wszystkich')
walker = html5lib.getTreeWalker("etree")
stream = walker(element)
s = html5lib.serializer.HTMLSerializer()
output = s.serialize(stream)
for item in output:
  print("%r" % item)
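The parser can also be asked for a DOM tree instead of the default ElementTree one. The following is a minimal sketch, assuming html5lib is installed; it reuses the unclosed paragraph from the example above, and the variable names are made up for illustration.

```python
import html5lib

# Ask for a DOM tree (xml.dom.minidom) instead of the default ElementTree
document = html5lib.parse('<p xml:lang="pl">Witam wszystkich', treebuilder="dom")

# The parser supplies the missing <html>, <head> and <body> elements,
# the way a browser would
paragraph = document.getElementsByTagName("p")[0]
paragraph_text = paragraph.firstChild.data
parent_name = paragraph.parentNode.tagName
print(parent_name, paragraph_text)
```

The result is a regular minidom document, so the usual DOM methods apply to it.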

Its documentation is sparse.

Html5-parser

Html5-parser is a parser for Python, but written in C. It is also just a parser that produces a tree: it exposes literally one function, named parse. The documentation compares it to html5lib, claiming that it is 30x quicker. To produce the output tree it relies, by default, on the lxml library, which can also pretty-print the output. The documentation even refers to the lxml documentation to explain how to navigate the resulting tree.

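from html5_parser import parse
from lxml.etree import tostring

# some_html is a placeholder for the HTML text you want to parse
some_html = '<p>Hello, world!</p>'
root = parse(some_html)
print(tostring(root))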

Lxml

By its own description, lxml is the most feature-rich and easy-to-use library for processing XML and HTML in Python. It is probably the most used low-level parsing library for Python, because of its speed, reliability and features. It is written in Cython, but it relies mostly on the C libraries libxml2 and libxslt. Calling it low-level does not mean that it is usable only as a building block, but rather that other HTML libraries also build on it. The library is designed to work with the ElementTree API, a container for storing XML documents in memory. If you are not familiar with it, the important thing to know is that it is an old-school way of dealing with (X)HTML.
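To give a feel for the ElementTree style, here is a small sketch. It deliberately uses the standard library's xml.etree.ElementTree rather than lxml, so it runs without extra packages; this is a stand-in, and the standard library version only understands a limited subset of XPath. The XHTML fragment is made up for the example and has to be well-formed.

```python
import xml.etree.ElementTree as ET

# Well-formed XHTML fragment, invented for this example
xhtml = """
<html>
  <body>
    <div class="menu">
      <a href="/home">Home</a>
      <a href="/about">About</a>
    </div>
    <a href="/outside">Outside</a>
  </body>
</html>
"""

root = ET.fromstring(xhtml)

# Elements behave like lists of children and carry .tag, .attrib and .text
body = root.find("body")
child_tags = [child.tag for child in body]

# A path expression selects only the links inside the div with class "menu"
menu_anchors = root.findall(".//div[@class='menu']/a")
menu_links = [a.get("href") for a in menu_anchors]
menu_labels = [a.text for a in menu_anchors]
print(child_tags, menu_links, menu_labels)
```

Everything is an element with a tag, attributes, text and children, and you reach nodes through path expressions rather than through HTML-specific helpers.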

Basically, you are going to search with XPath and work as if it were the golden age of XML. Fortunately, there is also a specific package for HTML, lxml.html, that provides a few features specifically for parsing HTML. The most important one is support for CSS selectors to easily find elements.

There are also many other features, for example:

  • it can submit forms
  • it provides an internal DSL to create HTML documents
  • it can remove unwanted elements from the input, such as script content or CSS style annotations (i.e., it can clean HTML in the semantic sense, eliminating foreign elements)

In short: it can do many things, but not always in the easiest way you can imagine.

from urllib.request import urlopen  # Python 3 home of urlopen
from lxml.html import fromstring

url = 'http://microformats.org/'
content = urlopen(url).read()
doc = fromstring(content)
doc.make_links_absolute(url)

# [..]
# Card and Phone are assumed to be simple container classes
# defined in the elided part

# some handy functions for microformats
def get_text(el, class_name):
    # text of the first descendant with the given class, or '' if none
    els = el.find_class(class_name)
    if els:
        return els[0].text_content()
    else:
        return ''

def get_value(el):
    return get_text(el, 'value') or el.text_content()

def get_all_texts(el, class_name):
    return [e.text_content() for e in el.find_class(class_name)]

def parse_addresses(el):
    # Ideally this would parse street, etc.
    return el.find_class('adr')

# the parsing:
for el in doc.find_class('hcard'):
    card = Card()
    card.el = el
    card.fn = get_text(el, 'fn')
    card.tels = []
    # search the hcard element itself for telephone entries
    for tel_el in el.find_class('tel'):
        card.tels.append(Phone(get_value(tel_el),
                               get_all_texts(tel_el, 'type')))
    card.addresses = parse_addresses(el)

AdvancedHTMLParser

AdvancedHTMLParser is a Python parser that aims to reproduce the behaviour of raw JavaScript in Python. By raw JavaScript, I mean without jQuery or CSS selector syntax. So, it builds a DOM-like representation that you can interact with: if something works in JavaScript on an HTML tag element, it should work in Python on an AdvancedTag element. The parser also adds a few extra features. For instance, it supports direct modification of attributes (e.g., tag.id = "nope") instead of the JavaScript-like syntax (e.g., the setAttribute function). It can also perform a basic validation of an HTML document (i.e., check for missing closing tokens) and output prettified HTML.

The most important addition, though, is the support for advanced search and filtering methods for tags. The find method searches values and attributes, while filter is more advanced. The latter depends on another library called QueryableList, which is described as “ORM-style filtering to any list of items”. It is not as powerful as XPath or CSS selectors, and its syntax is not one familiar from HTML manipulation; it is, however, similar to the one used for database queries.

The documentation is good enough, though it consists just of what you find in the README of the GitHub project and the following example in the source code.

#!/usr/bin/env python
import AdvancedHTMLParser

if __name__ == '__main__':
    parser = AdvancedHTMLParser.AdvancedHTMLParser()
    parser.parseStr('''
    # html text here
    ''')

    # Get all items by name
    items = parser.getElementsByName('items')

    print("Items less than $4.00: ")
    print("-----------------------\n")

    for item in items:
        priceEm = item.getElementsByName('price')[0]
        priceValue = round(float(priceEm.innerHTML.strip()), 2)
        if priceValue < 4.00:
            name = priceEm.getPeersByName('itemName')[0].innerHTML.strip()
            print("%s - $%.2f" % (name, priceValue))
# OUTPUT:
# Items less than $4.00: 
# -----------------------
# 
# Sponges - $1.96
# Turtles - $3.55
# Coop - $1.44
# Pudding Cups - $1.60

Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. As the description on its website reminds you, Beautiful Soup is technically not a parser itself: it relies on a parser behind the scenes, such as the standard Python parser or lxml. However, in practical terms, if you are using Python and you need to parse HTML, you probably want to use something like Beautiful Soup.

Beautiful Soup is the go-to library when you need an easy way to parse HTML documents. In terms of features, it might not provide everything you can think of, but it probably gives you everything you actually need. While you can navigate the parse tree yourself, using standard functions to move around the tree (e.g., next_element, find_parent), you are probably going to use the simpler methods it provides.

The first are CSS selectors, which make it easy to select the needed elements of the document. But there are also simpler functions to find elements by name or to access tags directly (e.g., title). Both are quite powerful, but the first will be more familiar to users of JavaScript, while the second is more Pythonic.

import re
from bs4 import BeautifulSoup

# html_doc is assumed to hold the HTML text to parse
soup = BeautifulSoup(html_doc, 'html.parser')
# it finds all nodes whose href matches the regular expression
# and whose id is link1
soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
# CSS selectors
soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

There are a few functions to manipulate the document and easily add or remove elements. For instance, there are functions to wrap an element inside a provided one, or to do the inverse operation. Beautiful Soup also provides functions to pretty-print the output or to get only the text of the HTML document.
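As a rough illustration of these manipulation functions, here is a minimal sketch, assuming the bs4 package is installed and using the standard library parser as the backend. The one-line document and the variable names are made up for the example.

```python
from bs4 import BeautifulSoup

html = "<p>Hello <b>world</b></p>"
soup = BeautifulSoup(html, "html.parser")

# wrap: put the <b> element inside a newly created <i> element
soup.b.wrap(soup.new_tag("i"))
wrapped = str(soup)

# unwrap: the inverse operation, drop the <i> tag but keep its children
soup.i.unwrap()
unwrapped = str(soup)

# get_text: only the text of the document, without any markup
text = soup.get_text()

print(wrapped)
print(unwrapped)
print(text)
# prettify: the document re-indented, one element per line
print(soup.prettify())
```

After the unwrap call the document is back to its original form, which is a quick way to check that the two operations really are inverses.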

The documentation is great: there is an explanation and plenty of examples for every feature. There is no official tutorial, but given the quality of the documentation, one is not really needed.