How to parse HTML with Python
HTML is a markup language with a simple structure, so building a parser for it would be relatively easy. In Python, however, there are already several solutions available.
Available Python libraries
- HTML Parser of the Standard Library
- html5lib
- html5-parser
- lxml
- AdvancedHTMLParser
- Beautiful Soup
HTML Parser of the Standard Library
The standard Python library is quite rich and even implements an HTML parser. The bad news is that it works like a simple, traditional parser, so there are no advanced functionalities geared toward handling HTML. The parser essentially makes available a visitor with basic callbacks for handling the data inside tags, and the beginning and end of tags.
```python
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')
```
It works, but it does not really offer anything better than a parser generated by ANTLR or any other generic parser generator.
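That said, the visitor callbacks are enough for simple extraction tasks. As a minimal sketch (the sample markup is my own), the `attrs` argument of `handle_starttag` can be used to collect every link in a document:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<p><a href="/one">one</a> and <a href="/two">two</a></p>')
print(parser.links)  # ['/one', '/two']
```

Anything beyond this kind of event-driven extraction — building a tree, querying it — you have to implement yourself.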
html5lib
html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers. It is considered a good library for parsing HTML5, but also a very slow one, partially because it is written in Python rather than C, like some of the alternatives. By default, parsing produces an ElementTree tree, but it can be configured to create a DOM tree based on xml.dom.minidom. html5lib also provides walkers that simplify traversing the tree, and serializers. The following example shows the parser, walker and serializer in action.
html5lib example with walker and serializer
```python
import html5lib

element = html5lib.parse('<p xml:lang="pl">Witam wszystkich')
walker = html5lib.getTreeWalker("etree")
stream = walker(element)
s = html5lib.serializer.HTMLSerializer()
output = s.serialize(stream)
for item in output:
    print("%r" % item)
```
Its documentation, however, is sparse.
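The minidom option mentioned above is selected through the `treebuilder` argument of `parse`. A minimal sketch, assuming html5lib is installed (the sample markup is my own):

```python
import html5lib

# By default html5lib.parse returns an ElementTree element;
# treebuilder="dom" yields an xml.dom.minidom document instead,
# so the familiar DOM methods become available.
document = html5lib.parse("<p>Hello <b>world</b>", treebuilder="dom")
paragraphs = document.getElementsByTagName("p")
print(paragraphs[0].toxml())
```

Note that, like any conforming HTML5 parser, html5lib fills in the missing `html`, `head` and `body` elements and closes the unterminated tags.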
html5-parser
html5-parser is a parser for Python, but written in C. It is also just a parser that produces a tree; it exposes literally one function, named parse. The documentation compares it to html5lib, claiming that it is 30x quicker. To produce the output tree it relies, by default, on the lxml library; the same library also allows pretty printing the output. The documentation even refers to lxml's documentation to explain how to navigate the resulting tree.
```python
from html5_parser import parse
from lxml.etree import tostring

# some_html holds the markup to parse
root = parse(some_html)
print(tostring(root))
```
lxml
lxml is the most feature-rich and easy-to-use library for processing XML and HTML in Python. It is probably the most used low-level parsing library for Python, because of its speed, reliability and features. It is written in Cython, and it relies mostly on the C libraries libxml2 and libxslt. This does not mean that it is only a low-level library, though: it is also used by other HTML libraries. lxml is designed to work with the ElementTree API, a container for storing XML documents in memory. If you are not familiar with it, the important thing to know is that it is an old-school way of dealing with (X)HTML.
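If the ElementTree API is new to you, a short sketch with the standard library shows the style of navigation that lxml mirrors (lxml.etree exposes the same calls; the sample markup is my own):

```python
import xml.etree.ElementTree as ET

# A tree of Element objects, navigated with find/findall
# and simple XPath-like location paths.
root = ET.fromstring(
    "<html><body><h1>Title</h1><p>First</p><p>Second</p></body></html>"
)
h1 = root.find("body/h1")
print(h1.text)  # Title

texts = [p.text for p in root.findall(".//p")]
print(texts)    # ['First', 'Second']
```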
Basically, you search with XPath and work as if it were the golden age of XML. Fortunately, there is also a specific package for HTML, lxml.html, that provides a few features specifically for parsing HTML. The most important one is support for CSS selectors, which makes it easy to find elements.
There are also many other features, for example:
- it can submit forms
- it provides an internal DSL to create HTML documents
- it can remove unwanted elements from the input, such as script content or CSS style annotations (i.e., it can clean HTML in the semantic sense, eliminating foreign elements)
In short: it can do many things, but not always in the easiest way you can imagine.
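A minimal sketch of the basics, assuming lxml is installed (the sample markup is my own): generic XPath queries work on any lxml tree, while lxml.html adds HTML-specific helpers such as `find_class()` and `text_content()`. (CSS selectors are available through `cssselect()`, which requires the separate cssselect package, so they are left out here.)

```python
from lxml import html

fragment = html.fromstring(
    '<div class="post"><h2>Hello</h2>'
    '<p>Read <a href="/more">more</a>.</p></div>'
)

# XPath works on any lxml tree...
print(fragment.xpath('//h2/text()'))  # ['Hello']

# ...while lxml.html adds HTML-specific helpers:
# find_class() matches elements by CSS class,
# text_content() flattens an element to its text.
post = fragment.find_class('post')[0]
print(post.text_content())
```

A longer, real-world example, taken from the lxml documentation, extracts microformat data from a page: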
```python
from urllib.request import urlopen
from lxml.html import fromstring

url = 'http://microformats.org/'
content = urlopen(url).read()
doc = fromstring(content)
doc.make_links_absolute(url)

# [..]

# some handy functions for microformats

def get_text(el, class_name):
    els = el.find_class(class_name)
    if els:
        return els[0].text_content()
    else:
        return ''

def get_value(el):
    return get_text(el, 'value') or el.text_content()

def get_all_texts(el, class_name):
    return [e.text_content() for e in el.find_class(class_name)]

def parse_addresses(el):
    # Ideally this would parse street, etc.
    return el.find_class('adr')

# the parsing:
for el in doc.find_class('hcard'):
    card = Card()  # Card and Phone are application classes defined elsewhere
    card.el = el
    card.fn = get_text(el, 'fn')
    card.tels = []
    for tel_el in el.find_class('tel'):
        card.tels.append(Phone(get_value(tel_el),
                               get_all_texts(tel_el, 'type')))
    card.addresses = parse_addresses(el)
```
AdvancedHTMLParser
AdvancedHTMLParser is a Python parser that exposes a DOM-like interface (e.g., getElementsByName); its advanced filtering syntax, however, is similar to the one used for database queries. The documentation is good enough, though it consists of just the README of the GitHub project and the following example from the source code.
```python
#!/usr/bin/env python
import AdvancedHTMLParser

if __name__ == '__main__':
    parser = AdvancedHTMLParser.AdvancedHTMLParser()
    parser.parseStr('''
    # html text here
    ''')

    # Get all items by name
    items = parser.getElementsByName('items')

    print("Items less than $4.00: ")
    print("-----------------------\n")
    for item in items:
        priceEm = item.getElementsByName('price')[0]
        priceValue = round(float(priceEm.innerHTML.strip()), 2)
        if priceValue < 4.00:
            name = priceEm.getPeersByName('itemName')[0].innerHTML.strip()
            print("%s - $%.2f" % (name, priceValue))

# OUTPUT:
# Items less than $4.00:
# -----------------------
#
# Sponges - $1.96
# Turtles - $3.55
# Coop - $1.44
# Pudding Cups - $1.60
```
Beautiful Soup
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. As the description on its website reminds you, technically Beautiful Soup is not exactly a parser: it can use a few parsers behind the scenes, such as the standard Python parser or lxml. In practical terms, however, if you are using Python and need to parse HTML, you probably want something like Beautiful Soup to work with it.
Beautiful Soup is the go-to library when you need an easy way to parse HTML documents. In terms of features, it might not provide everything you can think of, but it probably provides everything you actually need. While you can navigate the parse tree yourself, using standard functions to move around the tree (e.g., next_element, find_parent), you are probably going to use the simpler search methods it provides.
```python
import re
from bs4 import BeautifulSoup

# html_doc holds the document to parse
soup = BeautifulSoup(html_doc, 'html.parser')

# it finds all nodes satisfying the regular expression
# and having the matching id
soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">three</a>]

# CSS selectors
soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```
There are a few functions to manipulate the document and easily add or remove elements. For instance, there are functions to wrap an element inside a provided one, or to do the inverse operation. Beautiful Soup also provides functions to pretty print the output, or to get only the text of the HTML document.
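The wrapping functions just mentioned are `wrap()` and `unwrap()`, and the text extraction is `get_text()`. A minimal sketch (the sample markup is my own):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Some <b>bold</b> text</p>', 'html.parser')

# unwrap() removes a tag but keeps its contents in place
soup.b.unwrap()
print(soup)             # <p>Some bold text</p>

# wrap() does the inverse: it encloses an element in a new tag
soup.p.wrap(soup.new_tag('div'))
print(soup)             # <div><p>Some bold text</p></div>

# get_text() returns only the text of the document
print(soup.get_text())  # Some bold text
```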
The documentation is great: there is an explanation and plenty of examples for every feature. There is no official tutorial, but given the quality of the documentation, one is not really needed.