Common tips for writing scrapers

The following doc contains a list of useful recipes to help scrape data from legislative websites. These are by no means the only ways to do these things, but they describe some approaches we’ve found to work well.

Fetching a page and setting URLs to absolute paths

It’s handy to be able to set all the relative URL paths to absolute paths. lxml has a pretty neat facility for doing this.

It’s not uncommon to see a method such as:

import lxml.html

def lxmlize(self, url):
    # fetch the page, parse it, and rewrite relative links as absolute URLs
    entry = self.get(url).text
    page = lxml.html.fromstring(entry)
    page.make_links_absolute(url)
    return page
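
For example, a scraper might use a helper like that to walk from a listing page to detail pages (the URL and XPath here are made up for illustration):

def scrape(self):
    # hypothetical listing page; lxmlize returns a parsed tree with absolute links
    page = self.lxmlize("http://legislature.example.gov/members")
    for link in page.xpath("//a[@class='member']/@href"):
        # thanks to make_links_absolute, link is a full URL, not a relative path
        detail = self.lxmlize(link)
        # ... pull data out of the detail page here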

Getting the current session

We might want to know what the current legislative session is. A legislative session is required for a bill, and it can also help limit the duration of a scrape (for legislatures that have persistent pages, we probably don’t want to scrape every bill/legislator/event back to when they started keeping track!). Sessions are created in __init__.py as a list of dictionaries. Jurisdictions can do all kinds of weird things with sessions (we’ve seen them create sessions inside sessions), so keeping track based on date won’t work. Instead, you’ll need to order the sessions yourself in reverse chronological order, with the current session first. For example:

legislative_sessions = [
    {"identifier": "2015",
     "name": "2015 Regular Session",
     "start_date": "2015-01-01",
     "end_date": "2016-12-31"},
    {"identifier": "2013",
     "name": "2013 Regular Session",
     "start_date": "2013-01-01",
     "end_date": "2014-12-31"},
]

Then to get the current session from any scraper, you can call:

self.jurisdiction.legislative_sessions[0]
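
For instance, a bill scraper might default to the newest session (a rough sketch, assuming a pupa-style Scraper subclass; the class name and method body here are illustrative):

from pupa.scrape import Scraper

class ExampleBillScraper(Scraper):
    def scrape(self, session=None):
        if session is None:
            # sessions are ordered newest-first, so [0] is the current one
            session = self.jurisdiction.legislative_sessions[0]["identifier"]
        # ... fetch and yield bills for `session` here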

Common XPath tricks

The following is a small list of very common tricks hackers use in xpath expressions.

Quick text grabs

Getting text values of HTML elements:

//some-tag/ul/li/text()

Which would be roughly similar to the following pseudo-code:

[x.text for x in page.xpath("//some-tag/ul/li")]

# or, more abstractly:

for el in page.xpath("//some-tag/ul/li"):
    deal_with(el.text)

This is helpful for quickly getting the text values of a bunch of nodes at once without having to call .text on each of them. It’s worth noting that this is different behavior from .text_content().
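
A minimal sketch of that difference, using a made-up fragment: .text stops at the first child element, .text_content() flattens all descendant text, and /text() returns every direct text node.

import lxml.html

el = lxml.html.fromstring("<div>Sponsor: <a>Smith</a> (R)</div>")
print(el.text)               # "Sponsor: "           (stops at the <a> child)
print(el.text_content())     # "Sponsor: Smith (R)"  (all descendant text)
print(el.xpath("./text()"))  # ["Sponsor: ", " (R)"] (each direct text node)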

Class limiting / ID limiting

Sometimes it’s helpful to get particular nodes of a given class or ID:

//some-tag[@class='foo']//div[@id='joe']

This expression will find all div objects with an id of joe (I know, you should only use an id once, but alas sometimes these things happen) that are sub-nodes of a some-tag with a class of foo.

You can also limit by other things, such as text():

//some-other-tag[text()="FULL TEXT"]/*

This will find any some-other-tag whose text() is exactly FULL TEXT and select its child elements (that’s the trailing /*). As you can guess, most XPath functions can be used inside these bracketed predicates in the same way.
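
From lxml, using those predicates might look like this (the tag names, class, id, and the page variable are the placeholders from the examples above):

# divs with id "joe" nested anywhere under a some-tag with class "foo"
divs = page.xpath("//some-tag[@class='foo']//div[@id='joe']")

# every child element of a some-other-tag whose text is exactly "FULL TEXT"
children = page.xpath('//some-other-tag[text()="FULL TEXT"]/*')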

Contains queries

Building on the above, it’s sometimes necessary to search for all elements whose class attribute contains a given string (sites often have a fair amount of autogenerated noise around an ID or class name, but a substring stays in place).

Let’s take a look at limiting queries:

//some-tag[contains(@class, 'MainDiv')]

This will find any instance of some-tag whose class contains the substring MainDiv. For example, it will match an element such as <some-tag class='FooBar12394MainDiv333'></some-tag>, but it will not match <some-tag class='FooBarMain123Divsf'></some-tag> or a some-tag without a class.

Keep in mind that the @foo can be any attribute of the HTML element, such as @src for an img tag or an @href for an a tag.
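
For example, applying contains() to @href is a handy way to pick out just the links you care about (the /bills/ substring is made up for illustration):

# every <a> whose href contains the substring "/bills/"
bill_urls = page.xpath("//a[contains(@href, '/bills/')]/@href")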

Array Access

Warning

Be careful with this one!

You can select elements by position using square brackets inside the expression (similar to indexing in Python itself), although this tends not to be advised (since positions can often change, and you may end up pulling in bad data).

However, this is sometimes needed:

//foobar/baz[1]/*

to get all entries under the first baz under a foobar. It’s also worth noting that XPath indexes are 1-based, not 0-based. Start your counts from 1, not 0, and you’ll have a much better day!
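
A quick sketch of the off-by-one difference (page is a parsed tree as above; foobar and baz are placeholder tags):

# XPath [1]: the first baz child of each foobar, selected inside the query
first_per_parent = page.xpath("//foobar/baz[1]")

# Python [0]: the first item of the returned list, counted from zero
all_baz = page.xpath("//foobar/baz")
first_overall = all_baz[0]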

Axis Overview

XPath also features what are known as “axes”. An axis is a way of selecting other nodes relative to a given node (which is usually selected by an XPath expression).

The most useful ones are following-sibling and parent.

Let’s take a look at following-sibling:

//th[contains(text(), "foo")]/following-sibling::td

This will find any th elements that contain foo in their text(), and then select any td elements that are siblings following that th.

Or, if we look at a parent relation:

//img[@id='foo']/parent::div[@class='bar']/text()

will fetch the text of a div with a class of bar whose direct child is an img with an id of foo. Note that parent:: only goes up one level; if you need to walk further up the tree, use the ancestor:: axis instead.
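
A common real-world use of following-sibling is pulling a value out of a label/value table, where the label sits in a th and the value in the td next to it (the "Phone" label and the surrounding markup are hypothetical):

# the <td> immediately following the <th> whose text contains "Phone"
phone = page.xpath(
    '//th[contains(text(), "Phone")]/following-sibling::td[1]/text()'
)[0].strip()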

Writing “defensive” scrapers

We tend to write deliberately fragile scrapers, prone to break very loudly (and as early as possible) when the site changes.

As a general rule, if the site has changed, we have a strong chance of pulling in bad data. As a result, we don’t want the scraper to continue without throwing an error, so that we can be sure bad data never gets imported into the database. We do this by hard-coding very fragile XPaths that use full names (rather than contains, unless there’s a reason to), and by always double-checking that the incoming data looks sane (raising an exception if it doesn’t).

One way that’s common to help trigger breakage when table rows get moved around is to unpack the list into variables - this also has an added bonus of being more descriptive in what is where in the row, which aids in debugging a broken scraper. Usually, you’d see something like:

for row in page.xpath("//table[@id='foo']/tr"):
    name, district, email = row.xpath("./*")

This will trigger breakage if the number of cells in a row changes. It still helps to assert that you have sane values in such a table, since the order of the columns may change, and you could end up changing everyone’s name to “District 5”.
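
One sketch of such a check, placed inside the loop above (the exact format of the district column will vary from site to site):

name, district, email = row.xpath("./*")
# blow up loudly if the columns have been reordered
if not district.text_content().strip().startswith("District"):
    raise ValueError("unexpected district cell: %r" % district.text_content())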

Another common approach is to blindly index into an XPath result, forcing an IndexError if the index isn’t present. This helps catch queries where nothing (or too little) is returned. You should also check the len() of the result to ensure too much wasn’t returned, either.
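
For example (the XPaths here are illustrative), indexing straight into the result catches an empty match, and an explicit length check catches a query that matched more than you expected:

# IndexError if the query matched nothing at all
title = page.xpath("//h1/text()")[0]

# or be explicit about how many matches you expect
emails = page.xpath("//a[starts-with(@href, 'mailto:')]/@href")
if len(emails) != 1:
    raise ValueError("expected exactly one email link, got %d" % len(emails))
email = emails[0]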

Commonly, scrapers need to normalize and transform messy data into clean data (in edge cases such as setting party data), and this is a good place to add a quick check that no unexpected data makes it into the database.

Using a dict to index the scraped data is a good way of doing this:

party = {"democrat": "Democratic",
         "republican": "Republican",
         "independent": "Independent"}[scraped_party.lower().strip()]

You can be sure that if the scraped value isn’t one of the three expected strings, this will raise a KeyError and force someone to check that the scraped data is in fact correct (or to add the new party).

Since this happens infrequently, it’s a pretty good tradeoff for data quality (and is slightly easier to maintain than a big if/elif/else block).

The end goal here is to make sure that no scraper ever allows bad data into the database. So long as your scraper is doing this, you’ve written a defensive scraper!