Common tips for writing scrapers¶
The following doc contains a list of useful recipes to help scrape data from legislative websites. These are by no means the only ways to do these things, but they describe some of the approaches we’ve found to work well.
Fetching a page and setting URLs to absolute paths¶
It’s handy to be able to set all the relative URL paths to absolute paths. lxml has a pretty neat facility for doing this.
It’s not uncommon to see a method such as:
def lxmlize(self, url):
    # self.get() comes from the scraper base class; lxml.html must be imported
    entry = self.get(url).text
    page = lxml.html.fromstring(entry)
    page.make_links_absolute(url)
    return page
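Once the links are absolute, every href pulled from the page is a full URL that can be fetched directly. A rough standalone sketch of the same idea (requests stands in for the scraper’s self.get(), and the URL is made up):

import lxml.html
import requests  # stand-in for the scraper's self.get()

def lxmlize(url):
    entry = requests.get(url).text
    page = lxml.html.fromstring(entry)
    page.make_links_absolute(url)
    return page

page = lxmlize("https://example.gov/bills")  # hypothetical URL
for href in page.xpath("//a/@href"):
    print(href)  # relative links are now absolute URLs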
Getting the current session¶
We might want to know what the current legislative session is. A legislative session is required for a bill, and can be helpful in limiting the duration of a scrape (for legislatures that have persistent pages, we probably don’t want to scrape all bills/legislators/events back to when they started keeping track!). Sessions are created in __init__.py as a list of dictionaries. Jurisdictions can do all kinds of weird things with sessions (we’ve seen them create sessions inside sessions), so keeping track based on date won’t work. Instead, you’ll need to order sessions from newest to oldest, with the current one on top. For example:
legislative_sessions = [
    {"identifier": "2015",
     "name": "2015 Regular Session",
     "start_date": "2015-01-01",
     "end_date": "2016-12-31"},
    {"identifier": "2013",
     "name": "2013 Regular Session",
     "start_date": "2013-01-01",
     "end_date": "2014-12-31"},
]
Then to get the current session from any scraper, you can call:
self.jurisdiction.legislative_sessions[0]
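As a rough sketch of where that lookup tends to sit inside a scrape() method (the listing URL pattern here is made up, and lxmlize is the helper from above):

def scrape(self):
    session = self.jurisdiction.legislative_sessions[0]
    # hypothetical bill listing keyed off the current session identifier
    url = "https://example.gov/bills?session=" + session["identifier"]
    page = self.lxmlize(url)
    ...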
Common XPath tricks¶
The following is a small list of very common tricks hackers use in XPath expressions.
Quick text grabs¶
Getting text values of HTML elements:
//some-tag/ul/li/text()
Which would be roughly similar to the following pseudo-code:
[x.text for x in page.xpath("//some-tag/ul/li")]
# or, more abstractly:
for el in page.xpath("//some-tag/ul/li"):
    deal_with(el.text)
This is helpful for quickly getting the text values of a bunch of nodes at once without having to call .text on all of them. It’s worth noting that this is different behavior than .text_content(): .text only returns the text immediately inside an element (and text() in XPath only selects its direct text nodes), while .text_content() also includes the text of all of its descendants.
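A quick illustration of the difference, using a throwaway fragment:

import lxml.html

el = lxml.html.fragment_fromstring("<div>Senate <b>Bill</b> 1</div>")
print(el.text)               # 'Senate ' -- only the text before the first child element
print(el.xpath("./text()"))  # ['Senate ', ' 1'] -- the element's direct text nodes
print(el.text_content())     # 'Senate Bill 1' -- text of the element and all descendants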
Class limiting / ID limiting¶
Sometimes it’s helpful to get particular nodes of a given class or ID:
//some-tag[@class='foo']//div[@id='joe']
This expression will find all div elements with an id of joe (I know, you should only use an id once, but alas sometimes these things happen) that are sub-nodes of a some-tag with a class of foo.
In addition, you can limit by other things, such as text():
//some-other-tag[text()="FULL TEXT"]/*
This will find the children of any some-other-tag elements whose text() is exactly FULL TEXT. As you can guess, these predicates can be mixed and matched in most XPath expressions.
Contains queries¶
Building on the above, it’s sometimes necessary to search for all class attributes that contain a given string (sites often have quite a bit of autogenerated noise around an ID or class name, but a substring stays in place). Let’s take a look at a contains() query:
//some-tag[contains(@class, 'MainDiv')]
This will find any instance of some-tag whose class contains the substring MainDiv. For example, this will match an element such as <some-tag class='FooBar12394MainDiv333'></some-tag>, but it will not match <some-tag class='FooBarMain123Divsf'></some-tag> or a some-tag without a class.
Keep in mind that the @foo can be any attribute of the HTML element, such as @src for an img tag or @href for an a tag.
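For instance, matching on @href instead of @class (a throwaway inline document, just to show the shape of the query):

import lxml.html

page = lxml.html.fromstring(
    "<div><a href='/docs/hb1.pdf'>HB 1 text</a>"
    "<a href='/bills/hb1'>HB 1 status</a></div>"
)
print(page.xpath("//a[contains(@href, '.pdf')]/@href"))  # ['/docs/hb1.pdf']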
Array Access¶
Warning
Be careful with this one!
You can access indexes of returned lists using square brackets (much like in Python itself), although this is generally not advised (since the counts can often change, and you may end up scraping bad data).
However, this is sometimes needed:
//foobar/baz[1]/*
to get all entries under the 1st baz under a foobar. It’s also worth noting that XPath indexes are 1-based, not 0-based. Start your counts from 1, not 0, and you’ll have a much better day!
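A tiny demonstration of the 1-based counting:

import lxml.html

page = lxml.html.fromstring("<div><ul><li>first</li><li>second</li></ul></div>")
print(page.xpath("//li[1]/text()"))  # ['first'] -- [1] means the first li, not the second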
Axis Overview¶
XPath also features what are known as “axes”. An axis is a way of selecting other nodes relative to a given node (which is usually identified by an XPath expression).
The most useful ones are following-sibling and parent. Let’s take a look at following-sibling:
//th[contains(text(), "foo")]/following-sibling::td
This will find any th elements that contain foo in their text(), and then select any td elements that are later siblings of that th.
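This is the classic pattern for pulling values out of label/value tables; a small self-contained example:

import lxml.html

page = lxml.html.fromstring(
    "<table>"
    "<tr><th>Sponsor</th><td>Smith</td></tr>"
    "<tr><th>District</th><td>5</td></tr>"
    "</table>"
)
print(page.xpath("//th[contains(text(), 'Sponsor')]/following-sibling::td/text()"))  # ['Smith']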
Or, if we look at a parent relation:
//img[@id='foo']/parent::div[@class='bar']/text()
will fetch the text of a div with a class set to bar that has a child node which is an img with an id set to foo. Note that parent:: only selects the immediate parent; to keep walking up towards the root node, use the ancestor:: axis instead.
Writing “defensive” scrapers¶
We tend to write very fragile scrapers, ones that are prone to break loudly (and as early as possible) when or if the site changes.
As a general rule, if the site has changed, we have a strong chance of pulling in bad data. As a result, we don’t want the scraper to continue on without throwing an error, so that we can be sure bad data never gets imported into the database. We do this by hard-coding very fragile XPaths that use full names (rather than contains(), unless there’s a reason to), and by always double-checking that the incoming data looks sane (or raising an Exception if it doesn’t).
One common way to help trigger breakage when table columns get moved around is to unpack each row into variables. This also has the added bonus of documenting what is where in the row, which aids in debugging a broken scraper. Usually, you’d see something like:
for row in page.xpath("//table[@id='foo']/tr"):
    name, district, email = row.xpath("./*")
This will trigger breakage if the number of columns changes. It still helps to assert that you have sane values in such a table, since the order of the entries may also change and you could end up setting everyone’s name to “District 5”.
Another common way of doing this is by blindly using an index off an XPath result, forcing an IndexError if the index isn’t present. This helps avoid queries where nothing is returned, or too little is returned. You should also be careful to check the len() of the values to ensure too much wasn’t returned as well.
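A sketch of both checks together, assuming page is a parsed document like the ones above (the table id and the cap of 500 are arbitrary):

rows = page.xpath("//table[@id='members']/tr")
if not rows or len(rows) > 500:
    raise Exception("unexpected number of member rows: {}".format(len(rows)))
# [0] raises an IndexError if the heading disappears or the layout changes
chamber_name = page.xpath("//h1/text()")[0]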
Commonly, scrapers need to normalize and transform bad data into good data (in edge cases, such as setting party data), and this can be a good place to add a quick check that no unexpected data makes it into the database. Using a dict to index the scraped data is a good way of doing this:
party = {
    "democrat": "Democratic",
    "republican": "Republican",
    "independent": "Independent",
}[scraped_party.lower().strip()]
You can be sure that if the data isn’t one of the expected three values, this will raise a KeyError and force someone to check whether the scraped data is (in fact) correct (or whether a new party needs to be added). Since this happens infrequently enough, it’s a pretty good tradeoff for data quality (and is slightly easier to maintain than a big if/elif/else block).
The end goal here is to make sure that no scraper ever allows bad data into the database. So long as your scraper is doing this, you’ve written a defensive scraper!