Writing a Person Scraper
This document is meant to provide a tutorial-like overview of the steps toward contributing a municipal Person scraper to the Open Civic Data project.
This guide assumes you have a working pupa setup. If you don't, please refer to the introduction on Writing Scrapers.
Special notes about People scrapers
The name is a bit misleading: so-called People scrapers actually scrape Person, Organization, and Membership objects.
The relationship between these three types is so close that they all should be scraped at the same time.
Target Data
People scrapers pull in all sorts of information about Organization, Membership, and Person objects. The target data commonly includes:
- People, and their posts (which bodies they represent)
  - Alternate names
  - Current photo
  - Links (homepage, YouTube account, Twitter account)
  - Contact information (email, physical address, phone number)
  - Any other identifiers that might be commonly used
  - Committee memberships
- Orgs (committees, etc.)
  - Other names
  - Commonly used IDs
  - Contact information for the whole body
  - Posts (sometimes called seats) on the org
  - People in each org, and which seat each of them sits in
Creating a New Person scraper
Our person scraper can be located anywhere, and simply needs to be importable by the __init__.py so that we can reference it in the get_scraper method. Your scraper can even be located in the __init__.py file itself if you want to keep things extra simple, but scraper code can eventually get pretty lengthy, so it's more scalable to break each scraper out into its own file. The default is to put the code in a file called people.py. Open up that file to see the scraper stub generated by the pupa init program. It should look like this:
from pupa.scrape import Scraper, Person


class SeattlePersonScraper(Scraper):
    def scrape(self):
        # needs to be implemented
        pass
This is the default scraper template, which isn’t very useful yet, but it helps to clarify what the intent of the scraper is. Let’s take a closer look.
In order to scrape people and committees, we'll use the scrape method that's been defined in the sample scraper, yielding each Person object. You may also yield an iterable of Person objects, which helps if you are scraping both people and committees for the Jurisdiction but want to keep each scraper's logic in its own routine.
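Splitting the work this way relies only on ordinary generator delegation. A stripped-down sketch in plain Python, with hypothetical dicts standing in for the scraped Person and Organization objects:

```python
def scrape_people():
    # Hypothetical stand-in for a routine that yields Person objects.
    yield {"type": "person", "name": "John Smith"}


def scrape_committees():
    # Hypothetical stand-in for a routine that yields Organization objects.
    yield {"type": "organization", "name": "Transportation Committee"}


def scrape():
    # The top-level scrape() keeps each entity's logic in its own
    # routine and delegates with "yield from".
    yield from scrape_people()
    yield from scrape_committees()


objects = list(scrape())  # the person first, then the committee
```

Pupa consumes the scrape() generator the same way, saving each yielded object in turn.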
As you might have guessed by now, Person scrapers scrape many People, as well as any Membership objects that you might find along the way.
Let's take a look at a sample working Pupa scraper:
from pupa.scrape import Scraper, Person


class SeattlePersonScraper(Scraper):
    def scrape(self):
        john = Person(name="John Smith",
                      district="Position 1",
                      role="Councilmember",
                      primary_org="legislature")
        john.add_source(url="http://example.com")
        yield john
A person requires a name and a membership. The district, role, and primary_org fields allow us to find the post to which John Smith is assigned. Recall that we added this post in __init__. You can go back and add more posts in __init__ if needed. In addition, each entity that's scraped needs a source, which is added using add_source.
Committees and Memberships
As noted, the People scraper can also handle committees. We can use the following code to add committees:
from pupa.scrape import Scraper, Person, Organization


class SeattlePersonScraper(Scraper):
    def scrape(self):
        comm = Organization(name="Transportation Committee",
                            classification="committee",
                            chamber="legislature")
        comm.add_source(url="http://example.com/committees/transit")
        yield comm
And we might want to add relationships between people and committees. The Person object initializer automatically creates a relationship between a person and their primary organization, but if we want to make John Smith a member of the Transportation Committee, we can use the Organization's add_member method. The full script is as follows:
from pupa.scrape import Scraper, Person, Organization


class SeattlePersonScraper(Scraper):
    def scrape(self):
        john = Person(name="John Smith",
                      district="Position 1",
                      role="Councilmember",
                      primary_org="legislature")
        john.add_source(url="http://example.com")
        yield john

        comm = Organization(name="Transportation Committee",
                            classification="committee",
                            chamber="legislature")
        comm.add_source(url="http://example.com/committees/transit")
        comm.add_member(john, role="chair")
        yield comm
Scraper Example
Of course, in real scrapers you'll need to write some code to fetch the list of people in the jurisdiction, or who hold memberships in the legislature. Hardcoding names, as in the examples above, doesn't do much for us, since we won't be able to capture the current state of the world.
As a slightly more fun example, here's a scraper that will scrape the Sunlight website for people's information. This is deliberately a mildly complex example (as well as being purely for fun!), to give you a feel for what a working Person scraper may look like. Note that we're assuming that Sunlight is a committee of the United States. Here's the __init__.py contents:
from pupa.scrape import Jurisdiction, Organization

from .people import UsaPersonScraper


class Usa(Jurisdiction):
    division_id = "ocd-division/country:us"
    classification = "committee"
    name = "United States"
    url = "www.sunlightfoundation.com"
    scrapers = {
        "people": UsaPersonScraper,
    }

    def get_organizations(self):
        org = Organization(name="Sunlight Foundation", classification="committee")
        org.add_post(label="president", role="president")
        org.add_post(label="co-founder", role="co-founder")
        org.add_post(label="staff", role="staff")
        org.add_post(label="fellow", role="fellow")
        org.add_post(label="consultant", role="consultant")
        org.add_post(label="intern", role="intern")
        org.add_source("www.sunlightfoundation.com")
        yield org
And here’s our people scraper:
from pupa.scrape import Scraper, Person

import lxml.html


class UsaPersonScraper(Scraper):
    def scrape(self):
        url = "http://sunlightfoundation.com/team/"
        entry = self.get(url).text
        page = lxml.html.fromstring(entry)
        page.make_links_absolute(url)

        # Each top-level <li> groups the staffers holding one position.
        for position in page.xpath("//ul[contains(@class,'sunlightStaff')]/li"):
            position_name = position.xpath('.//h3')[0].text
            position_name = position_name.replace("Sunlight", "").strip()
            position_name = position_name.rstrip("s")  # singularize, e.g. "fellows" -> "fellow"

            for person in position.xpath(".//li"):
                name = person.xpath(".//span")[0].text.strip()
                homepage = person.xpath("..//a/@href")[0]

                # The role matches one of the posts added in get_organizations.
                member = Person(name=name,
                                role=position_name,
                                primary_org="committee")
                member.add_link(homepage)
                member.add_source(url)
                yield member
Special notes regarding Posts, Memberships and Districts
The keen observer will note that we're using role, district, and primary_org to identify the person's primary position.
Looking at the Popolo spec, you might wonder why this isn't an opaque ID or some sort of slug.
We use full strings to avoid having to search through all available organizations at scrape time; the resolution to a concrete post is done at import time.
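Conceptually, that import-time resolution is a string match of the scraped (district, role, organization) triple against the posts declared in get_organizations. A hypothetical sketch of the matching (this is an illustration, not pupa's actual importer code):

```python
# Posts as they might be declared in get_organizations(); hypothetical data.
posts = [
    {"label": "Position 1", "role": "Councilmember", "org": "legislature"},
    {"label": "Position 2", "role": "Councilmember", "org": "legislature"},
]


def resolve_post(district, role, primary_org):
    # Match the plain strings carried on a scraped Person against the
    # declared posts, the way the importer resolves a membership to a Post.
    for post in posts:
        if (post["label"], post["role"], post["org"]) == (district, role, primary_org):
            return post
    return None


match = resolve_post("Position 1", "Councilmember", "legislature")
```

Because the match happens against the jurisdiction's declared posts, a Person whose district or role string has no corresponding post will fail at import time rather than at scrape time.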