Writing a Person Scraper

This document provides a tutorial-style overview of the steps for contributing a municipal Person scraper to the Open Civic Data project.

This guide assumes you have a working pupa setup. If you don’t, please refer to the introduction on Writing Scrapers.

Special notes about People scrapers

The name is a bit misleading: so-called People scrapers actually scrape Person, Organization, and Membership objects.

The relationship between these three types is so close that they should all be scraped at the same time.

Target Data

People scrapers pull in all sorts of information about Person, Organization, and Membership objects.

The target data commonly includes the following (a code sketch follows the list):

  • People, and their posts (what bodies they represent)

    • Alternate names

    • Current photo

    • Links (homepage, YouTube account, Twitter account)

    • Contact information (email, physical address, phone number)

    • Any other identifiers that might be commonly used

    • Committee memberships

  • Orgs (committees, etc.)

    • Other names

    • Commonly used IDs

    • Contact information for the whole body

    • Posts (sometimes called seats) on the org

    • People in each org, and in which seat they sit.
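
Most of these fields map directly onto attributes and helper methods of the scrape objects. Here is a rough sketch (the values are invented, but add_name, add_link, add_contact_detail, and add_identifier are the helpers pupa provides for the fields above):

from pupa.scrape import Person

jane = Person(name="Jane Doe",
              district="Position 2",
              role="Councilmember",
              primary_org="legislature",
              image="http://example.com/jane.jpg")  # current photo
jane.add_name("Janie Doe")                          # alternate name
jane.add_link("http://example.com/~jane")           # homepage, social accounts
jane.add_contact_detail(type="email",
                        value="jane@example.com",
                        note="work")
jane.add_identifier("JD-42", scheme="internal-id")  # commonly used ID
jane.add_source("http://example.com/council")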

Creating a New Person Scraper

Our person scraper can be located anywhere, and simply needs to be importable from the __init__.py so that we can reference it in the get_scraper method. Your scraper can even be located in the __init__.py file itself if you want to keep things extra simple, but scraper code can eventually get pretty lengthy, so it’s more scalable to break each scraper out into its own file. The default is to put the code in a file called people.py. Open up that file to see the scraper stub generated by the pupa init program. It should look like this:

from pupa.scrape import Scraper, Person


class SeattlePersonScraper(Scraper):

    def scrape(self):
        # needs to be implemented
        pass

This is the default scraper template. It isn’t very useful yet, but it clarifies the intent of the scraper. Let’s take a closer look.

In order to scrape people and committees, we’ll use the scrape method defined in the sample scraper, yielding each Person object. You may also yield an iterable of Person objects, which helps if you are scraping both people and committees for the Jurisdiction but want to keep each scraper’s logic in its own routine.
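
Here’s a minimal sketch of that pattern (scrape_people and scrape_committees are our own helper names, not part of pupa):

from pupa.scrape import Scraper, Person, Organization


class SeattlePersonScraper(Scraper):

    def scrape(self):
        # Yield each helper's generator; pupa will consume
        # every object each iterable produces.
        yield self.scrape_people()
        yield self.scrape_committees()

    def scrape_people(self):
        john = Person(name="John Smith",
                      district="Position 1",
                      role="Councilmember",
                      primary_org="legislature")
        john.add_source("http://example.com")
        yield john

    def scrape_committees(self):
        comm = Organization(name="Transportation Committee",
                            classification="committee",
                            chamber="legislature")
        comm.add_source("http://example.com")
        yield comm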

As you might have guessed by now, Person scrapers scrape many People, as well as any Membership objects that you might find along the way.

Let’s take a look at a sample working pupa scraper:

from pupa.scrape import Scraper, Person


class SeattlePersonScraper(Scraper):

    def scrape(self):
        john = Person(name="John Smith",
                      district="Position 1",
                      role="Councilmember",
                      primary_org="legislature")
        john.add_source(url="http://example.com")
        yield john

A person requires a name and at least one membership. The district, role, and primary_org fields let pupa find the post to which John Smith is assigned. Recall that we added this post in __init__.py; you can go back and add more posts there if needed. In addition, each entity that’s scraped needs a source, which is added using add_source.
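
If you generated your project with pupa init, that post lives in your Jurisdiction’s get_organizations method in __init__.py. A minimal sketch of what that might look like (the division_id and names here are illustrative):

from pupa.scrape import Jurisdiction, Organization
from .people import SeattlePersonScraper


class Seattle(Jurisdiction):
    division_id = "ocd-division/country:us/state:wa/place:seattle"
    classification = "government"
    name = "Seattle"
    url = "http://www.seattle.gov"
    scrapers = {
        "people": SeattlePersonScraper,
    }

    def get_organizations(self):
        council = Organization(name="Seattle City Council",
                               classification="legislature")
        # The post that district="Position 1" and role="Councilmember"
        # will resolve to at import time:
        council.add_post(label="Position 1", role="Councilmember")
        yield council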

Committees and Memberships

As noted, the People scraper can also handle committees. We can use the following code to add committees:

from pupa.scrape import Scraper, Organization


class SeattlePersonScraper(Scraper):

    def scrape(self):
        comm = Organization(name="Transportation Committee",
                            classification="committee",
                            chamber="legislature")
        comm.add_source(url="http://example.com/committees/transit")
        yield comm

We might also want to add relationships between people and committees. The Person initializer automatically creates a membership between a person and their primary organization, but if we want to make John Smith a member of the Transportation Committee, we can use the Organization’s add_member method. The full script is as follows:

from pupa.scrape import Scraper, Person, Organization


class SeattlePersonScraper(Scraper):

    def scrape(self):
        john = Person(name="John Smith",
                      district="Position 1",
                      role="Councilmember",
                      primary_org="legislature")
        john.add_source(url="http://example.com")
        yield john

        comm = Organization(name="Transportation Committee",
                            classification="committee",
                            chamber="legislature")
        comm.add_source(url="http://example.com/committees/transit")
        comm.add_member(john, role="chair")
        yield comm
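
With both objects yielded, running your jurisdiction’s people scrape (for example pupa update seattle people, if your module is named seattle) should pull in John Smith, the Transportation Committee, and the chair membership connecting them.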

Scraper Example

Of course, in real scrapers you’ll need to write some code to fetch the list of people who are in the jurisdiction or hold memberships in the Legislature. Hardcoding names, as in the examples above, doesn’t do much for us, since it won’t capture the current state of the world.

As a slightly more fun example, here’s a scraper that scrapes the Sunlight Foundation website for people’s information. This is deliberately a mildly complex example (as well as being purely for fun!), to give a feel for what a working Person scraper may look like. Note that we’re assuming Sunlight is a committee of the United States. Here are the __init__.py contents:

from pupa.scrape import Jurisdiction, Organization
from .people import UsaPersonScraper


class Usa(Jurisdiction):
    division_id = "ocd-division/country:us"
    classification = "committee"
    name = "United States"
    url = "http://www.sunlightfoundation.com"
    scrapers = {
        "people": UsaPersonScraper,
    }

    def get_organizations(self):
        org = Organization(name="Sunlight Foundation", classification="committee")

        org.add_post(label="president", role="president")
        org.add_post(label="co-founder", role="co-founder")
        org.add_post(label="staff", role="staff")
        org.add_post(label="fellow", role="fellow")
        org.add_post(label="consultant", role="consultant")
        org.add_post(label="intern", role="intern")

        org.add_source("http://www.sunlightfoundation.com")

        yield org

And here’s our people scraper:

from pupa.scrape import Scraper, Person
import lxml.html


class UsaPersonScraper(Scraper):

    def scrape(self):
        url = "http://sunlightfoundation.com/team/"
        entry = self.get(url).text
        page = lxml.html.fromstring(entry)
        page.make_links_absolute(url)

        for position in page.xpath("//ul[contains(@class,'sunlightStaff')]/li"):
            # Turn headings like "Sunlight Fellows" into the
            # singular role name "Fellow".
            position_name = position.xpath('.//h3')[0].text
            position_name = position_name.replace("Sunlight", "").strip()
            position_name = position_name.rstrip("s")

            for person in position.xpath(".//li"):
                name = person.xpath(".//span")[0].text.strip()
                homepage = person.xpath(".//a/@href")[0]
                member = Person(name=name,
                                role=position_name,
                                primary_org="committee")
                member.add_link(homepage)
                member.add_source(url)
                yield member

Special notes regarding Posts, Memberships and Districts

The keen observer will note that we’re using role, district, and primary_org to identify the person’s primary position.

Looking at the Popolo spec, you might wonder why this isn’t an opaque ID or some sort of slug.

We use full strings to avoid having to search through all available organizations at scrape time; the resolution to actual IDs is done at import time.