Running the Scraper
As you develop, it is a good idea to run the scraper regularly to make sure the output JSON is in good shape.
Run the scraper:
$ pupa update seattle
Here, seattle is simply a Python-importable path to your scraper directory. From there, the jurisdiction object will be able to tell pupa where to find the scrapers.
In addition, there are some useful arguments to know about.
First, when testing locally, --fast disables Pupa’s scrape throttling and uses the scrape_cache to avoid fetching pages over the network. This is useful when prototyping, but shouldn’t be used regularly: it puts more load on the websites being scraped, and it will read stale data if your cache stays around.
Second, if you don’t have an opencivicdata Postgres database set up, it’s useful to pass --scrape to pupa, which prevents the --import and --report stages from running.
Lastly, you can restrict which scrapers get run by listing people, bills, events or votes after the jurisdiction.
At any point, you can run:
$ pupa update -h
This prints the most up-to-date information on how to invoke Pupa.
Usually, during rapid development, the invocation would look something like:
$ pupa update seattle people --fast
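If you drive the scraper from a script rather than typing the command each time, the invocation can be assembled in Python. The following is a minimal sketch, not part of Pupa itself: build_update_command and run_update are hypothetical helper names, and the only arguments used are the ones described above (--fast, --scrape, and the scraper names after the jurisdiction).

```python
import subprocess


def build_update_command(jurisdiction, scrapers=(), fast=False, scrape_only=False):
    """Assemble the argument list for a `pupa update` run.

    Hypothetical helper: it only strings together the arguments
    described in this section and knows nothing else about Pupa.
    """
    cmd = ["pupa", "update", jurisdiction]
    # Scraper names such as "people", "bills", "events" or "votes"
    # go directly after the jurisdiction.
    cmd.extend(scrapers)
    if fast:
        cmd.append("--fast")      # local testing only: no throttling, use the cache
    if scrape_only:
        cmd.append("--scrape")    # skip the import and report stages
    return cmd


def run_update(jurisdiction, **options):
    """Run the assembled command and raise if pupa exits with an error."""
    return subprocess.run(build_update_command(jurisdiction, **options), check=True)
```

For example, build_update_command("seattle", ["people"], fast=True) produces the same arguments as the command shown above.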
Validating Data
After the scrape completes, the data will be in the scraped_data folder. Each OpenCivic object that gets saved will be written to scraped_data/<jurisdiction_id>/<type>_<tmp_id>.json.
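A quick way to confirm that a run produced what you expected is to count the files of each type. This is a rough sketch that relies only on the <type>_<tmp_id>.json naming pattern described above. It assumes the temporary ID contains no underscores, which may not hold for every file, and count_scraped_objects is a hypothetical helper name.

```python
from collections import Counter
from pathlib import Path


def count_scraped_objects(data_dir):
    """Count the JSON files in data_dir, grouped by their <type> prefix."""
    counts = Counter()
    for path in Path(data_dir).glob("*.json"):
        # Split on the last underscore so that a type which itself
        # contains an underscore is kept whole. Assumes <tmp_id> has none.
        obj_type, _, _tmp_id = path.stem.rpartition("_")
        counts[obj_type or path.stem] += 1
    return counts
```

Pointing this at your scraped_data/<jurisdiction_id> directory after a people-only run should show mostly one type; a type you did not expect is a hint that something else got scraped.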
Each file holds a JSON-encoded OpenCivic object, a well-documented and well-defined format for government data.
Spot-checking a few entries will show you whether the data looks off or whether things aren’t being categorized properly.
On a modern POSIX system, you should be able to run something like:
$ python -m json.tool $(ls | shuf -n 1) | vim -
Feel free to change vim to whatever editor you prefer for such tasks.
If you do use vim, there’s a helpful JSON plugin.
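If shuf is not available on your system, the same spot check can be done from Python. This is a small sketch using only the standard library; pretty_random_object is a hypothetical helper name, and it assumes it is given the directory that holds the scraped JSON files.

```python
import json
import random
from pathlib import Path


def pretty_random_object(data_dir):
    """Pick one scraped JSON file at random and return (path, pretty-printed text)."""
    files = sorted(Path(data_dir).glob("*.json"))
    if not files:
        raise FileNotFoundError(f"no JSON files found in {data_dir}")
    path = random.choice(files)
    with path.open(encoding="utf-8") as handle:
        obj = json.load(handle)
    # Same layout as `python -m json.tool`: 4-space indent.
    return path, json.dumps(obj, indent=4, sort_keys=True)
```

Printing the returned text, or writing it to a file and opening that in your editor, gives you the same view as the shell pipeline above.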
Here is an example JSON file you’d get if you run the events scraper we created in Writing an Events Scraper, although note that your IDs will be different:
{
    "_id": "efa7ccee-f4d6-11e4-b1eb-843a4bcaaa18",
    "agenda": [
        {
            "description": "Testimony from concerned citizens",
            "media": [
                {
                    "date": "",
                    "links": [
                        {
                            "media_type": "application/pdf",
                            "url": "http://example.com/hearing/testimony.pdf"
                        }
                    ],
                    "note": "Written version of testimony"
                }
            ],
            "notes": [],
            "order": "0",
            "related_entities": [
                {
                    "entity_type": "committee",
                    "name": "Transportation",
                    "note": "participant"
                },
                {
                    "entity_type": "committee",
                    "name": "Environment and Natural Resources",
                    "note": "participant"
                },
                {
                    "entity_type": "person",
                    "name": "Jane Brown",
                    "note": "participant"
                },
                {
                    "entity_type": "person",
                    "name": "Alicia Jones",
                    "note": "participant"
                },
                {
                    "entity_type": "person",
                    "name": "Fred Green",
                    "note": "participant"
                },
                {
                    "entity_type": "bill",
                    "name": "HB101",
                    "note": "consideration"
                }
            ],
            "subjects": [
                "Transportation",
                "Environment"
            ]
        }
    ],
    "all_day": false,
    "classification": "event",
    "description": "",
    "documents": [],
    "end_time": null,
    "extras": {},
    "links": [],
    "location": {
        "coordinates": null,
        "name": "unknown",
        "note": ""
    },
    "media": [
        {
            "date": "",
            "links": [
                {
                    "media_type": "video/mpeg",
                    "url": "http://example.com/hearing/video.mpg"
                }
            ],
            "note": "Video of meeting"
        },
        {
            "date": "",
            "links": [
                {
                    "media_type": "application/pdf",
                    "url": "http://example.com/hearing/minutes.pdf"
                }
            ],
            "note": "Meeting minutes"
        }
    ],
    "name": "Hearing",
    "participants": [
        {
            "entity_type": "committee",
            "name": "Transportation Committee",
            "note": "participant"
        },
        {
            "entity_type": "person",
            "name": "Joe Smith",
            "note": "Hearing Chair"
        }
    ],
    "sources": [
        {
            "note": "",
            "url": "http://example.com"
        }
    ],
    "start_time": "1776-07-04T17:08:00+00:00",
    "status": "confirmed",
    "timezone": "US/Pacific"
}
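Beyond reading files by eye, a few automated checks can catch the most common gaps. The sketch below is a heuristic based only on the keys visible in the example above; it is not the official OpenCivic schema validation, and check_event is a hypothetical helper name. Treating a location name of "unknown" as a problem is an assumption you may want to drop.

```python
def check_event(event):
    """Return a list of human-readable problems found in one scraped event dict."""
    problems = []
    # Keys taken from the example event; empty strings count as missing.
    for key in ("name", "start_time", "timezone"):
        if not event.get(key):
            problems.append(f"missing or empty {key!r}")
    if not event.get("sources"):
        problems.append("no sources listed")
    # Assumption: a placeholder location usually means the scraper
    # failed to find the real one.
    location = event.get("location") or {}
    if location.get("name") in (None, "", "unknown"):
        problems.append("location name is not set")
    for index, item in enumerate(event.get("agenda", [])):
        if not item.get("description"):
            problems.append(f"agenda item {index} has no description")
    return problems
```

Run against the example event, this reports only the placeholder location. Loading each event file with json.load and passing the result to check_event turns the spot check into a quick pass over the whole directory.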