I finally added the script I've been using for a while now to scrape nysenate.gov. I should have added it to the repo long ago, but alas, I didn't. By integrating this information into my library, I give developers access to information they otherwise wouldn't have and create connections that would otherwise be difficult to follow.
The Long (and detailed)
I just added a file to scrape nysenate.gov for senator and committee information. This file depends heavily on the awesome BeautifulSoup library for Python. It makes finding information in pages incredibly easy and almost natural, right out of the box.
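To give a feel for what that looks like, here is a minimal sketch of pulling a senator's details out of a page with BeautifulSoup. The markup, the class names, and the `parse_senator` helper are all made up for illustration; the real nysenate.gov pages are structured differently and the actual script in the repo is more involved.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, NOT the real nysenate.gov page structure.
SAMPLE_HTML = """
<div class="senator">
  <h2 class="name">Jane Doe</h2>
  <span class="party">D</span>
  <ul class="committees">
    <li>Finance</li>
    <li>Education</li>
  </ul>
</div>
"""


def parse_senator(html):
    """Extract name, party, and committee names from one senator block."""
    soup = BeautifulSoup(html, "html.parser")
    block = soup.find("div", class_="senator")
    committee_list = block.find("ul", class_="committees")
    return {
        "name": block.find("h2", class_="name").get_text(strip=True),
        "party": block.find("span", class_="party").get_text(strip=True),
        "committees": [li.get_text(strip=True)
                       for li in committee_list.find_all("li")],
    }


senator = parse_senator(SAMPLE_HTML)
```

The nice part is that `find` and `find_all` read almost like a description of where the data lives on the page.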
The scraping process takes a long time though (remote server response times), so I've been using the standard cPickle module to save data between sessions and to store the final data for the library to use once it's fully assembled. In the future I'll be looking into multi-threading the scraping process, both to make it faster and as an excuse to finally learn multi-threading. I've always wanted to get into that but never really had to before.
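The save-between-sessions idea can be sketched roughly like this. It is an illustration under assumptions, not the script from the repo: the `scrape_all` function, the `fake_fetch` stand-in, and the `senators.dat` file name are hypothetical, and it uses Python 3's `pickle` (on Python 2, `cPickle` offers the same `load` and `dump` calls).

```python
import os
import pickle
import tempfile


def load_checkpoint(path):
    """Return previously saved data, or an empty dict on the first run."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {}


def save_checkpoint(path, data):
    """Write everything scraped so far to disk."""
    with open(path, "wb") as f:
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)


def scrape_all(names, fetch, path):
    """Fetch each name once, skipping anything already in the checkpoint."""
    data = load_checkpoint(path)
    for name in names:
        if name in data:
            continue  # already scraped in an earlier session
        data[name] = fetch(name)
        save_checkpoint(path, data)  # save after every slow request
    return data


# Stand-in for the slow remote request (hypothetical).
calls = []


def fake_fetch(name):
    calls.append(name)
    return {"name": name}


path = os.path.join(tempfile.mkdtemp(), "senators.dat")
first = scrape_all(["Adams", "Baker"], fake_fetch, path)
second = scrape_all(["Adams", "Baker", "Clark"], fake_fetch, path)
```

On the second run only `Clark` is fetched, since the other two come straight out of the saved file.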
Once the information is stored in the .dat file, it can be pulled into the Open Legislation Library. This is primarily done by replacing name references to Senators and Committees with their corresponding objects, filled with the scraped data. By doing this we gain access to two main things:
- Committee Relationships: We can now directly find senators by their committee relationships in ways that we could only infer (after much work) before.
- Senator Details: We can also place the senators in context by having direct access to the parties they are affiliated with, the committees they sit on, and the leadership positions they hold. They aren't just another name in the cloud.
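As a rough sketch of the name-to-object replacement described above (the `Senator` and `Committee` classes, the `build_index` and `link_sponsor` helpers, and the sample data are hypothetical, not the library's actual classes):

```python
class Senator:
    def __init__(self, name, party):
        self.name = name
        self.party = party
        self.committees = []  # Committee objects, filled in by build_index


class Committee:
    def __init__(self, name):
        self.name = name
        self.members = []  # Senator objects


def build_index(scraped):
    """Turn scraped dicts into Senator and Committee objects that point
    at each other. `scraped` maps senator name -> party and committees."""
    senators, committees = {}, {}
    for name, info in scraped.items():
        senator = Senator(name, info["party"])
        senators[name] = senator
        for committee_name in info["committees"]:
            if committee_name not in committees:
                committees[committee_name] = Committee(committee_name)
            committee = committees[committee_name]
            committee.members.append(senator)
            senator.committees.append(committee)
    return senators, committees


def link_sponsor(bill, senators):
    """Swap the sponsor's name string for the Senator object, if known."""
    bill["sponsor"] = senators.get(bill["sponsor"], bill["sponsor"])
    return bill


# Made-up sample data for illustration.
scraped = {
    "Jane Doe": {"party": "D", "committees": ["Finance", "Education"]},
    "John Roe": {"party": "R", "committees": ["Finance"]},
}
senators, committees = build_index(scraped)
bill = link_sponsor({"number": "S1234", "sponsor": "Jane Doe"}, senators)

# From a bill, walk to the sponsor's committee colleagues directly.
sponsor = bill["sponsor"]
colleagues = sorted(
    member.name
    for committee in sponsor.committees
    for member in committee.members
    if member is not sponsor
)
```

Once the sponsor is an object rather than a string, following a bill to its sponsor's committees and fellow committee members is just attribute access.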
Looking Forward
Things are starting to fall into place and are looking up.
That said, there are still a few fairly major issues that need to be dealt with.
- When search results are pulled down, they don't contain very good information. To get all the useful information you need to make separate queries to the server for each object. This is slow and painful to do manually. Fortunately, I'm fixing this on the other end and the problem will go away shortly.
- You still need to make all your queries through the OpenLegislation client. I'd like to provide hooks into the Objects somehow to automatically create the requests and fill themselves when asked by a user. Still looking into this.
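One possible shape for those self-filling objects, purely as a sketch of the idea and not something the library does today: an object holds only its id until someone asks for a missing attribute, then makes the request once and caches the result. The `LazyBill` class and the `fake_fetch` function standing in for the client are hypothetical.

```python
class LazyBill:
    """Holds only an id until an attribute is requested, then fetches."""

    def __init__(self, bill_id, fetch):
        self.bill_id = bill_id
        self._fetch = fetch    # callable that performs the real request
        self._details = None   # filled on first attribute access

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        if self._details is None:
            self._details = self._fetch(self.bill_id)
        try:
            return self._details[name]
        except KeyError:
            raise AttributeError(name)


# Stand-in for a request made through the client (hypothetical).
requests_made = []


def fake_fetch(bill_id):
    requests_made.append(bill_id)
    return {"title": "An example bill", "sponsor": "Jane Doe"}


bill = LazyBill("S1234", fake_fetch)
requests_before = len(requests_made)  # nothing fetched yet
title = bill.title                    # triggers the single request
sponsor = bill.sponsor                # served from the cached details
```

The user never builds a request by hand; the cost is that an innocent-looking attribute access can hide a slow network call.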
Setup.py
Firmant
I swear, I'm still looking to move over there (and self-host my blog); it's just not a simple process (yet), at least as far as I know. It's been busy, but I'll get there.
