Monday, July 19, 2010

Scraper.py

The Short (and sweet)

I finally added in the script I've been using to scrape nysenate.gov for a while now. I should have added this to the repo long ago but alas I didn't. By integrating this information into my library I give developers access to information they otherwise wouldn't have and create connections that would otherwise be difficult to follow.

The Long (and detailed)

I just added a file to scrape nysenate.gov for senator and committee information. This file depends heavily on the awesome BeautifulSoup library for Python. It makes finding information in pages incredibly easy and almost natural, right out of the box.

The scraping process takes a long time though (remote server response times), so I've been using the cPickle standard module to save data between sessions and to store the data for library usage once it's fully assembled. In the future I'll be looking into multi-threading the scraping process to make it faster. I've always wanted to get into multi-threading but never really had to before.
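A minimal sketch of that save-between-sessions idea, not the actual scraper code. The function names and the `fetch` callable are hypothetical, and it uses `pickle` rather than `cPickle` (on Python 3 the C implementation is picked up automatically by `pickle`).

```python
import os
import pickle
import tempfile


def load_checkpoint(path):
    """Return previously saved scrape results, or an empty dict if none exist."""
    if not os.path.exists(path):
        return {}
    with open(path, "rb") as handle:
        return pickle.load(handle)


def save_checkpoint(path, data):
    """Write to a temp file first, then swap it into place over the old file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as handle:
        pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)
    os.replace(tmp_path, path)


def scrape_all(names, fetch, path):
    """Fetch each name not already in the checkpoint, saving after every one."""
    data = load_checkpoint(path)
    for name in names:
        if name in data:
            continue  # already scraped in an earlier session
        data[name] = fetch(name)  # `fetch` is a hypothetical page scraper
        save_checkpoint(path, data)
    return data
```

Saving after every fetch costs a little disk time, but it means an interrupted run only loses the page it was working on.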

Once the information is stored into the .dat file it can be pulled into the Open Legislation Library. This is primarily done by replacing name references to Senators and Committees with their corresponding objects, filled with the scraped data. By doing this we gain access to two main things:
  • Committee Relationships: We can now directly find senators by their committee relationships in ways that we could only infer (after much work) before.
  • Senator Details: We can also place the senators in context by having direct access to the parties they are affiliated with, the committees they sit on, and the leadership positions they hold. They aren't just another name in the cloud.
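
To illustrate the name-to-object replacement, here is a rough sketch. The `Senator` and `Committee` classes and their fields are assumptions made for the example and are not the library's real classes.

```python
from dataclasses import dataclass, field


# eq=False keeps identity comparison, which avoids recursing through the
# senator <-> committee cycle created by link() below.
@dataclass(eq=False)
class Senator:
    name: str
    party: str = ""
    committees: list = field(default_factory=list)


@dataclass(eq=False)
class Committee:
    name: str
    chair: str = ""
    members: list = field(default_factory=list)  # names before link(), objects after


def link(senators, committees):
    """Swap member name strings for Senator objects, in place."""
    for committee in committees.values():
        committee.members = [senators[n] for n in committee.members if n in senators]
        for senator in committee.members:
            senator.committees.append(committee)


def senators_sharing_committee(senator):
    """Names of everyone who sits on at least one committee with `senator`."""
    names = {m.name for c in senator.committees for m in c.members}
    return sorted(names - {senator.name})
```

Once linked, a question like "who serves alongside this senator?" becomes a direct traversal instead of something inferred from name matches.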
Looking Forward

Things are starting to fall in place and things are looking up.
There are still a few fairly major issues that need to be dealt with.
  • When search results are pulled down, they don't contain very good information. To get all the useful information you need to make separate queries to the server for each object. This is slow and painful to do manually. Fortunately, I'm fixing this on the other end and the problem will go away shortly.
  • You still need to make all your queries through the OpenLegislation client. I'd like to provide hooks into the Objects somehow to automatically create the requests and fill themselves when asked by a user. Still looking into this.
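
One possible shape for those self-filling objects, as a sketch only. `LazyRecord` and the `fetch` callable are hypothetical names, and nothing here reflects how the OpenLegislation client actually issues requests.

```python
class LazyRecord:
    """Holds an id and fetches the full record the first time a field is read."""

    def __init__(self, record_id, fetch):
        self._record_id = record_id
        self._fetch = fetch    # callable: record_id -> dict of fields
        self._fields = None    # filled in on first access

    def __getattr__(self, name):
        # Only called when normal lookup fails, i.e. for the remote fields.
        if name.startswith("_"):
            raise AttributeError(name)
        if self._fields is None:
            self._fields = self._fetch(self._record_id)
        try:
            return self._fields[name]
        except KeyError:
            raise AttributeError(name) from None
```

With something like this, search results could hand back lightweight objects and the extra request would only happen for the ones a user actually inspects.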
Setup.py

Those issues aside, the library is in perfectly usable condition. I need to finish up my setup.py file so I can add it to PyPI. I'll probably speak with Rob about this; he's been very helpful throughout this summer.

Firmant

I swear, I'm still looking to move over there (and self host my blog); it's just not a simple process (yet)...at least as far as I know. It's been busy but I'll get there.

Friday, July 9, 2010

Upcoming

This blog will be moving to Firmant in the coming days so that I can assist Rob Escriva in its development and explore a different kind of blogging experience.

I'll keep everyone posted.

Monday, June 7, 2010

Documentation and Accessibility

Documentation Progress

So as I mentioned before, I finished up a version of the OpenLegislation library that works really well for bills and transcripts (but transcripts are boring right now) in the last couple days. Since then I've further developed the documentation to have an examples page, uploaded it to my web server, and created links between my git repo and my documentation.

Accessibility

In order to make my script more accessible I am looking into creating a setup.py file to enable my library to be distributed through PyPI and the easy_install interface. Python has a really cool framework for doing this type of thing, and I hear Rob will be posting some of his insight into this area soon. I'll put up my thoughts when I've got more experience with it.

Web App?

Then I got to thinking: why make people install Python and your library just to try it out? Could I make the library available in a command line type environment on a web app for anyone to try?

I consulted the #RCOS IRC (always a good idea) and Moorthy thought it was an interesting idea. Rob suggested I use pyparser to create a domain specific language (DSL) that covers the use cases for my library. Good idea.

So today I learned to use pyparser (cool tool) and developed a fairly nice and clean language for recognizing library calls. I even had time to hook that parsing up with eval expressions so I could translate strings matching the DSL into actual commands and store/output the results!
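
A rough sketch of the string-to-command idea, under stated assumptions: it swaps in a plain regular expression for pyparser and a whitelist of callables for eval, and the `name = func(arg, ...)` call syntax is made up for illustration rather than taken from my actual DSL.

```python
import re

# Matches `func(args)` with an optional `target = ` in front.
CALL = re.compile(
    r"^\s*(?:(?P<target>[A-Za-z_]\w*)\s*=\s*)?"
    r"(?P<func>[A-Za-z_]\w*)\s*\((?P<args>[^()]*)\)\s*$"
)


def run(line, commands, session):
    """Parse one call, run it from the whitelist, and store the result if named."""
    match = CALL.match(line)
    if match is None:
        raise ValueError("not a recognized call: %r" % line)
    func = commands.get(match.group("func"))
    if func is None:
        raise ValueError("unknown command: %s" % match.group("func"))
    args = []
    for raw in filter(None, (a.strip() for a in match.group("args").split(","))):
        # A quoted word is a literal string; a bare word is a stored result.
        if len(raw) >= 2 and raw[0] in "'\"" and raw[-1] == raw[0]:
            args.append(raw[1:-1])
        elif raw in session:
            args.append(session[raw])
        else:
            raise ValueError("unknown name: %s" % raw)
    result = func(*args)
    if match.group("target"):
        session[match.group("target")] = result
    return result
```

The `session` dict is where per-user state would live, and because only whitelisted names are callable, a string from a web form cannot reach arbitrary Python. Quoted strings containing commas are not handled in this sketch.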

In the coming days I'm going to look at having some persistence tied to browser sessions and pushing things back and forth with Ajax. I wasn't initially sure how viable this idea might be, but it looks like it's more viable than I thought. If all works out well from here on out I think I'll have taken a big step towards try it before you buy it!

Details on how I did what? I'm still figuring things out; I'll put them up when I get parts done (hopefully in the next couple days).

Saturday, June 5, 2010

NYSS OpenLegislation Library

It's been a long time since I've posted an update here.

Open Legislation Library

Just today I split the OpenLegislation Library off into a separate project since people in the NY Senate will now be looking at/using it. You can find it here. I've found out at least one other person is writing a library for OpenLegislation as well and hope to get in contact with him.

Documentation

I've also prepared what (I think anyway) is some pretty solid documentation on the library, which can be found in the docs/build/html folder of the repo. I built it using a tool called Sphinx, which is very awesome and satisfied (almost) every need easily and out of the box. Not to mention it produces documentation that looks clean, consistent, and professional. I'd definitely recommend it to others as a Python documentation tool of choice.

Web App?

Since not everyone wants to pull it down (with git) and set up a python environment to play with the library, I am thinking about making a simulated environment in a web app so people can play around with it and see how it works (and how easy it is). I'm not sure how difficult this would be but I'm going to look into it.

Floodlight

I've started work on the Floodlight API now. First step was scraping some information not available in the OpenLegislation API off of the Senate Drupal site with Beautiful Soup. Let me make special note of the fact that Beautiful Soup is AWESOME. I could not have asked for a cleaner or easier way to scrape data off those web pages. So far I've scraped a senator list, a committee list, and a list of senators on each committee (as well as the committee chair). I'm currently working to scrape all the senator contact information, but it's got really wild and inconsistent formatting that is making it difficult and full of special cases. It's getting there though.

I'm not sure what other information I should scrape, but if I think of something, Beautiful Soup will be there to make it easy.

Restructuring Data

Now that the OpenLegislation library is in good usable condition I will be using it to do a batch job that pulls all of OpenLegislation's information and restructures it for local storage. I think that a lot can be done simply by providing better data organization and linkages. Even more can be done by inserting the information I've scraped into the mix and by providing calculated statistics in some of the views. What exactly I plan to do in this area will largely be the subject of next week's work.

Spreading the Word

In the meantime, I'm going to work on spreading the word in the CIO office that the library has been developed and have some of the developers working on/with OpenLegislation take a look at it. I'd encourage anyone else to head over and take a look at it too, especially the documentation.

Thursday, April 15, 2010

Updates and the RCOS Mid-Semester Presentation

So I just put together our presentation for tomorrow's RCOS meeting. There are a lot of presentations slated for the meeting, so greater elaborations on some of the topics will be posted on the blog in the coming days in lieu of having time to present them to the group.

By the way, to avoid confusion: we're renaming the project very slightly to FloodLight instead of Floodlight because we thought it looked a bit better. The final web application using all of our back end tools will however be called FloodlightProject.org (FloodLight.org was already taken, oh well).

Our presentation is also available for viewing on SlideShare as shown below:
As a note, we did mean to have our first presentation up when we first posted it, but Graylin had issues uploading it. For your viewing pleasure, I've uploaded it and included it in this post as follows.

Monday, April 12, 2010

Clarifications

One of the major difficulties we've had with Floodlight to date has been making sense of some of the legislative terminology used in the NYS Senate and thus OpenLegislation.

One feature of bills in OpenLegislation is their action list. Actions are what they sound like: they enumerate every step that the bill moves through, from introduction, to committee, through the final vote. But the action lists are filled with formal legislative vocabulary, and in order for us to put in the business logic to place a bill in a certain stage of the law making process, we need to understand what every step is.

However, the inner workings of the NYS Senate aren't particularly well documented online. When I spoke with Dean Hill, he directed me to a peer of his on the legislative staff, Mitzi Hart, to answer my questions.

Her clarifications have been immensely helpful, and although some further clarifications on the software end are still needed, we have a lot more of the information we needed to move on in our code.



Later today we will be posting a more formal plan and schedule for the project, as we feel that we haven't been publicly organized enough as of yet. We will also be posting up a layout of all of our planned software, some of which is in the works as of now, and some of which has yet to be started.

Monday, April 5, 2010

Django + WSGI, Ubuntu + OpenSSH

Tonight we finally fixed up a couple issues that had become major time sinks and obstacles to us getting anything done.

Problem: Django + WSGI
First, we discovered why we couldn't access our CSS files off the server. We had deployed Django on the Apache server built into Ubuntu via mod_wsgi on a name-based virtual host. The WSGIScriptAlias directive was taking all of our requests and running them through the urls file in our Django app. Upon finding no match, the server would return an empty and/or error page detailing the problem.

Solution
To solve this we employed the following Alias and AliasMatch directives to catch requests for static files before they can be dispatched to WSGI for handling.
AliasMatch /media/(.*)
Alias /favicon.ico
Alias /robots.txt

Problem: Inability to Pull/Push
Cihan has spent the last couple weeks being unable to push/pull on the git repository. She set up nearly a dozen different public key identities on github but frustratingly none of them would validate.

Solution
Cihan runs Ubuntu and the Github connection seems to use OpenSSH. There is apparently an open issue here where identities aren't loaded into OpenSSH and recognized by a default scan of the ~/.ssh directory. Instead, a manual ssh-add is required to force the scan and load the identities. After running this command everything works fine.

By adding the code provided by Github for automatic identity confirmation to her .bashrc file we were able to fix this issue (I did this first thing, which explains why I never had this issue).

Additional Successes
Unfortunately, because this was the last fix we worked out, she was unable to take proper credit for setting up the Django Admin tool (I pushed the changes since she could not), so we'll give it to her now. Not that enabling the Admin tool would have been a big deal if not for the problems caused by our deployment solution (my fault there).

Such is life. Work continues in preparation for our presentation on Friday.