Example 01: Basic Queries

Retrieving data from Socrata databases using sodapy

Setup


In [1]:
import os
import pandas as pd
import numpy as np

from sodapy import Socrata

Find some data

Though any organization can host its own data with Socrata's tools, Socrata also hosts several open datasets itself:

https://opendata.socrata.com/browse

The following search options can help you find some great datasets for getting started:

  • Limit to data sets (pre-analyzed stuff is great, but if you're using sodapy you probably want the raw numbers!)
  • Sort by "Most Accessed"

Here's a link that applies those filters automatically.

Click on a few listings until you find one that looks interesting. Then click API and extract the following bits of data from the displayed URL.

https://<opendata.socrata.com>/dataset/Santa-Fe-Contributors/<f92i-ik66>.json


In [4]:
# Enter the information from those sections here
socrata_domain = 'opendata.socrata.com'
socrata_dataset_identifier = 'f92i-ik66'

# App Tokens can be generated by creating an account at https://opendata.socrata.com/signup
# Tokens are optional (`None` can be used instead), though requests will be rate limited.
#
# If you choose to use a token, run the following command on the terminal (or add it to your .bashrc)
# $ export SODAPY_APPTOKEN=<token>
socrata_token = os.environ.get("SODAPY_APPTOKEN")

Get all the data


In [5]:
client = Socrata(socrata_domain, socrata_token)
print("Domain: {domain}\nSession: {session}\nURI Prefix: {uri_prefix}".format(**client.__dict__))


Domain: opendata.socrata.com
Session: <requests.sessions.Session object at 0x7f8a0363d518>
URI Prefix: https://

In [6]:
results = client.get(socrata_dataset_identifier)
df = pd.DataFrame.from_dict(results)
df.head()


Out[6]:
amount amount_2 city company date job name name2 party recipient street update
0 6100 Donation of $6,100 to Presidential elections 2008 Santa Fe NM Thornburg Companies Q3/2008 Finance Garrett Thornburg Garrett Thornburg Democrat Barack Obama 150 WASHINGTON AVE Updated
1 5600 Donation of $5,600 to Presidential elections 2008 Santa Fe NM FOREST REALTY Q3/2007 REAL ESTATE Michael Daly Michael Daly Democrat Bill Richardson 305 BROWNELL HOWLAND RD Updated
2 5100 Donation of $5,100 to Presidential elections 2008 Santa Fe NM James Currey Publishers Q3/2008 Publisher Douglas Johnson Douglas Johnson Democrat Barack Obama 48 WOODS LOOP Updated
3 5100 Donation of $5,100 to Presidential elections 2008 Santa Fe NM Academy for Educational Developm Q3/2008 Chief of Party Lynn Mortensen Lynn Mortensen Democrat Barack Obama 4 AVILA RD Updated
4 5100 Donation of $5,100 to Presidential elections 2008 Santa Fe NM Self employed Q1/2008 Sculptor Ted Flicker Ted Flicker Democrat Barack Obama 164 TANO RD Updated

Success! Let's do some minimal cleaning and analysis just to justify the bandwidth used.


In [7]:
df['amount'] = df['amount'].astype(float)

In [8]:
by_candidate = df.groupby('recipient').amount.agg(['sum', 'mean', 'size']).round(0)
by_candidate.sort_values('sum', ascending=False).head()


Out[8]:
sum mean size
recipient
Bill Richardson 1020748.0 2442.0 418.0
Barack Obama 602865.0 2491.0 242.0
Hillary Clinton 185887.0 2324.0 80.0
John McCain 60741.0 2025.0 30.0
John Edwards 25550.0 2555.0 10.0

Multiple Data Sources

That was much less annoying than downloading a CSV (though you can always save the dataframe to one if you'd like). Where sodapy really shines is in grabbing different data sources and mashing them together.
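Saving is a one-liner. Here's a quick round-trip sketch using a tiny stand-in frame and an arbitrary filename; in the notebook you'd just call `to_csv` on the `df` from above:

```python
import pandas as pd

# stand-in for the contributions dataframe built above
sample = pd.DataFrame({"recipient": ["Barack Obama"], "amount": [6100.0]})

# index=False skips writing the row-index column
sample.to_csv("contributions.csv", index=False)

# read it back to confirm the round trip
round_trip = pd.read_csv("contributions.csv")
```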

For example, let's compare 311 calls between New York City and Chattanooga, TN. Socrata makes it so easy, you'd be crazy not to do it!


In [12]:
nyc_domain = 'data.cityofnewyork.us'
nyc_dataset_identifier = 'fhrw-4uyv'
nyc_client = Socrata(nyc_domain, socrata_token)
nyc_results = nyc_client.get(nyc_dataset_identifier)
nyc_df = pd.DataFrame.from_dict(nyc_results)
print(nyc_df.shape)

chatt_domain = 'data.chattlibrary.org'
chatt_dataset_identifier = 'sf89-4qcw'
chatt_client = Socrata(chatt_domain, socrata_token)
chatt_results = chatt_client.get(chatt_dataset_identifier)
chatt_df = pd.DataFrame.from_dict(chatt_results)
print(chatt_df.shape)


(1000, 48)
(1000, 28)

In [11]:
# extract tree-related complaints
tree_related = pd.concat([
    nyc_df.complaint_type.str.contains(r'[Tt]ree').value_counts(),
    chatt_df.description.str.contains(r'[Tt]ree').value_counts()
], axis=1, keys=['nyc', 'chatt'])
tree_related.div(tree_related.sum()).round(2)


Out[11]:
nyc chatt
False 0.93 0.98
True 0.07 0.02

Looks like trees make up a higher percentage of NYC's complaints than of Chattanooga's.

Note that we can only talk about percentages, since our query results got truncated to 1,000 rows.
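As a quick preview, sodapy can page past that cap: `client.get` accepts `limit` and `offset` keyword arguments, which map to SoQL's `$limit`/`$offset` parameters. A minimal sketch, where `page_offsets` is a hypothetical helper (not part of sodapy) and the row counts are made up:

```python
def page_offsets(total_rows, page_size):
    """Return the offsets needed to page through total_rows records, page_size at a time."""
    return list(range(0, total_rows, page_size))

# e.g. to pull 5,000 rows in pages of 2,000 (needs a live client and connection):
# results = []
# for offset in page_offsets(5000, 2000):
#     results.extend(client.get(socrata_dataset_identifier, limit=2000, offset=offset))
```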

What if we want to be smarter about what we ask for, so that we can get 100% of the subset of data we're most interested in? That's the subject of a future example, so stay tuned!

If you want to find more data sets, here's Socrata's data finder:

https://www.opendatanetwork.com/search