Example 01: Basic Queries

Retrieving data from Socrata databases using sodapy

Setup


In [1]:
import os
import pandas as pd
import numpy as np

from sodapy import Socrata

Find some data

Though any organization can host its own data with Socrata's tools, Socrata also hosts several open datasets itself:

https://opendata.socrata.com/browse

The following search options can help you find some great datasets for getting started:

  • Limit to data sets (pre-analyzed stuff is great, but if you're using sodapy you probably want the raw numbers!)
  • Sort by "Most Accessed"

Here's a link that applies those filters automatically.

Click on a few listings until you find one that looks interesting. Then click API and extract the following bits of data from the displayed URL.

https://<opendata.socrata.com>/dataset/Santa-Fe-Contributors/<f92i-ik66>.json


In [4]:
# Enter the information from those sections here
socrata_domain = 'opendata.socrata.com'
socrata_dataset_identifier = 'f92i-ik66'

# App Tokens can be generated by creating an account at https://opendata.socrata.com/signup
# Tokens are optional (`None` can be used instead), though requests will be rate limited.
#
# If you choose to use a token, run the following command on the terminal (or add it to your .bashrc)
# $ export SODAPY_APPTOKEN=<token>
socrata_token = os.environ.get("SODAPY_APPTOKEN")

Get all the data


In [5]:
client = Socrata(socrata_domain, socrata_token)
print("Domain: {domain}\nSession: {session}\nURI Prefix: {uri_prefix}".format(**client.__dict__))


Domain: opendata.socrata.com
Session: <requests.sessions.Session object at 0x7f8a0363d518>
URI Prefix: https://

In [6]:
results = client.get(socrata_dataset_identifier)
df = pd.DataFrame.from_dict(results)
df.head()


Out[6]:
amount amount_2 city company date job name name2 party recipient street update
0 6100 Donation of $6,100 to Presidential elections 2008 Santa Fe NM Thornburg Companies Q3/2008 Finance Garrett Thornburg Garrett Thornburg Democrat Barack Obama 150 WASHINGTON AVE Updated
1 5600 Donation of $5,600 to Presidential elections 2008 Santa Fe NM FOREST REALTY Q3/2007 REAL ESTATE Michael Daly Michael Daly Democrat Bill Richardson 305 BROWNELL HOWLAND RD Updated
2 5100 Donation of $5,100 to Presidential elections 2008 Santa Fe NM James Currey Publishers Q3/2008 Publisher Douglas Johnson Douglas Johnson Democrat Barack Obama 48 WOODS LOOP Updated
3 5100 Donation of $5,100 to Presidential elections 2008 Santa Fe NM Academy for Educational Developm Q3/2008 Chief of Party Lynn Mortensen Lynn Mortensen Democrat Barack Obama 4 AVILA RD Updated
4 5100 Donation of $5,100 to Presidential elections 2008 Santa Fe NM Self employed Q1/2008 Sculptor Ted Flicker Ted Flicker Democrat Barack Obama 164 TANO RD Updated

Success! Let's do some minimal cleaning and analysis just to justify the bandwidth used.


In [7]:
df['amount'] = df['amount'].astype(float)

In [8]:
by_candidate = df.groupby('recipient').amount.agg(['sum', 'mean', 'size']).round(0)
by_candidate.sort_values('sum', ascending=False).head()


Out[8]:
sum mean size
recipient
Bill Richardson 1020748.0 2442.0 418.0
Barack Obama 602865.0 2491.0 242.0
Hillary Clinton 185887.0 2324.0 80.0
John McCain 60741.0 2025.0 30.0
John Edwards 25550.0 2555.0 10.0

Multiple Data Sources

That was much less annoying than downloading a CSV (though you can always save the dataframe to one if you'd like). Where sodapy really shines is in grabbing different data sources and mashing them together.
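Saving is a one-liner. Here's a quick round-trip sketch using a tiny stand-in frame and an arbitrary filename; in the notebook you'd just call `to_csv` on the `df` from above:

```python
import pandas as pd

# stand-in for the contributions dataframe built above
sample = pd.DataFrame({"recipient": ["Barack Obama"], "amount": [6100.0]})

# index=False skips writing the row-index column
sample.to_csv("contributions.csv", index=False)

# read it back to confirm the round trip
round_trip = pd.read_csv("contributions.csv")
```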

For example, let's compare 311 calls between New York City and Chattanooga, TN. Socrata makes it so easy, you'd be crazy not to do it!


In [12]:
nyc_domain = 'data.cityofnewyork.us'
nyc_dataset_identifier = 'fhrw-4uyv'
nyc_client = Socrata(nyc_domain, socrata_token)
nyc_results = nyc_client.get(nyc_dataset_identifier)
nyc_df = pd.DataFrame.from_dict(nyc_results)
print(nyc_df.shape)

chatt_domain = 'data.chattlibrary.org'
chatt_dataset_identifier = 'sf89-4qcw'
chatt_client = Socrata(chatt_domain, socrata_token)
chatt_results = chatt_client.get(chatt_dataset_identifier)
chatt_df = pd.DataFrame.from_dict(chatt_results)
print(chatt_df.shape)


(1000, 48)
(1000, 28)

In [11]:
# extract tree-related complaints
tree_related = pd.concat([
    nyc_df.complaint_type.str.contains(r'[Tt]ree').value_counts(),
    chatt_df.description.str.contains(r'[Tt]ree').value_counts()
], axis=1, keys=['nyc', 'chatt'])
tree_related.div(tree_related.sum()).round(2)


Out[11]:
nyc chatt
False 0.93 0.98
True 0.07 0.02

Looks like trees make up a higher percentage of NYC's complaints than of Chattanooga's.

Note that we can only talk about percentages, since our query results got truncated to 1,000 rows.
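As a quick preview, sodapy can page past that cap: `client.get` accepts `limit` and `offset` keyword arguments, which map to SoQL's `$limit`/`$offset` parameters. A minimal sketch, where `page_offsets` is a hypothetical helper (not part of sodapy) and the row counts are made up:

```python
def page_offsets(total_rows, page_size):
    """Return the offsets needed to page through total_rows records, page_size at a time."""
    return list(range(0, total_rows, page_size))

# e.g. to pull 5,000 rows in pages of 2,000 (needs a live client and connection):
# results = []
# for offset in page_offsets(5000, 2000):
#     results.extend(client.get(socrata_dataset_identifier, limit=2000, offset=offset))
```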

What if we want to be smarter about what we ask for, so that we can get 100% of the subset of data we're most interested in? That's the subject of a future example, so stay tuned!

If you want to find more data sets, here's Socrata's data finder:

https://www.opendatanetwork.com/search