Example 02: SoSQL Queries

Constructing custom queries to conserve bandwith and computational resources

Setup



In [1]:

    
import os
# Note: we don't need Pandas
# Filters allow you to accomplish many basic operations automatically

from sodapy import Socrata

Find Some Data

As in the first example, I'm using the Santa Fe political contribution dataset.

https://opendata.socrata.com/dataset/Santa-Fe-Contributors/f92i-ik66.json



In [2]:

    
socrata_domain = 'opendata.socrata.com'
socrata_dataset_identifier = 'f92i-ik66'

# If you choose to use a token, run the following command on the terminal (or add it to your .bashrc)
# $ export SODAPY_APPTOKEN=<token>
socrata_token = os.environ.get("SODAPY_APPTOKEN")



In [3]:

    
client = Socrata(socrata_domain, socrata_token)

Use Metadata to Plan Your Query

You've probably looked through the column names and descriptions in the web UI, but it can be nice to have them right in your workspace as well.



In [4]:

    
metadata = client.get_metadata(socrata_dataset_identifier)
[x['name'] for x in metadata['columns']]









    Out[4]:





['NAME2',
 'AMOUNT-2',
 'PARTY',
 'NAME',
 'JOB',
 'COMPANY',
 'UPDATE',
 'DATE',
 'RECIPIENT',
 'AMOUNT',
 'STREET',
 'CITY']



In [5]:

    
meta_amount = [x for x in metadata['columns'] if x['name'] == 'AMOUNT'][0]
meta_amount









    Out[5]:





{'cachedContents': {'average': '2433.2075',
  'largest': '6100',
  'non_null': 800,
  'null': 0,
  'smallest': '1250',
  'sum': '1946566',
  'top': [{'count': 426, 'item': '2300'},
   {'count': 72, 'item': '4600'},
   {'count': 34, 'item': '1500'},
   {'count': 33, 'item': '2000'},
   {'count': 20, 'item': '1250'},
   {'count': 14, 'item': '2100'},
   {'count': 12, 'item': '1300'},
   {'count': 11, 'item': '3300'},
   {'count': 10, 'item': '1750'},
   {'count': 10, 'item': '2500'},
   {'count': 7, 'item': '1400'},
   {'count': 6, 'item': '1800'},
   {'count': 5, 'item': '1550'},
   {'count': 4, 'item': '2800'},
   {'count': 4, 'item': '2050'},
   {'count': 4, 'item': '1450'},
   {'count': 4, 'item': '1950'},
   {'count': 4, 'item': '2200'},
   {'count': 4, 'item': '1350'},
   {'count': 4, 'item': '1700'}]},
 'dataTypeName': 'number',
 'fieldName': 'amount',
 'format': {'aggregate': 'sum'},
 'id': 2303155,
 'name': 'AMOUNT',
 'position': 10,
 'renderTypeName': 'number',
 'tableColumnId': 1263296,
 'width': 142}

Efficiently Query for Data

Restrict rows to above-average donations



In [6]:

    
# Get the average from the metadata. Note that it's a string by default
meta_amount['cachedContents']['average']









    Out[6]:





'2433.2075'



In [7]:

    
# Use the 'where' argument to filter the data before downloading it
results = client.get(socrata_dataset_identifier, where="amount >= 2433")
print("Total number of non-null results: {}".format(meta_amount['cachedContents']['non_null']))
print("Number of results downloaded: {}".format(len(results)))
results[:3]









    



Total number of non-null results: 800
Number of results downloaded: 143






    Out[7]:





[{'amount': '6100',
  'amount_2': 'Donation of $6,100 to Presidential elections 2008',
  'city': 'Santa Fe NM',
  'company': 'Thornburg Companies',
  'date': 'Q3/2008',
  'job': 'Finance',
  'name': 'Garrett Thornburg',
  'name2': 'Garrett Thornburg',
  'party': 'Democrat',
  'recipient': 'Barack Obama',
  'street': '150 WASHINGTON AVE',
  'update': 'Updated'},
 {'amount': '5600',
  'amount_2': 'Donation of $5,600 to Presidential elections 2008',
  'city': 'Santa Fe NM',
  'company': 'FOREST REALTY',
  'date': 'Q3/2007',
  'job': 'REAL ESTATE',
  'name': 'Michael Daly',
  'name2': 'Michael Daly',
  'party': 'Democrat',
  'recipient': 'Bill Richardson',
  'street': '305 BROWNELL HOWLAND RD',
  'update': 'Updated'},
 {'amount': '5100',
  'amount_2': 'Donation of $5,100 to Presidential elections 2008',
  'city': 'Santa Fe NM',
  'company': 'James Currey Publishers',
  'date': 'Q3/2008',
  'job': 'Publisher',
  'name': 'Douglas Johnson',
  'name2': 'Douglas Johnson',
  'party': 'Democrat',
  'recipient': 'Barack Obama',
  'street': '48 WOODS LOOP',
  'update': 'Updated'}]

Restrict columns and order rows

Often, you know which columns you want, so you can further simplify the download.

It can also be valuable to have results in order, so that you can quickly grab the largest or smallest.



In [8]:

    
results = client.get(socrata_dataset_identifier,
                     where="amount < 2433",
                     select="amount, job",
                     order="amount ASC")
results[:3]









    Out[8]:





[{'amount': '1250', 'job': 'QA Architect'},
 {'amount': '1250', 'job': 'Artist'},
 {'amount': '1250', 'job': 'investor'}]

Perform basic operations

You can even accomplish some basic analytics operations like finding sums.

If you're planning on doing further processing, note that the numeric outputs are strings by default.



In [10]:

    
results = client.get(socrata_dataset_identifier,
                     group="recipient",
                     select="sum(amount), recipient",
                     order="sum(amount) DESC")
results









    Out[10]:





[{'recipient': 'Bill Richardson', 'sum_amount': '1020748'},
 {'recipient': 'Barack Obama', 'sum_amount': '602865'},
 {'recipient': 'Hillary Clinton', 'sum_amount': '185887'},
 {'recipient': 'John McCain', 'sum_amount': '60741'},
 {'recipient': 'John Edwards', 'sum_amount': '25550'},
 {'recipient': 'Christopher Dodd', 'sum_amount': '14200'},
 {'recipient': 'Rudy Giuliani', 'sum_amount': '13800'},
 {'recipient': 'Ron Paul', 'sum_amount': '11500'},
 {'recipient': 'Mitt Romney', 'sum_amount': '4600'},
 {'recipient': 'Joe Biden', 'sum_amount': '4600'},
 {'recipient': 'Fred Thompson', 'sum_amount': '2075'}]

Break download into managable chunks

Sometimes you do want all the data, but it would be too big for one download.

By default, all queries have a limit of 1000 rows, but you can manually set it higher or lower. If you want to loop through results, just use offset



In [11]:

    
results = client.get(socrata_dataset_identifier, limit=6, select="name, amount")
results









    Out[11]:





[{'amount': '6100', 'name': 'Garrett Thornburg'},
 {'amount': '5600', 'name': 'Michael Daly'},
 {'amount': '5100', 'name': 'Douglas Johnson'},
 {'amount': '5100', 'name': 'Lynn Mortensen'},
 {'amount': '5100', 'name': 'Ted Flicker'},
 {'amount': '4906', 'name': 'Jere Smith'}]



In [11]:

    
loop_size = 3
num_loops = 2

for i in range(num_loops):
    results = client.get(socrata_dataset_identifier,
                         select="name, amount",
                         limit=loop_size,
                         offset=loop_size * i)
    print("\n> Loop number: {}".format(i))
    
    # This simply formats the output nicely
    for result in results:
        print(result)









    



> Loop number: 0
{'amount': '6100', 'name': 'Garrett Thornburg'}
{'amount': '5600', 'name': 'Michael Daly'}
{'amount': '5100', 'name': 'Douglas Johnson'}

> Loop number: 1
{'amount': '5100', 'name': 'Lynn Mortensen'}
{'amount': '5100', 'name': 'Ted Flicker'}
{'amount': '4906', 'name': 'Jere Smith'}

Query strings

All of the queries above were made with method parameters, but you could also pass all the parameters at once in a SQL-like format



In [13]:

    
query = """
select 
    name, 
    amount
where
    amount > 1000
    and amount < 2000
limit
    5
"""

results = client.get(socrata_dataset_identifier, query=query)
results









    Out[13]:





[{'amount': '1995', 'name': 'Shelley Silverstein'},
 {'amount': '1974', 'name': 'Marvin Godner'},
 {'amount': '1954', 'name': 'Stuart Ashman'},
 {'amount': '1950', 'name': 'Hope Curtis'},
 {'amount': '1950', 'name': 'David Harwell'}]

Free text search

My brother just got a dog named Slider, so we were curious about how many other New York City dogs had that name.

Searches with q match anywhere in the row, which allows you to quickly search through data with several free text columns of interest.



In [20]:

    
nyc_dogs_domain = 'data.cityofnewyork.us'
nyc_dogs_dataset_identifier = 'nu7n-tubp'

nyc_dogs_client = Socrata(nyc_dogs_domain, socrata_token)
results = nyc_dogs_client.get(nyc_dogs_dataset_identifier,
                              q="Slider",
                              select="animalname, breedname")
results









    Out[20]:





[{'animalname': 'SLIDER', 'breedname': 'American Pit Bull Terrier/Pit Bull'},
 {'animalname': 'SLIDER', 'breedname': 'Cavalier King Charles Spaniel'},
 {'animalname': 'SLIDER ', 'breedname': 'Shih Tzu'},
 {'animalname': 'SLIDER', 'breedname': 'Wheaton Terrier'}]

Going Further

There's plenty more to do! Check out Queries using SODA for additional functionality