Example 02: SoSQL Queries

Constructing custom queries to conserve bandwith and computational resources

Setup


In [1]:
import os
# Note: we don't need Pandas
# Filters allow you to accomplish many basic operations automatically

from sodapy import Socrata

Find Some Data

As in the first example, I'm using the Santa Fe political contribution dataset.

https://opendata.socrata.com/dataset/Santa-Fe-Contributors/f92i-ik66.json


In [2]:
socrata_domain = 'opendata.socrata.com'
socrata_dataset_identifier = 'f92i-ik66'

# If you choose to use a token, run the following command on the terminal (or add it to your .bashrc)
# $ export SODAPY_APPTOKEN=<token>
socrata_token = os.environ.get("SODAPY_APPTOKEN")

In [3]:
client = Socrata(socrata_domain, socrata_token)

Use Metadata to Plan Your Query

You've probably looked through the column names and descriptions in the web UI, but it can be nice to have them right in your workspace as well.


In [4]:
metadata = client.get_metadata(socrata_dataset_identifier)
[x['name'] for x in metadata['columns']]


Out[4]:
['NAME2',
 'AMOUNT-2',
 'PARTY',
 'NAME',
 'JOB',
 'COMPANY',
 'UPDATE',
 'DATE',
 'RECIPIENT',
 'AMOUNT',
 'STREET',
 'CITY']

In [5]:
meta_amount = [x for x in metadata['columns'] if x['name'] == 'AMOUNT'][0]
meta_amount


Out[5]:
{'cachedContents': {'average': '2433.2075',
  'largest': '6100',
  'non_null': 800,
  'null': 0,
  'smallest': '1250',
  'sum': '1946566',
  'top': [{'count': 426, 'item': '2300'},
   {'count': 72, 'item': '4600'},
   {'count': 34, 'item': '1500'},
   {'count': 33, 'item': '2000'},
   {'count': 20, 'item': '1250'},
   {'count': 14, 'item': '2100'},
   {'count': 12, 'item': '1300'},
   {'count': 11, 'item': '3300'},
   {'count': 10, 'item': '1750'},
   {'count': 10, 'item': '2500'},
   {'count': 7, 'item': '1400'},
   {'count': 6, 'item': '1800'},
   {'count': 5, 'item': '1550'},
   {'count': 4, 'item': '2800'},
   {'count': 4, 'item': '2050'},
   {'count': 4, 'item': '1450'},
   {'count': 4, 'item': '1950'},
   {'count': 4, 'item': '2200'},
   {'count': 4, 'item': '1350'},
   {'count': 4, 'item': '1700'}]},
 'dataTypeName': 'number',
 'fieldName': 'amount',
 'format': {'aggregate': 'sum'},
 'id': 2303155,
 'name': 'AMOUNT',
 'position': 10,
 'renderTypeName': 'number',
 'tableColumnId': 1263296,
 'width': 142}

Efficiently Query for Data

Restrict rows to above-average donations


In [6]:
# Get the average from the metadata. Note that it's a string by default
meta_amount['cachedContents']['average']


Out[6]:
'2433.2075'

In [7]:
# Use the 'where' argument to filter the data before downloading it
results = client.get(socrata_dataset_identifier, where="amount >= 2433")
print("Total number of non-null results: {}".format(meta_amount['cachedContents']['non_null']))
print("Number of results downloaded: {}".format(len(results)))
results[:3]


Total number of non-null results: 800
Number of results downloaded: 143
Out[7]:
[{'amount': '6100',
  'amount_2': 'Donation of $6,100 to Presidential elections 2008',
  'city': 'Santa Fe NM',
  'company': 'Thornburg Companies',
  'date': 'Q3/2008',
  'job': 'Finance',
  'name': 'Garrett Thornburg',
  'name2': 'Garrett Thornburg',
  'party': 'Democrat',
  'recipient': 'Barack Obama',
  'street': '150 WASHINGTON AVE',
  'update': 'Updated'},
 {'amount': '5600',
  'amount_2': 'Donation of $5,600 to Presidential elections 2008',
  'city': 'Santa Fe NM',
  'company': 'FOREST REALTY',
  'date': 'Q3/2007',
  'job': 'REAL ESTATE',
  'name': 'Michael Daly',
  'name2': 'Michael Daly',
  'party': 'Democrat',
  'recipient': 'Bill Richardson',
  'street': '305 BROWNELL HOWLAND RD',
  'update': 'Updated'},
 {'amount': '5100',
  'amount_2': 'Donation of $5,100 to Presidential elections 2008',
  'city': 'Santa Fe NM',
  'company': 'James Currey Publishers',
  'date': 'Q3/2008',
  'job': 'Publisher',
  'name': 'Douglas Johnson',
  'name2': 'Douglas Johnson',
  'party': 'Democrat',
  'recipient': 'Barack Obama',
  'street': '48 WOODS LOOP',
  'update': 'Updated'}]

Restrict columns and order rows

Often, you know which columns you want, so you can further simplify the download.

It can also be valuable to have results in order, so that you can quickly grab the largest or smallest.


In [8]:
results = client.get(socrata_dataset_identifier,
                     where="amount < 2433",
                     select="amount, job",
                     order="amount ASC")
results[:3]


Out[8]:
[{'amount': '1250', 'job': 'QA Architect'},
 {'amount': '1250', 'job': 'Artist'},
 {'amount': '1250', 'job': 'investor'}]

Perform basic operations

You can even accomplish some basic analytics operations like finding sums.

If you're planning on doing further processing, note that the numeric outputs are strings by default.


In [10]:
results = client.get(socrata_dataset_identifier,
                     group="recipient",
                     select="sum(amount), recipient",
                     order="sum(amount) DESC")
results


Out[10]:
[{'recipient': 'Bill Richardson', 'sum_amount': '1020748'},
 {'recipient': 'Barack Obama', 'sum_amount': '602865'},
 {'recipient': 'Hillary Clinton', 'sum_amount': '185887'},
 {'recipient': 'John McCain', 'sum_amount': '60741'},
 {'recipient': 'John Edwards', 'sum_amount': '25550'},
 {'recipient': 'Christopher Dodd', 'sum_amount': '14200'},
 {'recipient': 'Rudy Giuliani', 'sum_amount': '13800'},
 {'recipient': 'Ron Paul', 'sum_amount': '11500'},
 {'recipient': 'Mitt Romney', 'sum_amount': '4600'},
 {'recipient': 'Joe Biden', 'sum_amount': '4600'},
 {'recipient': 'Fred Thompson', 'sum_amount': '2075'}]

Break download into managable chunks

Sometimes you do want all the data, but it would be too big for one download.

By default, all queries have a limit of 1000 rows, but you can manually set it higher or lower. If you want to loop through results, just use offset


In [11]:
results = client.get(socrata_dataset_identifier, limit=6, select="name, amount")
results


Out[11]:
[{'amount': '6100', 'name': 'Garrett Thornburg'},
 {'amount': '5600', 'name': 'Michael Daly'},
 {'amount': '5100', 'name': 'Douglas Johnson'},
 {'amount': '5100', 'name': 'Lynn Mortensen'},
 {'amount': '5100', 'name': 'Ted Flicker'},
 {'amount': '4906', 'name': 'Jere Smith'}]

In [11]:
loop_size = 3
num_loops = 2

for i in range(num_loops):
    results = client.get(socrata_dataset_identifier,
                         select="name, amount",
                         limit=loop_size,
                         offset=loop_size * i)
    print("\n> Loop number: {}".format(i))
    
    # This simply formats the output nicely
    for result in results:
        print(result)


> Loop number: 0
{'amount': '6100', 'name': 'Garrett Thornburg'}
{'amount': '5600', 'name': 'Michael Daly'}
{'amount': '5100', 'name': 'Douglas Johnson'}

> Loop number: 1
{'amount': '5100', 'name': 'Lynn Mortensen'}
{'amount': '5100', 'name': 'Ted Flicker'}
{'amount': '4906', 'name': 'Jere Smith'}

Query strings

All of the queries above were made with method parameters, but you could also pass all the parameters at once in a SQL-like format


In [13]:
query = """
select 
    name, 
    amount
where
    amount > 1000
    and amount < 2000
limit
    5
"""

results = client.get(socrata_dataset_identifier, query=query)
results


Out[13]:
[{'amount': '1995', 'name': 'Shelley Silverstein'},
 {'amount': '1974', 'name': 'Marvin Godner'},
 {'amount': '1954', 'name': 'Stuart Ashman'},
 {'amount': '1950', 'name': 'Hope Curtis'},
 {'amount': '1950', 'name': 'David Harwell'}]

My brother just got a dog named Slider, so we were curious about how many other New York City dogs had that name.

Searches with q match anywhere in the row, which allows you to quickly search through data with several free text columns of interest.


In [20]:
nyc_dogs_domain = 'data.cityofnewyork.us'
nyc_dogs_dataset_identifier = 'nu7n-tubp'

nyc_dogs_client = Socrata(nyc_dogs_domain, socrata_token)
results = nyc_dogs_client.get(nyc_dogs_dataset_identifier,
                              q="Slider",
                              select="animalname, breedname")
results


Out[20]:
[{'animalname': 'SLIDER', 'breedname': 'American Pit Bull Terrier/Pit Bull'},
 {'animalname': 'SLIDER', 'breedname': 'Cavalier King Charles Spaniel'},
 {'animalname': 'SLIDER ', 'breedname': 'Shih Tzu'},
 {'animalname': 'SLIDER', 'breedname': 'Wheaton Terrier'}]

Going Further

There's plenty more to do! Check out Queries using SODA for additional functionality