The American Time Use Survey (ATUS) collects data on how people spend their time: at work, doing household chores, watching TV, and so on. It's a fascinating set of information, and one that academics and journalists have put to good use.
The survey includes a number of related datasets. Here we read in the Activity Summary table from 2014. It's a zipped csv. The easiest way to access it is to download the zip file, unzip it, and read the csv inside. But why do it the easy way? We favor automation, so we read the url into Python and use zip tools to grab the data we want. It's a standard set of steps, worth getting used to.
This IPython notebook was created by Dave Backus and Arnav Sood in Python 3.5 for the NYU Stern course Data Bootcamp.
In [2]:
import pandas as pd # data package
import requests, io # internet and input tools
import zipfile as zf # zip file tools
import sys # system module, used to get Python version
import datetime as dt # date tools, used to note the current date
print('\nPython version: ', sys.version)
print('Pandas version: ', pd.__version__)
print('Requests version: ', requests.__version__)
print("Today's date:", dt.date.today())
The data comes as a zip file that contains a csv (cleverly labeled .dat) and a few others we can ignore. Automated data entry involves these steps:

Get the file. We use the requests package. This kind of thing happens behind the scenes in the Pandas read_csv() and read_excel() functions, but here we need to do it for ourselves.

Convert the contents to a file. Requests gives us the contents as bytes; io.BytesIO reconstructs them as a file, here a zip file.

Unzip and read. The zipfile module lets us list the zip's components and open the one we want, which we read with read_csv as usual.

We found this Stack Overflow exchange helpful.
Digression. This is probably more than you want to know, but it's a reminder of what goes on behind the scenes when we apply read_csv to a url. Here we grab whatever is at the url. Then we take its contents as bytes, identify them as a zip file, and read the components using read_csv. It's a lot easier when this happens automatically, but it's a reminder of what's involved if we ever have to look into the details.
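For contrast, here is the automatic path as a minimal sketch: when a url points directly at a plain csv, read_csv handles the download by itself. The url below is a placeholder, not a real file.
In [ ]:
# the automatic version: read_csv fetches the url itself
# (this url is a placeholder, not a real file)
df_direct = pd.read_csv('http://example.com/somefile.csv')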
In [4]:
# get "response" from url
url = 'http://www.bls.gov/tus/special.requests/atussum_2014.zip'
r = requests.get(url)
print('Response type:', type(r))
print('Response content:', type(r.content))
print('Response headers:\n', r.headers, sep='')
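# check that the download succeeded; requests raises an error if it didn't
r.raise_for_status()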
#%%
# convert bytes to zip file
atuz = zf.ZipFile(io.BytesIO(r.content))
print('Type of zipfile object:', type(atuz))
# what's in the zip file?
atuz.namelist()
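#%%
# optional: write the zip's contents to disk as well
# (extractall is part of the zipfile module; the directory name is our choice)
# atuz.extractall('atus_2014')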
#%%
# read datafile (atussum_2014.dat is a csv)
#df = pd.read_csv(atuz.open(atuz.namelist()[1]))
df = pd.read_csv(atuz.open('atussum_2014.dat'))
print('Dimensions:', df.shape)
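A quick look at the first few rows is a natural check that the read worked:
In [ ]:
df.head()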
In [6]:
# check properties of a subset of variables (positional selection with iloc)
print('Variables and their dtypes:\n', df.iloc[:, :30].dtypes, sep='')
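If we want the activity variables as a group, a name-based filter is an alternative to counting columns. A sketch, assuming the atussum naming convention that activity columns start with t followed by the six-digit activity code:
In [ ]:
# select activity columns by name, assuming they look like t010101
activity = df.filter(regex='^t[0-9]', axis=1)
print('Number of activity columns:', activity.shape[1])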
In [8]:
print('Value counts of some variables')
for var in df.columns[:20]:
    print('\n', df[var].value_counts().head(), sep='')
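To look at one variable in isolation, value_counts works the same way. A sketch, assuming the ATUS demographic variable TESEX (respondent's sex) appears under that name:
In [ ]:
# a single variable, assuming the column is named TESEX
df['TESEX'].value_counts()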