American Time Use Survey

The American Time Use Survey (ATUS) collects data on how people spend their time: at work, doing household chores, watching TV, and so on. It's a fascinating set of information, and one that academics and journalists have put to good use.

The survey includes a number of related datasets. Here we read in the Activity Summary table from 2014. It's a zipped csv. The easiest way to access it is to download the zip file, unzip it, and read the csv inside. But why do it the easy way? We favor automation, so we read the url into Python and use zip tools to grab the data we want. It's a standard set of steps, worth getting used to.

This IPython notebook was created by Dave Backus and Arnav Sood in Python 3.5 for the NYU Stern course Data Bootcamp.

Import packages


In [2]:
import pandas as pd             # data package
import requests, io             # internet and input tools  
import zipfile as zf            # zip file tools 
import sys                      # system module, used to get Python version 
import datetime as dt           # date tools, used to note the current date 

print('\nPython version: ', sys.version) 
print('Pandas version: ', pd.__version__)
print('Requests version: ', requests.__version__)
print("Today's date:", dt.date.today())


Python version:  3.5.1 |Anaconda 2.5.0 (64-bit)| (default, Jan 29 2016, 15:01:46) [MSC v.1900 64 bit (AMD64)]
Pandas version:  0.17.1
Requests version:  2.9.1
Today's date: 2016-02-27

Data input

The data comes as a zip file that contains a csv (cleverly labeled .dat) and a few other files we can ignore. Automated data entry involves these steps:

  • Get the file. This uses the requests package, which handles internet files and comes pre-installed with Anaconda. This kind of thing was hidden behind the scenes in the Pandas read_csv() and read_excel() functions, but here we need to do it for ourselves.
  • Convert to zip. Requests simply loads whatever's at the given url, which arrives as bytes. The io module's io.BytesIO wraps those bytes in a file-like object that the zip tools can open, here as a zip file.
  • Unzip the file. We use the zipfile module, which is part of core Python, to extract the files inside.
  • Read in the csv. We use read_csv as usual.

We found this Stack Overflow exchange helpful.
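
The last three steps can be tried without an internet connection. The following is a minimal sketch, not the notebook's own code: it builds a small zip archive in memory to stand in for the downloaded bytes, then converts, unzips, and reads it. The member names (readme.txt, toy_2014.dat) and the csv contents are made up for illustration.

```python
import io
import zipfile as zf

import pandas as pd

# Stand-in for the download: build a small zip archive in memory.
# Member names and csv contents are hypothetical, chosen for illustration.
buffer = io.BytesIO()
with zf.ZipFile(buffer, mode='w') as archive:
    archive.writestr('readme.txt', 'documentation we can ignore')
    archive.writestr('toy_2014.dat', 'caseid,age,minutes\n1,34,120\n2,51,45\n')
raw_bytes = buffer.getvalue()      # plays the role of requests.get(url).content

# Convert to zip: wrap the bytes in a file-like object and open it as a zip file
toy_zip = zf.ZipFile(io.BytesIO(raw_bytes))
print(toy_zip.namelist())

# Read in the csv: read_csv does not care that the extension is .dat
with toy_zip.open('toy_2014.dat') as member:
    toy = pd.read_csv(member)
print(toy.shape)
```

With real data, raw_bytes would come from the response to requests.get(); everything after that line works the same way.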

Digression. This is probably more than you want to know, but it's a reminder of what goes on behind the scenes when we apply read_csv to a url. Here we grab whatever is at the url. Then we take its contents as bytes, wrap them as a zip file, and read the component we want using read_csv. It's a lot easier when this happens automatically, but it's useful to know what's involved if we ever have to look into the details.


In [4]:
# get "response" from url 
url = 'http://www.bls.gov/tus/special.requests/atussum_2014.zip'
r = requests.get(url) 

print('Response type:', type(r))
print('Response content:', type(r.content)) 
print('Response headers:\n', r.headers, sep='')

#%%
# convert bytes to zip file  
atuz = zf.ZipFile(io.BytesIO(r.content))   
print('Type of zipfile object:', type(atuz))

# what's in the zip file?
atuz.namelist()

#%%
# read datafile (atussum_2014.dat is a csv) 
#df  = pd.read_csv(atuz.open(atuz.namelist()[1]))
df  = pd.read_csv(atuz.open('atussum_2014.dat'))
print('Dimensions:', df.shape)


Response type: <class 'requests.models.Response'>
Response content: <class 'bytes'>
Response headers:
{'Content-Length': '776422', 'Content-Type': 'application/x-zip-compressed', 'Last-Modified': 'Fri, 12 Jun 2015 14:56:49 GMT', 'Connection': 'keep-alive', 'P3P': 'CP="NOI DSP COR NID CURa ADMa OUR STP"', 'Date': 'Sat, 27 Feb 2016 19:09:08 GMT', 'Cache-Control': 'no-cache', 'ETag': '"5a318a220a5d01:0"', 'Server': 'Microsoft-IIS/7.5', 'Accept-Ranges': 'bytes', 'PMP': 'IIS-MSFT'}
Type of zipfile object: <class 'zipfile.ZipFile'>
Dimensions: (11592, 409)
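
The commented-out line above picks the data file by its position in namelist(), which depends on the order of files in the archive. One alternative is to pick it by extension. This is a sketch under stated assumptions: find_members is a hypothetical helper, not part of zipfile, and the toy archive stands in for the real download.

```python
import io
import zipfile as zf


def find_members(archive, suffix='.dat'):
    """Return names of archive members ending in suffix (case-insensitive)."""
    return [name for name in archive.namelist()
            if name.lower().endswith(suffix)]


# Toy archive with made-up member names, built in memory for illustration
buffer = io.BytesIO()
with zf.ZipFile(buffer, mode='w') as archive:
    archive.writestr('toy_info.txt', 'notes')
    archive.writestr('toy_2014.dat', 'a,b\n1,2\n')

toy_zip = zf.ZipFile(io.BytesIO(buffer.getvalue()))
dat_files = find_members(toy_zip)
print(dat_files)
```

If the list holds exactly one name, it can be passed to the archive's open() method and then to read_csv, as in the cell above.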

In [6]:
# try properties of subsets (first 30 columns, selected by position)
print('Variables and their dtypes:\n', df.iloc[:, :30].dtypes, sep='')


Variables and their dtypes:
tucaseid        int64
TUFINLWGT     float64
TRYHHCHILD      int64
TEAGE           int64
TESEX           int64
PEEDUCA         int64
PTDTRACE        int64
PEHSPNON        int64
GTMETSTA        int64
TELFS           int64
TEMJOT          int64
TRDPFTPT        int64
TESCHENR        int64
TESCHLVL        int64
TRSPPRES        int64
TESPEMPNOT      int64
TRERNWA         int64
TRCHILDNUM      int64
TRSPFTPT        int64
TEHRUSLT        int64
TUDIARYDAY      int64
TRHOLIDAY       int64
TRTEC           int64
TRTHH           int64
t010101         int64
t010102         int64
t010201         int64
t010299         int64
t010301         int64
t010399         int64
dtype: object
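
The listing shows two kinds of column names: respondent variables such as TEAGE, and names such as t010101. The sketch below separates the two groups by name pattern. It assumes the second group always follows the pattern "t plus six digits", which the names above suggest but the notebook does not confirm. The toy dataframe and its numbers are made up.

```python
import pandas as pd

# Toy dataframe with a few column names copied from the listing above;
# the values are hypothetical.
toy = pd.DataFrame({
    'tucaseid': [1, 2],
    'TEAGE': [34, 51],
    't010101': [480, 510],
    't010102': [0, 15],
})

# Columns whose names are 't' followed by exactly six digits (assumed pattern)
activity = toy.filter(regex=r'^t\d{6}$')

# Everything else
other = toy.drop(columns=activity.columns)

print(list(activity.columns))
print(list(other.columns))
```

Note that tucaseid also starts with t but does not match, because the pattern requires six digits after the t.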

In [8]:
print('Value counts of some variables')
for var in df.columns[:20]:
    print('\n', df[var].value_counts().head(), sep='')


Value counts of some variables

20140101142230    1
20140101141515    1
20140201141303    1
20141008142036    1
20140504140850    1
Name: tucaseid, dtype: int64

7989599.172399    2
4718611.349299    2
2703057.840695    2
2698846.655130    2
2710184.083009    2
Name: TUFINLWGT, dtype: int64

-1    6703
 1     456
 0     373
 2     325
 4     320
Name: TRYHHCHILD, dtype: int64

80    336
85    261
37    239
34    237
44    232
Name: TEAGE, dtype: int64

2    6468
1    5124
Name: TESEX, dtype: int64

39    2893
43    2436
40    2018
44    1161
42     674
Name: PEEDUCA, dtype: int64

1    9176
2    1685
4     469
3      65
7      61
Name: PTDTRACE, dtype: int64

2    9879
1    1713
Name: PEHSPNON, dtype: int64

1    9628
2    1875
3      89
Name: GTMETSTA, dtype: int64

1    6688
5    4125
4     443
2     274
3      62
Name: TELFS, dtype: int64

 2    6346
-1    4630
 1     616
Name: TEMJOT, dtype: int64

 1    5551
-1    4630
 2    1411
Name: TRDPFTPT, dtype: int64

-1    5571
 2    5027
 1     994
Name: TESCHENR, dtype: int64

-1    10598
 2      603
 1      391
Name: TESCHLVL, dtype: int64

3    5571
1    5562
2     459
Name: TRSPPRES, dtype: int64

-1    5571
 1    3998
 2    2023
Name: TESPEMPNOT, dtype: int64

-1         5374
 288461     242
 60000      132
 100000     112
 115384     111
Name: TRERNWA, dtype: int64

0    6703
1    2051
2    1801
3     762
4     208
Name: TRCHILDNUM, dtype: int64

-1    7594
 1    3220
 2     587
 3     191
Name: TRSPFTPT, dtype: int64

-1     4630
 40    2697
 50     573
-4      399
 45     380
Name: TEHRUSLT, dtype: int64
