Data Manipulation Workshop

Ann Arbor Data Dive

Instructor: Jeff Lockhart

Date: 11/11/2017, 8:30 - 9:30 AM

Materials online at: github.com/jwlockhart/data_workshops

Import packages

  • Packages contain a whole bunch of useful tools and functions for doing things in python.
  • pandas is a package of tools for working with data.
  • Here I have told python to use the abbreviation pd to refer to pandas. Programmers often do this so that we can type less.
  • matplotlib is a package for making charts and graphs, and here we're going to use the pyplot part of it and abbreviate that as plt
  • %matplotlib inline is what Jupyter Notebooks call "magic." It tells the notebook to show us the graphs in the notebook rather than saving them as files or having them pop up.

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

Load data

  • This code reads in data so that we can work with it in python.
  • We'll use different code later to save what we have done into a file so that we can use it later.
  • pandas can read and write data saved in many formats with these other functions:
    • read_csv / to_csv
    • read_json / to_json
    • read_html / to_html
    • read_clipboard / to_clipboard
    • read_excel / to_excel
    • read_hdf / to_hdf
    • read_feather / to_feather
    • read_msgpack / to_msgpack
    • read_stata / to_stata
    • read_sas
    • read_pickle / to_pickle
    • read_sql / to_sql
    • read_gbq / to_gbq (Google Big Query)

In [ ]:
gss = pd.read_csv('gss.csv')

Learn a bit about our data


In [ ]:
print("The GSS data has", gss.shape[0], "rows and", gss.shape[1], "columns.")

In [ ]:
gss.shape

In [ ]:
gss.columns

Look at our data


In [ ]:
gss.head()

In [ ]:
gss.head(10)

In [ ]:
gss.tail(3)

In [ ]:
gss['age'].head()

In [ ]:
gss[['age', 'education.num', 'education']].head()

Summary statistics


In [ ]:
gss['age'].mean()

In [ ]:
gss['age'].describe()

In [ ]:
gss['age'].describe().round(2)

Histograms


In [ ]:
gss['age'].hist()

In [ ]:
gss['age'].hist(bins=20)

Categorical data


In [ ]:
gss['education'].value_counts()

Missing data


In [ ]:
gss.head()

In [ ]:
gss = gss.replace('?', np.nan)
gss.head()

In [ ]:
gss = pd.read_csv('gss.csv', na_values=['?'])
gss.head()

In [ ]:
tmp = gss.dropna(subset=['occupation'])
print(gss.shape)
print(tmp.shape)
tmp.head()

Sorting and selecting


In [ ]:
gss.sort_values(by='age').head()

In [ ]:
gss.head()

In [ ]:
gss = gss.sort_values(by=['age', 'education.num'])
gss.head()

In [ ]:
white_18 = gss[gss['race'] == 'White']
white_18 = white_18[white_18['age'] == 18]
white_18.head()

In [ ]:
white_18.shape

In [ ]:
white_18 = gss[(gss['race'] == 'White') & (gss['age'] == 18)]
white_18.head()

Simple data manipulation

  • How many years has a person been out of school? Easy!

In [ ]:
gss['years_out'] = gss['age'] - gss['education.num']
gss.head()

In [ ]:
gss[['age', 'education.num', 'years_out', 'education']].head(10)

But is that correct?

  • We can look at summary statistics to see if they make sense. Does anyone have negative years_out of school (i.e. they went to school more years than their age)?
  • There's also someting odd going on with education.num. 9 years of education is a high school graduate? 4 years is 7th-8th grade? There may be something strange in how the data is coded; we need to look it up elsewhere to know.

In [ ]:
gss.years_out.describe()

In [ ]:
gss.years_out.min()

In [ ]:
gss[gss.years_out == 4]

More advanced data manipulation

  • Looks like we need to add 4 to our years of ed to fix it. There are several ways to do this.

In [ ]:
gss['new_years'] = gss['education.num'] + 4

In [ ]:
gss['new_years2'] = gss['education.num'].apply(lambda x: x+4)

In [ ]:
def add4(x):
    tmp = x+4
    return tmp

gss['new_years3'] = gss['education.num'].apply(add4)

In [ ]:
def add4(row):
    tmp = row['education.num'] + 4
    return tmp

gss['new_years3'] = gss.apply(add4, axis=1)

In [ ]:
gss.head()

What's the use of the more complicated ways?


In [ ]:
gss.occupation.value_counts()

In [ ]:
def is_office(job):    
    if job == 'Prof-specialty': 
        office = True
    elif job == 'Exec-managerial':
        office = True
    elif job == 'Adm-clerical':
        office = True
        #more here...
    else:
        office = False        
    
    return office

gss['office_job'] = gss['occupation'].apply(is_office)

gss[['occupation', 'office_job']].head(10)

In [ ]:
def is_office(job):
    office = False
    
    office_jobs = ['Prof-specialty', 'Exec-managerial', 'Adm-clerical',
                   'Sales', 'Tech-support']
    
    for o in office_jobs:
        if job == o:
            office= True
    
    return office

gss['office_job'] = gss['occupation'].apply(is_office)

gss[['occupation', 'office_job']].head(10)

In [ ]:
def is_office(job):
    office = False
    
    office_jobs = ['Prof-specialty', 'Exec-managerial', 'Adm-clerical',
                   'Sales', 'Tech-support']
    
    if job in office_jobs:
        office = True
    
    return office

gss['office_job'] = gss['occupation'].apply(is_office)

gss[['occupation', 'office_job']].head(10)

Groupby


In [ ]:
by_age = gss.groupby('age').mean()
by_age.head()

In [ ]:
by_age['hours.per.week'].plot.line()

In [ ]:
by_age = gss.groupby('age')

hours = pd.DataFrame()
hours['mean'] = by_age['hours.per.week'].mean()
hours['error'] = by_age['hours.per.week'].sem()
hours.head()

In [ ]:
hours.plot.line(y='mean', yerr='error', 
                title='Average hours worked by age, with error.')

In [ ]:
by_age_sex = gss.groupby(['sex', 'age'])

hours = pd.DataFrame()
hours['mean'] = by_age_sex['hours.per.week'].mean()
hours['error'] = by_age_sex['hours.per.week'].sem()
hours.head()

In [ ]:
hours = hours.unstack(level=0)
hours.head()

In [ ]:
hours.plot.line(y='mean', #yerr='error', 
                title='Average hours worked by age and sex')

In [ ]:
gss.boxplot(column='hours.per.week', by=['sex', 'office_job'], figsize=(8,8))

In [ ]:
by_job = gss.groupby(['occupation'])

hours = pd.DataFrame()
hours['mean'] = by_job['hours.per.week'].mean()
hours['error'] = by_job['hours.per.week'].sem()
hours

In [ ]:
hours = hours.sort_values(by='mean')
hours.plot.barh(y='mean', xerr='error')

In [ ]: