Lesson 2: Basic Data Work With Python

This lesson reproduces in Python the steps used in R for Exploratory Data Analysis.

First, change the working directory in Python:


In [1]:
import os
os.chdir('/home/potterzot/code/learning/exploratorydataanalysis')

Loading CSV data and subsetting:


In [2]:
import pandas as ps
import numpy as np
# read_csv already returns a DataFrame, so no extra wrapping is needed
data = ps.read_csv('data/stateData.csv')

In [3]:
data.columns


Out[3]:
Index(['Unnamed: 0', 'state.abb', 'state.area', 'state.region', 'population', 'income', 'illiteracy', 'life.exp', 'murder', 'highSchoolGrad', 'frost', 'area'], dtype='object')

In [4]:
data['highSchoolGrad'].values


Out[4]:
array([ 41.3,  66.7,  58.1,  39.9,  62.6,  63.9,  56. ,  54.6,  52.6,
        40.6,  61.9,  59.5,  52.6,  52.9,  59. ,  59.9,  38.5,  42.2,
        54.7,  52.3,  58.5,  52.8,  57.6,  41. ,  48.8,  59.2,  59.3,
        65.2,  57.6,  52.5,  55.2,  52.7,  38.5,  50.3,  53.2,  51.6,
        60. ,  50.2,  46.4,  37.8,  53.3,  41.8,  47.4,  67.3,  57.1,
        47.8,  63.5,  41.6,  54.5,  62.9])
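The cell above selects a single column. Row subsetting works with boolean masks. A minimal sketch on a small made-up frame (the state codes, values, and the threshold of 60 are illustrative and not taken from stateData.csv):

```python
import pandas as ps

# Hypothetical toy data standing in for stateData.csv
toy = ps.DataFrame({
    'state.abb': ['AA', 'BB', 'CC', 'DD'],
    'highSchoolGrad': [41.3, 66.7, 58.1, 62.6],
    'murder': [15.1, 11.3, 7.8, 10.3],
})

# Boolean mask: keep only rows where the graduation rate exceeds 60
high_grad = toy[toy['highSchoolGrad'] > 60]

# .loc combines a row mask with a column selection
subset = toy.loc[toy['highSchoolGrad'] > 60, ['state.abb', 'murder']]
print(subset)
```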

Plotting

Plotting in Python is done with the matplotlib and/or ggplot libraries. The %pylab inline magic renders plots inside the IPython notebook.


In [5]:
%matplotlib
%pylab inline
# Import pyplot directly; plt is the conventional alias for matplotlib.pyplot
import matplotlib.pyplot as plt
p = plt.plot(data['population'], data['life.exp'])
plt.show()


Using matplotlib backend: Qt4Agg
Populating the interactive namespace from numpy and matplotlib
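A line plot connects points in row order, so unsorted data produces a zigzag. A minimal sketch, assuming made-up numbers in place of the state data, that sorts by the x variable before drawing the line and also draws a scatter plot as an alternative. The Agg backend is selected only so the sketch runs without a display; inside a notebook that line would not be needed.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend for running outside a notebook
import matplotlib.pyplot as plt
import pandas as ps

# Hypothetical values standing in for the population and life.exp columns
toy = ps.DataFrame({
    'population': [3615, 365, 2212, 21198, 2541],
    'life.exp': [69.05, 69.31, 70.55, 71.71, 72.06],
})

# Sort by the x variable so the line runs left to right
ordered = toy.sort_values('population')

fig, (ax_line, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))
ax_line.plot(ordered['population'], ordered['life.exp'])
ax_scatter.scatter(toy['population'], toy['life.exp'])
for ax in (ax_line, ax_scatter):
    ax.set_xlabel('population')
    ax.set_ylabel('life.exp')
fig.tight_layout()
```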

Alternatively, we can use ggplot:


In [125]:
from ggplot import *
ggplot(data, aes('population', 'life.exp')) + \
    geom_line()


Out[125]:
<ggplot: (-9223363305208738602)>

We want to use the reddit data here, but reading it crashes at line 982 because of an encoding problem, so we read only the first 981 lines.

The goal is to create an ordered variable. First, load the data:


In [130]:
import csv
# Read row by row and stop once the row with index 981 has been appended
reddit = []
with open('data/reddit.csv', 'r') as f:
    for i, row in enumerate(csv.reader(f)):
        reddit.append(row)
        if i == 981:
            break
rdf = ps.DataFrame(reddit)
rdf.columns


Out[130]:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13], dtype='int64')

The data loaded, but the column names are just integers because the header row was read as an ordinary data row. Rebuild the frame using the first row as the header:


In [153]:
rdf = ps.DataFrame(reddit[1:],columns=reddit[0])
rdf.columns


Out[153]:
Index(['id', 'gender', 'age.range', 'marital.status', 'employment.status', 'military.service', 'children', 'education', 'country', 'state', 'income.range', 'fav.reddit', 'dog.cat', 'cheese'], dtype='object')
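pandas can also handle the header row and the row limit itself. A minimal sketch, assuming a small in-memory CSV in place of data/reddit.csv (the sample rows are invented), where `nrows` caps the number of data rows read:

```python
import io
import pandas as ps

# Invented sample standing in for data/reddit.csv
sample = io.StringIO(
    "id,gender,age.range\n"
    "1,0,25-34\n"
    "2,1,18-24\n"
    "3,0,35-44\n"
)

# The first line becomes the column names; nrows=2 stops after two data rows
head = ps.read_csv(sample, nrows=2)
print(head.columns.tolist())
print(len(head))
```

For a real file, `read_csv` also accepts an `encoding` argument; which encoding the reddit file needs is not known here.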

Now the columns have proper names and we can refer to them when plotting. There is another problem: some missing values are stored as the string 'NA' rather than as actual missing values, so let's fix that.


In [151]:
for col in rdf.columns:
    # .loc assigns on the frame itself and avoids the chained-assignment pitfall
    rdf.loc[rdf[col] == 'NA', col] = None
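The same replacement can be written without a loop. A minimal sketch on invented rows, using `DataFrame.replace` to turn the string 'NA' into a real missing value and `isna` to count the result per column:

```python
import numpy as np
import pandas as ps

# Invented rows standing in for the reddit survey data
toy = ps.DataFrame({
    'age.range': ['25-34', 'NA', '18-24'],
    'cheese': ['NA', 'Cheddar', 'NA'],
})

# replace returns a new frame; the original is left untouched
cleaned = toy.replace('NA', np.nan)

# Count missing values in each column
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```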

In [176]:
ggplot(rdf, aes('age.range')) + \
  geom_bar(color='black', fill='steelblue') + \
  xlab('Age') + \
  ylab('Number of People')


Out[176]:
<ggplot: (8731645970096)>

We don't like the order of the age bins. Unfortunately, the pandas / matplotlib / ggplot combination used here does not seem to offer a simple way to order them, which is a drawback compared to R.


In [191]:
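order = ['Under 18', '18-24', '25-34', '35-44', '45-54', '55-64', '65 or Above']
# Pass the desired order explicitly; from_array and .levels were removed in later pandas versions
age_ordered = ps.Categorical(rdf['age.range'], categories=order, ordered=True)
age_ordered.categories
rdf['age.ordered'] = age_ordered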

And so far, the graph still comes out with the same unordered bins. There appear to be ways to make this work with matplotlib by specifying each bar, but they are overly complicated.


In [192]:
ggplot(rdf, aes('age.ordered')) + \
  geom_bar(color='black', fill='steelblue') + \
  xlab('Age') + \
  ylab('Number of People')


Out[192]:
<ggplot: (-9223363305210554357)>
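One way to get the bins in the desired order, sketched here with pandas alone on invented responses: count the values, then reindex the counts to the order we want. This is an alternative to the ggplot call above, not a fix for it.

```python
import pandas as ps

order = ['Under 18', '18-24', '25-34', '35-44', '45-54', '55-64', '65 or Above']

# Invented responses standing in for rdf['age.range']
ages = ps.Series(['25-34', '18-24', '25-34', 'Under 18', '35-44', '18-24', '25-34'])

# Count each bin, then put the counts in the chosen order;
# fill_value=0 keeps bins that have no responses
counts = ages.value_counts().reindex(order, fill_value=0)
print(counts)
```

The resulting Series could then be drawn with `counts.plot(kind='bar')`, which draws the bars in index order.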