In [1]:
import os
os.chdir('/home/potterzot/code/learning/exploratorydataanalysis')
Loading CSV data and subsetting:
In [2]:
import pandas as ps
import numpy as np
data = ps.read_csv('data/stateData.csv')  # read_csv already returns a DataFrame
In [3]:
data.columns
Out[3]:
In [4]:
data['highSchoolGrad'].values
Out[4]:
In [5]:
%matplotlib inline
import matplotlib.pyplot as plt
# Line plot of life expectancy against population
plt.plot(data['population'], data['life.exp'])
plt.show()
Alternatively, we can use ggplot:
In [125]:
from ggplot import *
ggplot(data, aes('population', 'life.exp')) + \
    geom_line()
Out[125]:
We want to use the reddit data here, but for some reason reading it crashes at line 982 because of improperly encoded text, so we read only the first 981 lines.
The goal is to create an ordered variable from these data.
In [130]:
import csv
reddit = []
with open('data/reddit.csv', 'r') as f:
    r = csv.reader(f)
    for i, l in enumerate(r):
        reddit.append(l)
        if i == 981:  # stop before the badly encoded line
            break
rdf = ps.DataFrame(reddit)
rdf.columns
Out[130]:
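A crash partway through a file is consistent with a byte sequence that is not valid in the encoding Python assumes. As an alternative to truncating the file, here is a minimal sketch that assumes the file is mostly UTF-8 (the real encoding of reddit.csv is not known here). Opening with errors='replace' turns undecodable bytes into U+FFFD instead of raising. The helper name read_csv_rows and the demo file are hypothetical.

```python
import csv
import os
import tempfile


def read_csv_rows(path, encoding="utf-8"):
    """Return all rows of a CSV file as lists of strings.

    Bytes that cannot be decoded with `encoding` are replaced with
    U+FFFD rather than raising UnicodeDecodeError.
    """
    # newline="" is what the csv module documentation recommends for files
    with open(path, "r", encoding=encoding, errors="replace", newline="") as f:
        return list(csv.reader(f))


# Tiny made-up file with one invalid UTF-8 byte (0xff) in the last row.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "wb") as f:
    f.write(b"id,age.range\n1,18-24\n2,25-\xff34\n")

demo_rows = read_csv_rows(demo_path)
```

With the real file, read_csv_rows('data/reddit.csv') would return the header as the first row, so it could be passed to ps.DataFrame in the same way as the reddit list.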
So we have some good data, but the header row was read as an ordinary data row, so the columns are just numbered instead of named.
In [153]:
rdf = ps.DataFrame(reddit[1:],columns=reddit[0])
rdf.columns
Out[153]:
Okay, now our columns have header names and we can plot using them, but we have another problem. Some cells contain the string 'NA' rather than a true missing value, so let's fix that.
In [151]:
for col in rdf.columns:
    # Use .loc to avoid chained assignment, which may modify a copy
    rdf.loc[rdf[col] == 'NA', col] = np.nan
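The same substitution can also be done for the whole frame in one call. Below is a minimal sketch that uses a small made-up frame in place of rdf (the column 'other' and all values are illustrative, not taken from reddit.csv). DataFrame.replace swaps every cell equal to 'NA' for NaN, and isnull().sum() counts the missing values per column as a check.

```python
import numpy as np
import pandas as ps

# Made-up stand-in for rdf; every value is a string, as with csv.reader.
demo = ps.DataFrame({
    'age.range': ['18-24', 'NA', '25-34'],
    'other': ['0', '1', 'NA'],
})

# Replace every cell equal to the string 'NA' with a real missing value.
cleaned = demo.replace('NA', np.nan)

# Count missing values per column to confirm the replacement.
missing_per_column = cleaned.isnull().sum()
```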
In [176]:
ggplot(rdf, aes('age.range')) + \
    geom_bar(color='black', fill='steelblue') + \
    xlab('Age') + \
    ylab('Number of People')
Out[176]:
We don't like the order of the age bins, and unfortunately it doesn't seem that pandas, matplotlib, or ggplot make it easy to reorder them, which is a big drawback of Python compared to R.
In [191]:
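# Desired display order of the age bins
order = ['Under 18', '18-24', '25-34', '35-44', '45-54', '55-64', '65 or Above']
# Ordered categorical; values not listed in order become missing
age_ordered = ps.Categorical(rdf['age.range'], categories=order, ordered=True)
age_ordered.categories
rdf['age.ordered'] = age_ordered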
So far, when we graph the new column, we still get the same unordered bars. There appear to be ways to make this work in matplotlib by specifying each bar individually, but they are overly complicated.
In [192]:
ggplot(rdf, aes('age.ordered')) + \
    geom_bar(color='black', fill='steelblue') + \
    xlab('Age') + \
    ylab('Number of People')
Out[192]:
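One workaround that avoids specifying each bar by hand is to count the bins first and put the counts in the desired order before plotting. This is only a sketch, assuming pandas alone and a made-up age column standing in for rdf['age.range']. value_counts tallies each bin and reindex arranges the tallies in the given order, filling bins that never occur with 0.

```python
import pandas as ps

order = ['Under 18', '18-24', '25-34', '35-44', '45-54', '55-64', '65 or Above']

# Made-up stand-in for rdf['age.range'].
ages = ps.Series(['25-34', '18-24', 'Under 18', '25-34', '18-24', '25-34'])

# Tally each bin, then arrange the tallies in the desired order.
# Bins that never occur get a count of 0 instead of NaN.
counts = ages.value_counts().reindex(order, fill_value=0)
```

Calling counts.plot(kind='bar') in the notebook should then draw the bars in index order, that is, in the order given by order. This goes through pandas plotting rather than ggplot, so it loses the ggplot styling used above.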