In [1]:
from __future__ import print_function
When analyzing data, I usually use the following three modules. I use pandas for data management, filtering, grouping, and processing. I use numpy for basic array math. I use toyplot for rendering the charts.
In [2]:
import pandas
import numpy
import toyplot
import toyplot.pdf
import toyplot.png
import toyplot.svg
print('Pandas version: ', pandas.__version__)
print('Numpy version: ', numpy.__version__)
print('Toyplot version: ', toyplot.__version__)
Load in the "auto" dataset. This is a fun collection of data on cars manufactured between 1970 and 1982. The source for this data can be found at https://archive.ics.uci.edu/ml/datasets/Auto+MPG.
The data are stored in a text file of whitespace-separated columns. We use the pandas.read_table() function to parse the file and load it into a pandas DataFrame. The file does not contain a header row, so we need to specify the names of the columns manually.
In [3]:
column_names = ['MPG',
                'Cylinders',
                'Displacement',
                'Horsepower',
                'Weight',
                'Acceleration',
                'Model Year',
                'Origin',
                'Car Name']
# sep=r'\s+' splits on runs of whitespace; it replaces delim_whitespace=True,
# which newer pandas releases deprecate.
data = pandas.read_table('auto-mpg.data',
                         sep=r'\s+',
                         names=column_names,
                         index_col=False)
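To see what the parsing step does without the data file on hand, here is a minimal sketch that reads two rows from an in-memory string. The rows are typed in by hand in the same layout purely for illustration, and `sample_text`, `sample_columns`, and `sample` are names made up for this example.

```python
import io
import pandas

# Two hand-typed rows in the whitespace-separated layout (illustration only).
sample_text = (
    u'18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"\n'
    u'15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"\n'
)
sample_columns = ['MPG', 'Cylinders', 'Displacement', 'Horsepower',
                  'Weight', 'Acceleration', 'Model Year', 'Origin',
                  'Car Name']

# No header row, so the column names are passed in explicitly.
sample = pandas.read_table(io.StringIO(sample_text),
                           sep=r'\s+',
                           names=sample_columns,
                           index_col=False)
print(sample[['MPG', 'Car Name']])
```

The quoted car name stays together as one field even though it contains spaces, which is why nine column names are enough.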
For this analysis I am going to group data by the car maker. The make is not directly stored in the data, but all the names start with the make, so extract the first word in that column.
In [4]:
data['Make'] = data['Car Name'].str.split().str.get(0)
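As a quick check of how this extraction behaves, here is a small sketch on a hand-made Series. The three names are just sample strings, and `names` and `makes` are variables invented for the example.

```python
import pandas

names = pandas.Series(['chevrolet chevelle malibu',
                       'buick skylark 320',
                       'vw rabbit'])

# split() with no arguments breaks each string on whitespace;
# get(0) then takes the first piece of each resulting list.
makes = names.str.split().str.get(0)
print(makes.tolist())
```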
The data has some inconsistencies with the make strings (misspellings or alternate spellings). Do some simple fixes.
In [5]:
# .loc replaces the .ix indexer, which has been removed from pandas.
data.loc[data['Make'] == 'chevroelt', 'Make'] = 'chevrolet'
data.loc[data['Make'] == 'chevy', 'Make'] = 'chevrolet'
data.loc[data['Make'] == 'maxda', 'Make'] = 'mazda'
data.loc[data['Make'] == 'mercedes-benz', 'Make'] = 'mercedes'
data.loc[data['Make'] == 'vokswagen', 'Make'] = 'volkswagen'
data.loc[data['Make'] == 'vw', 'Make'] = 'volkswagen'
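If the list of fixes grows, one alternative is to keep them in a dictionary and apply them all at once with Series.replace(). This is only a sketch of that idea on a toy Series, not what the cells above do; `make_fixes`, `makes`, and `cleaned` are names made up here.

```python
import pandas

# Hypothetical lookup table: misspelling or alias -> canonical make.
make_fixes = {
    'chevroelt': 'chevrolet',
    'chevy': 'chevrolet',
    'maxda': 'mazda',
    'mercedes-benz': 'mercedes',
    'vokswagen': 'volkswagen',
    'vw': 'volkswagen',
}

makes = pandas.Series(['chevy', 'maxda', 'ford', 'vw'])

# Values that match a key are replaced; everything else is left alone.
cleaned = makes.replace(make_fixes)
print(cleaned.tolist())
```

The dictionary keeps every correction in one place, which makes it easier to review than a run of separate assignments.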
In this plot we are going to show the average miles per gallon (MPG) rating for each car maker. We can use the pivot_table feature of pandas to get this information from the data. (Excel and other spreadsheets have similar functionality.)
In [6]:
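# Pivot on index= and select the 'MPG' column so the result is a Series
# indexed by make; newer pandas always returns a DataFrame from pivot_table.
average_mpg_per_make = data.pivot_table(index='Make',
                                        values='MPG',
                                        aggfunc='mean')['MPG']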
len(average_mpg_per_make.index)
Out[6]:
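For readers more used to groupby than to pivot tables, here is a minimal sketch showing that the two give the same per-make average on a toy table. The three rows are made up, and `toy`, `by_pivot`, and `by_group` are names invented for the example.

```python
import pandas

# Made-up rows, just enough to have one make with two cars.
toy = pandas.DataFrame({'Make': ['ford', 'ford', 'mazda'],
                        'MPG': [20.0, 30.0, 40.0]})

# Pivot with the make as the row index, then pick out the MPG column.
by_pivot = toy.pivot_table(index='Make', values='MPG', aggfunc='mean')['MPG']

# The same aggregation written as a groupby.
by_group = toy.groupby('Make')['MPG'].mean()

print(by_pivot.tolist())
print(by_group.tolist())
```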
There are many different makers represented in this data set, but several have only a few cars and therefore may not be a significant sample. Filter out the car makers that have fewer than 10 entries in the data. (Mostly I'm doing this to make these examples fit better, even though it works OK with all the data, too.)
In [7]:
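# As with the averages, select the 'MPG' column to get a Series of counts.
count_mpg_per_make = data.pivot_table(index='Make',
                                      values='MPG',
                                      aggfunc='count')['MPG']
# Keep makes with at least 10 entries, highest average MPG first.
filtered_mpg = \
    average_mpg_per_make[count_mpg_per_make >= 10]. \
    sort_values(ascending=False)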
filtered_mpg
Out[7]:
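The filter works because the average and count Series share the same index, so a comparison on one can be used as a mask on the other. Here is a small sketch on made-up rows, with a threshold of 2 standing in for 10; all the names in it are invented for the example.

```python
import pandas

toy = pandas.DataFrame({
    'Make': ['ford', 'ford', 'ford', 'mazda', 'fiat', 'fiat'],
    'MPG': [20.0, 30.0, 25.0, 40.0, 30.0, 36.0],
})

grouped = toy.groupby('Make')['MPG']
average = grouped.mean()   # fiat 33.0, ford 25.0, mazda 40.0
count = grouped.count()    # fiat 2, ford 3, mazda 1

min_entries = 2  # stand-in for the threshold of 10 used on the real data

# The boolean Series (count >= min_entries) selects rows of `average`.
kept = average[count >= min_entries].sort_values(ascending=False)
print(kept)
```

Here mazda has the best average but only one entry, so it is the make that drops out.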
Add a column with a car maker index so that we can plot by index. Because we kept only the manufacturers with at least 10 entries, cars from any other make get no index (NaN) and are dropped when we plot.
In [8]:
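# range() works on both Python 2 and 3 (xrange is Python 2 only).
make_to_index = pandas.Series(index=filtered_mpg.index,
                              data=range(0, len(filtered_mpg)))
# map() looks up each make and leaves NaN for makes that were filtered out.
data['Make Index'] = data['Make'].map(make_to_index)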
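To make the lookup concrete, here is a minimal sketch of mapping make names to positions with Series.map(). The makes are toy values, and `kept_makes`, `makes`, and `make_index` are names made up for the example.

```python
import pandas

# Pretend only these two makes survived the filter, in plot order.
kept_makes = ['fiat', 'ford']
make_to_index = pandas.Series(range(len(kept_makes)), index=kept_makes)

makes = pandas.Series(['ford', 'mazda', 'fiat'])

# Each make is looked up in make_to_index; 'mazda' is not there, so it
# comes back as NaN and can be removed later with dropna().
make_index = makes.map(make_to_index)
print(make_index)
```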
Now use toyplot to plot the MPG of every car (that matches our criteria), organized by manufacturer.
In [9]:
canvas = toyplot.Canvas('4in', '2.6in')
axes = canvas.cartesian(bounds=(41,-9,6,-58),
                        ylabel='MPG')
axes.scatterplot(data.dropna()['Make Index'],
                 data.dropna()['MPG'],
                 marker='-',
                 size=15,
                 opacity=0.75)
# Label the x axis on the make. This is a bit harder than it should be.
axes.x.ticks.locator = \
    toyplot.locator.Explicit(labels=filtered_mpg.index)
axes.x.ticks.labels.angle = 45
# It's usually best to make the y-axis 0-based.
axes.y.domain.min = 0
In [10]:
toyplot.pdf.render(canvas, 'Detail.pdf')
toyplot.svg.render(canvas, 'Detail.svg')
toyplot.png.render(canvas, 'Detail.png', scale=5)