In [1]:
from __future__ import print_function

When analyzing data, I usually use the following three modules. I use pandas for data management, filtering, grouping, and processing. I use numpy for basic array math. I use toyplot for rendering the charts.


In [2]:
import pandas
import numpy
import toyplot
import toyplot.pdf
import toyplot.png
import toyplot.svg

print('Pandas version:  ', pandas.__version__)
print('Numpy version:   ', numpy.__version__)
print('Toyplot version: ', toyplot.__version__)


Pandas version:   0.19.2
Numpy version:    1.12.0
Toyplot version:  0.14.0-dev

Load in the "auto" dataset. This is a fun collection of data on cars manufactured between 1970 and 1982. The source for this data can be found at https://archive.ics.uci.edu/ml/datasets/Auto+MPG.

The data are stored in a text file as whitespace-separated columns. We use the pandas.read_table() function to parse the file and load it into a pandas DataFrame. The file does not contain a header row, so we need to specify the names of the columns manually.


In [3]:
column_names = ['MPG',
                'Cylinders',
                'Displacement',
                'Horsepower',
                'Weight',
                'Acceleration',
                'Model Year',
                'Origin',
                'Car Name']
data = pandas.read_table('auto-mpg.data',
                         delim_whitespace=True,
                         names=column_names,
                         index_col=False)

For this analysis I am going to group data by the car maker. The make is not directly stored in the data, but all the names start with the make, so extract the first word in that column.


In [4]:
data['Make'] = data['Car Name'].str.split().str.get(0)
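
As a small illustration of this step, here is the same expression applied to a few hand-typed names. The sample names are made up for the example and are not read from the data file.

```python
import pandas

# Hypothetical sample standing in for the 'Car Name' column
sample = pandas.DataFrame({'Car Name': ['ford torino',
                                        'chevy c20',
                                        'vw rabbit custom']})

# Split each name on whitespace and keep the first word as the make
sample['Make'] = sample['Car Name'].str.split().str.get(0)
print(sample['Make'].tolist())
```

Note that the raw first word still includes nicknames such as 'chevy' and 'vw', which is why some cleanup is needed afterward.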

The data has some inconsistencies with the make strings (misspellings or alternate spellings). Do some simple fixes.


In [5]:
# Use .loc for label-based assignment (.ix is deprecated in later pandas releases).
data.loc[data['Make'] == 'chevroelt', 'Make'] = 'chevrolet'
data.loc[data['Make'] == 'chevy', 'Make'] = 'chevrolet'
data.loc[data['Make'] == 'maxda', 'Make'] = 'mazda'
data.loc[data['Make'] == 'mercedes-benz', 'Make'] = 'mercedes'
data.loc[data['Make'] == 'vokswagen', 'Make'] = 'volkswagen'
data.loc[data['Make'] == 'vw', 'Make'] = 'volkswagen'
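
The same fixes could also be written as a single lookup table. The following is only a sketch on a made-up Series; the name `fixes` is introduced here for illustration and does not appear elsewhere in this notebook.

```python
import pandas

# Hypothetical sample of raw make strings
makes = pandas.Series(['chevroelt', 'chevy', 'maxda', 'vw', 'ford'])

# Map each misspelling or alternate spelling to its canonical form
fixes = {'chevroelt': 'chevrolet',
         'chevy': 'chevrolet',
         'maxda': 'mazda',
         'mercedes-benz': 'mercedes',
         'vokswagen': 'volkswagen',
         'vw': 'volkswagen'}

# Series.replace swaps whole values that match a key; others are untouched
cleaned = makes.replace(fixes)
print(cleaned.tolist())
```

Keeping the corrections in one dictionary makes it easier to add another spelling later without repeating the assignment line.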

In this plot we are going to show the average miles per gallon (MPG) rating for each car maker. We can use the pivot_table feature of pandas to get this information from the data. (Excel and other spreadsheets have similar functionality.)


In [6]:
average_mpg_per_make = data.pivot_table(columns='Make',
                                        values='MPG',
                                        aggfunc='mean')
len(average_mpg_per_make.index)


Out[6]:
31
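
One equivalent way to compute a per-make mean, which always yields a Series indexed by make, is a groupby. This is a minimal sketch on toy numbers (the values are invented, not taken from the auto dataset).

```python
import pandas

# Toy data: two makes with invented MPG values
toy = pandas.DataFrame({'Make': ['honda', 'honda', 'ford', 'ford', 'ford'],
                        'MPG': [30.0, 34.0, 18.0, 20.0, 22.0]})

# Group the rows by make and average the MPG column within each group
mean_by_make = toy.groupby('Make')['MPG'].mean()
print(mean_by_make)
print(len(mean_by_make.index))  # number of distinct makes
```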

There are many different makers represented in this data set, but several have only a few cars and are therefore perhaps not a significant sample. Filter out the car makers that have fewer than 10 entries in the data. (Mostly I'm doing this to make these examples fit better, even though it works OK with all the data, too.)


In [7]:
count_mpg_per_make = data.pivot_table(columns='Make',
                                      values='MPG',
                                      aggfunc='count')
filtered_mpg = \
    average_mpg_per_make[count_mpg_per_make >= 10]. \
        sort_values(ascending=False)
filtered_mpg


Out[7]:
Make
honda         33.761538
volkswagen    31.840909
datsun        31.113043
mazda         30.058333
toyota        28.372000
dodge         22.060714
plymouth      21.703226
oldsmobile    21.100000
chevrolet     20.219149
pontiac       20.012500
ford          19.694118
buick         19.182353
mercury       19.118182
amc           18.246429
Name: MPG, dtype: float64
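
The filtering step relies on indexing one Series with a boolean mask built from another Series that shares the same index. Here is a sketch of that idea on toy data, assuming a threshold of 2 instead of 10 so the example stays small; the values are invented.

```python
import pandas

# Toy data: 'saab' has a single entry and should be filtered out
toy = pandas.DataFrame({'Make': ['honda', 'honda', 'honda',
                                 'ford', 'ford', 'saab'],
                        'MPG': [30.0, 32.0, 34.0, 18.0, 22.0, 25.0]})

grouped = toy.groupby('Make')['MPG']
mean_by_make = grouped.mean()
count_by_make = grouped.count()

# Keep makes with at least 2 entries (assumed threshold for this toy case),
# then sort from highest to lowest average MPG
kept = mean_by_make[count_by_make >= 2].sort_values(ascending=False)
print(kept)
```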

Add a column with a car maker index so that we can plot by index. Because the makes with fewer than 10 entries were filtered out above, cars from those makes get no index.


In [8]:
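# range works in both Python 2 and 3 (xrange exists only in Python 2).
make_to_index = pandas.Series(index=filtered_mpg.index,
                              data=range(len(filtered_mpg)))
# map leaves NaN for makes that were filtered out.
data['Make Index'] = data['Make'].map(make_to_index)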
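
To show what the lookup produces, here is a sketch on made-up data using Series.map. The frame `cars` and the make 'saab' are hypothetical; the point is that a make missing from the lookup table ends up with NaN, which dropna() can remove later.

```python
import pandas

# Hypothetical ranking of the makes that survived filtering
ranked = pandas.Series([32.0, 20.0], index=['honda', 'ford'])

# Position of each make in the ranking: honda -> 0, ford -> 1
make_to_index = pandas.Series(range(len(ranked)), index=ranked.index)

# 'saab' is not in the ranking, so its index becomes NaN
cars = pandas.DataFrame({'Make': ['ford', 'saab', 'honda']})
cars['Make Index'] = cars['Make'].map(make_to_index)
print(cars)
```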

Now use toyplot to plot the MPG of every car (that matches our criteria), organized by manufacturer.


In [9]:
canvas = toyplot.Canvas('4in', '2.6in')

axes = canvas.cartesian(bounds=(41,-9,6,-58),
                        ylabel = 'MPG')

axes.scatterplot(data.dropna()['Make Index'],
                 data.dropna()['MPG'],
                 marker='-',
                 size=15,
                 opacity=0.75)

# Label the x axis ticks with the make names. This is a bit harder than it should be.
axes.x.ticks.locator = \
    toyplot.locator.Explicit(labels=filtered_mpg.index)
axes.x.ticks.labels.angle = 45

# It's usually best to make the y-axis 0-based.
axes.y.domain.min = 0


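[Figure: scatterplot of MPG (y axis, 0 to 50) for every car, grouped along the x axis by make: honda, volkswagen, datsun, mazda, toyota, dodge, plymouth, oldsmobile, chevrolet, pontiac, ford, buick, mercury, amc]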

In [10]:
toyplot.pdf.render(canvas, 'Detail.pdf')
toyplot.svg.render(canvas, 'Detail.svg')
toyplot.png.render(canvas, 'Detail.png', scale=5)
