In [1]:
from __future__ import print_function

When analyzing data, I usually use the following three modules. I use pandas for data management, filtering, grouping, and processing. I use numpy for basic array math. I use toyplot for rendering the charts.


In [2]:
import pandas
import numpy
import toyplot
import toyplot.pdf
import toyplot.png
import toyplot.svg

print('Pandas version:  ', pandas.__version__)
print('Numpy version:   ', numpy.__version__)
print('Toyplot version: ', toyplot.__version__)


Pandas version:   0.19.2
Numpy version:    1.12.0
Toyplot version:  0.14.0-dev

Load in the "auto" dataset. This is a fun collection of data on cars manufactured between 1970 and 1982. The source for this data can be found at https://archive.ics.uci.edu/ml/datasets/Auto+MPG.

The data are stored in a text file containing whitespace-separated columns. We use the pandas.read_table() function to parse the data and load it into a pandas DataFrame. The file does not contain a header row, so we need to specify the column names manually.


In [3]:
column_names = ['MPG',
                'Cylinders',
                'Displacement',
                'Horsepower',
                'Weight',
                'Acceleration',
                'Model Year',
                'Origin',
                'Car Name']
data = pandas.read_table('auto-mpg.data',
                         delim_whitespace=True,
                         names=column_names,
                         index_col=False)
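
To see how the parsing behaves without needing the file on disk, here is a minimal sketch that reads two sample rows from an in-memory string. The rows are laid out like auto-mpg.data but are for illustration only, and the names `sample` and `sample_data` are my own. It uses pandas.read_csv() with sep=r'\s+', which should split on runs of whitespace in the same way as the read_table() call above.

```python
import io
import pandas

# Two sample rows in the same layout as auto-mpg.data (illustration only).
sample = io.StringIO(
    '18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"\n'
    '24.0 4 113.0 95.00 2372. 15.0 70 3 "toyota corona mark ii"\n'
)

column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin', 'Car Name']

# sep=r'\s+' splits on any run of whitespace; the quoted car name
# stays together as a single field.
sample_data = pandas.read_csv(sample, sep=r'\s+', names=column_names)
print(sample_data[['MPG', 'Car Name']])
```

The quotes around the car name are what keep a multi-word name in one column even though the separator is whitespace.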

For this analysis I am going to group data by the car maker. The make is not directly stored in the data, but all the names start with the make, so extract the first word in that column.


In [4]:
data['Make'] = data['Car Name'].str.split().str.get(0)
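
As a quick check of what that expression does, here is a small sketch on a hand-made Series (the three names are just examples I picked). The .str.split() call turns each name into a list of words and .str.get(0) picks the first word from each list.

```python
import pandas

# A few example car names (chosen for illustration).
names = pandas.Series(['chevrolet chevelle malibu', 'vw rabbit', 'ford pinto'])

# Split each name on whitespace, then take the first word of each.
makes = names.str.split().str.get(0)
print(makes.tolist())
```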

The data has some inconsistencies with the make strings (misspellings or alternate spellings). Do some simple fixes.


In [5]:
# Use .loc for label-based assignment (.ix is deprecated in later pandas).
data.loc[data['Make'] == 'chevroelt', 'Make'] = 'chevrolet'
data.loc[data['Make'] == 'chevy', 'Make'] = 'chevrolet'
data.loc[data['Make'] == 'maxda', 'Make'] = 'mazda'
data.loc[data['Make'] == 'mercedes-benz', 'Make'] = 'mercedes'
data.loc[data['Make'] == 'vokswagen', 'Make'] = 'volkswagen'
data.loc[data['Make'] == 'vw', 'Make'] = 'volkswagen'
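
The one-at-a-time fixes above can also be written as a single lookup table. This is only a sketch of an alternative, not what the notebook actually runs: the `corrections` dictionary and the small `makes` Series are names I made up for the example, and Series.replace() with a dictionary swaps whole values that match a key.

```python
import pandas

# A small stand-in for the 'Make' column (illustration only).
makes = pandas.Series(['chevroelt', 'chevy', 'maxda', 'vw', 'ford'])

# Map each misspelling or alternate spelling to the name we want.
corrections = {
    'chevroelt': 'chevrolet',
    'chevy': 'chevrolet',
    'maxda': 'mazda',
    'mercedes-benz': 'mercedes',
    'vokswagen': 'volkswagen',
    'vw': 'volkswagen',
}

# Values that match a key are replaced; everything else is left alone.
cleaned = makes.replace(corrections)
print(cleaned.tolist())
```

Keeping the fixes in one dictionary makes it easier to add another spelling later without repeating the selection logic.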

In this plot we are going to show the average miles per gallon (MPG) rating for each car maker, and to be super cool we are going to order by average MPG. We can use the pivot_table feature of pandas to get this information from the data. (Excel and other spreadsheets have similar functionality.)


In [6]:
average_mpg_per_make = data.pivot_table(columns='Make',
                                        values='MPG',
                                        aggfunc='mean')
len(average_mpg_per_make.index)


Out[6]:
31
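
If pivot tables are unfamiliar, the same per-make average can be sketched with groupby. The tiny table below uses made-up MPG values purely to show the mechanics; `cars` and `average_mpg` are hypothetical names for this example.

```python
import pandas

# Made-up values, just to show the mechanics.
cars = pandas.DataFrame({
    'Make': ['ford', 'ford', 'honda', 'honda', 'honda'],
    'MPG': [18.0, 22.0, 30.0, 33.0, 36.0],
})

# Group the rows by make and average the MPG within each group.
average_mpg = cars.groupby('Make')['MPG'].mean()
print(average_mpg)
```

The result is a Series indexed by make, with one averaged value per make.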

There are many different makers represented in this data set, but several have only a few cars and are therefore perhaps not a significant sample. Filter out the car makers that have fewer than 10 entries in the data. (Mostly I'm doing this to make these examples fit better even though it works OK with all the data, too.)


In [7]:
count_mpg_per_make = data.pivot_table(columns='Make',
                                      values='MPG',
                                      aggfunc='count')
filtered_mpg = average_mpg_per_make[count_mpg_per_make >= 10]
filtered_mpg


Out[7]:
Make
amc           18.246429
buick         19.182353
chevrolet     20.219149
datsun        31.113043
dodge         22.060714
ford          19.694118
honda         33.761538
mazda         30.058333
mercury       19.118182
oldsmobile    21.100000
plymouth      21.703226
pontiac       20.012500
toyota        28.372000
volkswagen    31.840909
Name: MPG, dtype: float64
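
The same mean-and-count filtering can be sketched in one pass with groupby and agg. Again, this uses a tiny made-up table and hypothetical names (`cars`, `stats`, `min_entries`), with the threshold lowered to 2 so the toy data has something to drop.

```python
import pandas

# Made-up values, just to show the mechanics.
cars = pandas.DataFrame({
    'Make': ['ford', 'ford', 'ford', 'honda', 'honda', 'saab'],
    'MPG': [18.0, 20.0, 22.0, 30.0, 36.0, 25.0],
})

# Compute the mean and the number of entries for each make together.
stats = cars.groupby('Make')['MPG'].agg(['mean', 'count'])

# Keep the mean only for makes with enough entries.
min_entries = 2
filtered = stats.loc[stats['count'] >= min_entries, 'mean']
print(filtered)
```

Here saab is dropped because it has a single entry, while ford and honda keep their averages.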

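Now use toyplot to plot these averages as a line (XY trend) plot. This is a poor fit for the data: the makes are unordered categories, so the line connecting them suggests a trend that does not exist.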


In [8]:
canvas = toyplot.Canvas('4in', '2.6in')

axes = canvas.cartesian(bounds=(41,-2,2,-58),
                        ylabel = 'Average MPG')

axes.plot(filtered_mpg)

# Label the x axis on the make. This is a bit harder than it should be.
axes.x.ticks.locator = \
    toyplot.locator.Explicit(labels=filtered_mpg.index)
axes.x.ticks.labels.angle = 45

# It's usually best to make the y-axis 0-based.
axes.y.domain.min = 0


[Figure: line plot of average MPG (y axis, 0 to 30) for each make, in alphabetical order from amc to volkswagen]

In [9]:
toyplot.pdf.render(canvas, 'XY_Trend_Bad.pdf')
toyplot.svg.render(canvas, 'XY_Trend_Bad.svg')
toyplot.png.render(canvas, 'XY_Trend_Bad.png', scale=5)

A better approach for this data is to use a bar chart.


In [10]:
canvas = toyplot.Canvas('4in', '2.6in')

axes = canvas.cartesian(bounds=(41,-2,2,-58),
                        ylabel = 'Average MPG')

axes.bars(filtered_mpg)

# Label the x axis on the make. This is a bit harder than it should be.
axes.x.ticks.locator = \
    toyplot.locator.Explicit(labels=filtered_mpg.index)
axes.x.ticks.labels.angle = 45

# It's usually best to make the y-axis 0-based.
axes.y.domain.min = 0


[Figure: bar chart of average MPG (y axis, 0 to 30) for each make, in alphabetical order from amc to volkswagen]

In [11]:
toyplot.pdf.render(canvas, 'Bar.pdf')
toyplot.svg.render(canvas, 'Bar.svg')
toyplot.png.render(canvas, 'Bar.png', scale=5)

That is good, but the ordering is arbitrary (alphabetical). It would be even better if the bars were sorted by size.


In [12]:
sorted_mpg = filtered_mpg.sort_values(ascending=False)

canvas = toyplot.Canvas('4in', '2.6in')

axes = canvas.cartesian(bounds=(41,-2,2,-58),
                        ylabel = 'Average MPG')

axes.bars(sorted_mpg)

# Label the x axis on the make. This is a bit harder than it should be.
axes.x.ticks.locator = \
    toyplot.locator.Explicit(labels=sorted_mpg.index)
axes.x.ticks.labels.angle = 45

# It's usually best to make the y-axis 0-based.
axes.y.domain.min = 0


[Figure: bar chart of average MPG (y axis, 0 to 30) by make, sorted from highest (honda) to lowest (amc)]

In [13]:
toyplot.pdf.render(canvas, 'Bar_Sorted.pdf')
toyplot.svg.render(canvas, 'Bar_Sorted.svg')
toyplot.png.render(canvas, 'Bar_Sorted.png', scale=5)

Bar charts also afford the ability to change the orientation, which can help with layout, labels, and space utilization.


In [14]:
sorted_mpg = filtered_mpg.sort_values(ascending=True)

canvas = toyplot.Canvas('4in', '2.6in')

axes = canvas.cartesian(bounds=(70,-2,2,-44),
                        xlabel = 'Average MPG')

axes.bars(sorted_mpg,
          along='y')

# Label the y axis on the make. This is a bit harder than it should be.
axes.y.ticks.locator = \
    toyplot.locator.Explicit(labels=sorted_mpg.index)
axes.y.ticks.labels.angle = -90

# It's usually best to make the value axis (x here) 0-based.
axes.x.domain.min = 0


[Figure: horizontal bar chart of average MPG (x axis, 0 to 30) by make, with make labels on the y axis running from amc to honda]

In [15]:
toyplot.pdf.render(canvas, 'Bar_Rotated.pdf')
toyplot.svg.render(canvas, 'Bar_Rotated.svg')
toyplot.png.render(canvas, 'Bar_Rotated.png', scale=5)
