In [1]:
from __future__ import print_function
When analyzing data, I usually use the following three modules. I use pandas for data management, filtering, grouping, and processing. I use numpy for basic array math. I use toyplot for rendering the charts.
In [2]:
import pandas
import numpy
import toyplot
import toyplot.pdf
import toyplot.png
import toyplot.svg
print('Pandas version: ', pandas.__version__)
print('Numpy version: ', numpy.__version__)
print('Toyplot version: ', toyplot.__version__)
Load in the "auto" dataset. This is a fun collection of data on cars manufactured between 1970 and 1982. The source for this data can be found at https://archive.ics.uci.edu/ml/datasets/Auto+MPG.
The data are stored in a text file containing whitespace-separated columns. We use the pandas.read_table() function to parse the data and load it into a pandas DataFrame. The file does not contain a header row, so we need to specify the names of the columns manually.
In [3]:
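column_names = ['MPG',
                'Cylinders',
                'Displacement',
                'Horsepower',
                'Weight',
                'Acceleration',
                'Model Year',
                'Origin',
                'Car Name']
# sep=r'\s+' splits on runs of whitespace; it replaces delim_whitespace=True,
# which is deprecated in recent pandas releases.
data = pandas.read_table('auto-mpg.data',
                         sep=r'\s+',
                         names=column_names,
                         index_col=False)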
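To see what this parsing step does without the file on disk, here is a minimal sketch that reads two rows laid out like the lines of auto-mpg.data from an in-memory string. The fields are split on runs of whitespace (written here as sep=r'\s+'), and the quoted car name stays in one column. The na_values='?' argument is my own addition for illustration, since the data set marks missing horsepower values with a question mark; the notebook itself does not use it.

```python
import io
import pandas

# Two illustrative rows in the same whitespace-delimited layout as auto-mpg.data.
# The '?' in the second row marks a missing horsepower value.
sample_text = (
    '18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"\n'
    '25.0 4 98.00 ? 2046. 19.0 71 1 "ford pinto"\n'
)

column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin', 'Car Name']

# names= supplies the header the file lacks; the quoted car name stays one field.
sample = pandas.read_table(io.StringIO(sample_text),
                           sep=r'\s+',
                           names=column_names,
                           na_values='?',
                           index_col=False)

print(sample[['MPG', 'Horsepower', 'Car Name']])
```

Passing the names explicitly also keeps pandas from treating the first data row as a header.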
For this analysis I am going to group data by the car maker. The make is not directly stored in the data, but all the names start with the make, so extract the first word in that column.
In [4]:
data['Make'] = data['Car Name'].str.split().str.get(0)
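As a small illustration of that chain of string methods, the sketch below applies it to a hand-written Series of car names (the names are just examples). str.split() turns each name into a list of words, and str.get(0) picks the first word of each list.

```python
import pandas

# A few example car names; the first word of each is the make.
names = pandas.Series(['ford pinto', 'vw rabbit', 'chevrolet chevelle malibu'])

# Split each name on whitespace, then take element 0 of each resulting list.
makes = names.str.split().str.get(0)

print(makes.tolist())
```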
The data has some inconsistencies with the make strings (misspellings or alternate spellings). Do some simple fixes.
In [5]:
# Use .loc for label-based assignment; .ix has been removed from pandas.
data.loc[data['Make'] == 'chevroelt', 'Make'] = 'chevrolet'
data.loc[data['Make'] == 'chevy', 'Make'] = 'chevrolet'
data.loc[data['Make'] == 'maxda', 'Make'] = 'mazda'
data.loc[data['Make'] == 'mercedes-benz', 'Make'] = 'mercedes'
data.loc[data['Make'] == 'vokswagen', 'Make'] = 'volkswagen'
data.loc[data['Make'] == 'vw', 'Make'] = 'volkswagen'
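The same clean-up can also be expressed as one lookup table. This is only an alternative sketch, not what the notebook does: it collects the fixes in a dictionary (the name make_fixes is mine) and hands it to Series.replace, which swaps exact matches and leaves every other value alone.

```python
import pandas

# Misspelled or alternate make -> canonical make (same fixes as in the notebook).
make_fixes = {
    'chevroelt': 'chevrolet',
    'chevy': 'chevrolet',
    'maxda': 'mazda',
    'mercedes-benz': 'mercedes',
    'vokswagen': 'volkswagen',
    'vw': 'volkswagen',
}

# A small stand-in for the 'Make' column.
makes = pandas.Series(['chevy', 'maxda', 'vw', 'ford'])

# Values found in the dictionary are replaced; 'ford' passes through unchanged.
cleaned = makes.replace(make_fixes)

print(cleaned.tolist())
```

On the real data this would be applied as data['Make'] = data['Make'].replace(make_fixes).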
In this plot we are going to show the average miles per gallon (MPG) rating for each car maker, and to be super cool we are going to order by average MPG. We can use the pivot_table feature of pandas to get this information from the data. (Excel and other spreadsheets have similar functionality.)
In [6]:
# Pivot with the makes on the index and select the MPG column, so the result
# is a Series indexed by make regardless of the pandas version.
average_mpg_per_make = data.pivot_table(index='Make',
                                        values='MPG',
                                        aggfunc='mean')['MPG']
len(average_mpg_per_make.index)
Out[6]:
There are many different makers represented in this data set, but several have only a few cars and are therefore perhaps not a significant sample. Filter out the car makers that have fewer than 10 entries in the data. (Mostly I'm doing this to make these examples fit better even though it works OK with all the data, too.)
In [7]:
count_mpg_per_make = data.pivot_table(index='Make',
                                      values='MPG',
                                      aggfunc='count')['MPG']
# Keep only the makes with at least 10 entries.
filtered_mpg = average_mpg_per_make[count_mpg_per_make >= 10]
filtered_mpg
Out[7]:
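The mean-and-count filtering above can also be sketched with groupby, shown here as an alternative to pivot_table on a made-up table. The values are invented for illustration, and the threshold of 2 stands in for the 10 used with the real data.

```python
import pandas

# Made-up data: three fords, two fiats, one saab.
toy = pandas.DataFrame({
    'Make': ['ford', 'ford', 'ford', 'fiat', 'fiat', 'saab'],
    'MPG': [20.0, 22.0, 24.0, 30.0, 34.0, 25.0],
})

# One pass gives both the average MPG and the number of entries per make.
summary = toy.groupby('Make')['MPG'].agg(['mean', 'count'])

# Keep the average only for makes with enough entries.
min_entries = 2  # the notebook uses 10 on the full data set
kept = summary.loc[summary['count'] >= min_entries, 'mean']

print(kept)
```

Computing the mean and the count together guarantees that both share the same index, so the boolean mask lines up with the averages.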
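Now use toyplot to plot the average MPG for each of these makes. A first attempt is a simple line plot, which is a poor fit here: the makes are categories, so the line connecting them suggests a trend that does not exist.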
In [8]:
canvas = toyplot.Canvas('4in', '2.6in')
axes = canvas.cartesian(bounds=(41,-2,2,-58),
                        ylabel='Average MPG')
axes.plot(filtered_mpg)
# Label the x axis on the make. This is a bit harder than it should be.
axes.x.ticks.locator = \
    toyplot.locator.Explicit(labels=filtered_mpg.index)
axes.x.ticks.labels.angle = 45
# It's usually best to make the y-axis 0-based.
axes.y.domain.min = 0
In [9]:
toyplot.pdf.render(canvas, 'XY_Trend_Bad.pdf')
toyplot.svg.render(canvas, 'XY_Trend_Bad.svg')
toyplot.png.render(canvas, 'XY_Trend_Bad.png', scale=5)
A better approach for this data is to use a bar chart.
In [10]:
canvas = toyplot.Canvas('4in', '2.6in')
axes = canvas.cartesian(bounds=(41,-2,2,-58),
                        ylabel='Average MPG')
axes.bars(filtered_mpg)
# Label the x axis on the make. This is a bit harder than it should be.
axes.x.ticks.locator = \
    toyplot.locator.Explicit(labels=filtered_mpg.index)
axes.x.ticks.labels.angle = 45
# It's usually best to make the y-axis 0-based.
axes.y.domain.min = 0
In [11]:
toyplot.pdf.render(canvas, 'Bar.pdf')
toyplot.svg.render(canvas, 'Bar.svg')
toyplot.png.render(canvas, 'Bar.png', scale=5)
That is good, but the ordering is arbitrary (alphabetical). It would be even better if the bars were sorted by size.
In [12]:
sorted_mpg = filtered_mpg.sort_values(ascending=False)
canvas = toyplot.Canvas('4in', '2.6in')
axes = canvas.cartesian(bounds=(41,-2,2,-58),
                        ylabel='Average MPG')
axes.bars(sorted_mpg)
# Label the x axis on the make. This is a bit harder than it should be.
axes.x.ticks.locator = \
    toyplot.locator.Explicit(labels=sorted_mpg.index)
axes.x.ticks.labels.angle = 45
# It's usually best to make the y-axis 0-based.
axes.y.domain.min = 0
In [13]:
toyplot.pdf.render(canvas, 'Bar_Sorted.pdf')
toyplot.svg.render(canvas, 'Bar_Sorted.svg')
toyplot.png.render(canvas, 'Bar_Sorted.png', scale=5)
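The sorting step relies on the fact that sort_values() on a Series reorders the index labels together with the values, which is why the tick labels taken from sorted_mpg.index still match the bars. A minimal sketch with made-up values (the labels 'a', 'b', 'c' are placeholders for makes):

```python
import pandas

# Placeholder averages keyed by placeholder makes.
mpg = pandas.Series({'a': 20.0, 'b': 30.0, 'c': 25.0})

# Largest first, as used for the sorted vertical bar chart.
descending = mpg.sort_values(ascending=False)

# Smallest first; the labels still follow their values.
ascending = mpg.sort_values(ascending=True)

print(descending.index.tolist())
print(ascending.index.tolist())
```

Note that sort_values() returns a new Series and leaves the original order untouched.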
Bar charts also afford the ability to change the orientation, which can help with layout, labels, and space utilization.
In [14]:
sorted_mpg = filtered_mpg.sort_values(ascending=True)
canvas = toyplot.Canvas('4in', '2.6in')
axes = canvas.cartesian(bounds=(70,-2,2,-44),
                        xlabel='Average MPG')
axes.bars(sorted_mpg,
          along='y')
# Label the y axis on the make. This is a bit harder than it should be.
axes.y.ticks.locator = \
    toyplot.locator.Explicit(labels=sorted_mpg.index)
axes.y.ticks.labels.angle = -90
# It's usually best to make the value axis (here the x-axis) 0-based.
axes.x.domain.min = 0
In [15]:
toyplot.pdf.render(canvas, 'Bar_Rotated.pdf')
toyplot.svg.render(canvas, 'Bar_Rotated.svg')
toyplot.png.render(canvas, 'Bar_Rotated.png', scale=5)