In [1]:
from __future__ import print_function

When analyzing data, I usually use the following three modules. I use pandas for data management, filtering, grouping, and processing. I use numpy for basic array math. I use toyplot for rendering the charts.


In [2]:
import pandas
import numpy
import toyplot
import toyplot.pdf
import toyplot.png
import toyplot.svg

print('Pandas version:  ', pandas.__version__)
print('Numpy version:   ', numpy.__version__)
print('Toyplot version: ', toyplot.__version__)


Pandas version:   0.19.2
Numpy version:    1.12.0
Toyplot version:  0.14.0-dev

Load in the "auto" dataset. This is a fun collection of data on cars manufactured between 1970 and 1982. The source for this data can be found at https://archive.ics.uci.edu/ml/datasets/Auto+MPG.

The data are stored in a text file of whitespace-separated columns. We use the pandas.read_table() function to parse the file and load it into a pandas DataFrame. The file does not contain a header row, so we need to specify the column names manually.


In [3]:
column_names = ['MPG',
                'Cylinders',
                'Displacement',
                'Horsepower',
                'Weight',
                'Acceleration',
                'Model Year',
                'Origin',
                'Car Name']
data = pandas.read_table('auto-mpg.data',
                         delim_whitespace=True,
                         names=column_names,
                         index_col=False)
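
As a quick check of these parsing options, the sketch below reads two invented rows laid out like the lines of the data file (whitespace-separated fields ending in a quoted car name). It is not the notebook's own cell: it uses pandas.read_csv with sep=r'\s+' as a stand-in for read_table with delim_whitespace=True, and the na_values='?' argument assumes that missing values are written as a question mark, which should be confirmed against the actual file.

```python
import io
import pandas

column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower',
                'Weight', 'Acceleration', 'Model Year', 'Origin', 'Car Name']

# Two invented rows in the same layout as the data file (not real records).
sample_text = (
    u'18.0   8   307.0   130.0   3504.   12.0   70  1  "example car one"\n'
    u'25.0   4   98.00   ?       2046.   19.0   71  2  "example car two"\n'
)

sample = pandas.read_csv(io.StringIO(sample_text),
                         sep=r'\s+',          # any run of whitespace separates fields
                         names=column_names,  # the text has no header row
                         na_values='?')       # assumption: '?' marks a missing value

print(sample[['MPG', 'Horsepower', 'Car Name']])
```

The quoted car names survive as single fields even though they contain spaces, which is why the last column can be parsed at all.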

The Origin column indicates where each car was manufactured. It holds one of three numeric values, 1, 2, or 3, which stand for USA, Europe, and Japan, respectively. Replace the numeric codes in the Origin column with strings naming the region.


In [4]:
country_map = pandas.Series(index=[1,2,3],
                            data=['USA', 'Europe', 'Japan'])
data['Origin'] = numpy.array(country_map[data['Origin']])
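
Another way to express the same replacement, shown here on a small invented Series rather than the real data, is Series.map, which looks up each value in a dictionary. This is only a sketch of the idea; the names origin_codes and origin_names are made up for the example.

```python
import pandas

# Invented codes standing in for the numeric Origin column.
origin_codes = pandas.Series([1, 3, 2, 1])

# Look up each code in a plain dictionary.
origin_names = origin_codes.map({1: 'USA', 2: 'Europe', 3: 'Japan'})

print(list(origin_names))
```

Any code missing from the dictionary would come back as NaN, which makes unexpected values easy to spot.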

In this plot we are going to show the trend of the average miles per gallon (MPG) rating over successive model years, separated by country of origin. This time period saw a significant increase in MPG driven by the U.S. fuel crisis. We can use the pivot_table feature of pandas to get this information from the data. (Excel and other spreadsheets have similar functionality.)


In [5]:
average_mpg_per_year = data.pivot_table(index='Model Year',
                                        columns='Origin',
                                        values='MPG',
                                        aggfunc='mean')
average_mpg_per_year


Out[5]:
Origin Europe Japan USA
Model Year
70 25.200000 25.500000 15.272727
71 28.750000 29.500000 18.100000
72 22.000000 24.200000 16.277778
73 24.000000 20.000000 15.034483
74 27.000000 29.333333 18.333333
75 24.500000 27.500000 17.550000
76 24.250000 28.000000 19.431818
77 29.250000 27.416667 20.722222
78 24.950000 29.687500 21.772727
79 30.450000 32.950000 23.478261
80 37.288889 35.400000 25.914286
81 31.575000 32.958333 27.530769
82 40.000000 34.888889 29.450000
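
To make clear what the pivot table computes, here is a minimal sketch on a tiny invented frame. It compares pivot_table with the equivalent groupby, mean, and unstack steps; the frame toy and its values are made up purely for illustration.

```python
import pandas

# Invented rows: two USA cars and one Japanese car in 1970, one USA car in 1971.
toy = pandas.DataFrame({'Model Year': [70, 70, 70, 71],
                        'Origin': ['USA', 'USA', 'Japan', 'USA'],
                        'MPG': [15.0, 17.0, 25.0, 18.0]})

# One row per year, one column per origin, each cell the mean MPG.
by_pivot = toy.pivot_table(index='Model Year',
                           columns='Origin',
                           values='MPG',
                           aggfunc='mean')

# The same table built step by step.
by_group = toy.groupby(['Model Year', 'Origin'])['MPG'].mean().unstack('Origin')

print(by_pivot)
print(by_group)
```

Cells with no matching rows (Japan in 1971 here) are filled with NaN in both versions.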

In [6]:
average_mpg_per_year.columns


Out[6]:
Index([u'Europe', u'Japan', u'USA'], dtype='object', name=u'Origin')

Now use toyplot to plot this trend on a standard x-y chart.


In [7]:
canvas = toyplot.Canvas('4in', '2.6in')

axes = canvas.cartesian(bounds=(41,-1,6,-43),
                        xlabel = 'Model Year',
                        ylabel = 'Average MPG')

for column in country_map:
    series = average_mpg_per_year[column]
    x = series.index + 1900
    y = numpy.array(series)
    axes.plot(x, y)
    axes.text(x[-1], y[-1], column,
              style={"text-anchor":"start",
                     "-toyplot-anchor-shift":"2px"})

# It's usually best to make the y-axis 0-based.
axes.y.domain.min = 0

# Toyplot is sometimes inaccurate in judging the width of labels.
axes.x.domain.max = 1984

# The labels can make for odd tick placement.
# Place them manually
axes.x.ticks.locator = \
    toyplot.locator.Explicit([1970,1974,1978,1982])


[Figure: line chart of Average MPG (0 to 40) versus Model Year (1970 to 1982), with lines labeled USA, Europe, and Japan.]
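
The same loop appears in most of the plotting cells in this notebook. As a sketch only, the data preparation inside it could be factored into a small generator; plot_series and year_offset are hypothetical names, and the stand-in table uses two rows copied from Out[5].

```python
import numpy
import pandas

def plot_series(table, columns, year_offset=1900):
    """Yield (label, x, y) for each requested column of a pivot table."""
    for column in columns:
        series = table[column]
        x = numpy.asarray(series.index) + year_offset  # e.g. 70 -> 1970
        y = numpy.asarray(series)
        yield column, x, y

# Tiny stand-in for average_mpg_per_year (first two rows of Out[5]).
table = pandas.DataFrame({'USA': [15.272727, 18.1],
                          'Europe': [25.2, 28.75]},
                         columns=['USA', 'Europe'],
                         index=pandas.Index([70, 71], name='Model Year'))

for label, x, y in plot_series(table, ['USA', 'Europe']):
    # x[-1], y[-1] is where the text label would be anchored.
    print('{0}: last point at ({1}, {2})'.format(label, x[-1], y[-1]))
```

Each plotting cell would then only need to call axes.plot and axes.text on the yielded values.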

In [8]:
toyplot.pdf.render(canvas, 'MultiSeries.pdf')
toyplot.svg.render(canvas, 'MultiSeries.svg')
toyplot.png.render(canvas, 'MultiSeries.png', scale=5)

For the talk, I want to compare this to using a 3D plot. Toyplot does not yet have such silly plot capabilities, so write the results of this pivot table out to a CSV file that we can easily load into Excel.


In [9]:
average_mpg_per_year.to_csv('auto-mpg-origin-year.csv')
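
A quick way to confirm that the CSV file carries everything a spreadsheet needs is to read it back. The sketch below does that round trip with a small stand-in table written to a temporary directory; the file name pivot-example.csv is made up, and the values are two rows copied from Out[5].

```python
import os
import tempfile
import pandas

# Stand-in for average_mpg_per_year (first two rows of Out[5]).
table = pandas.DataFrame({'Europe': [25.2, 28.75], 'USA': [15.272727, 18.1]},
                         columns=['Europe', 'USA'],
                         index=pandas.Index([70, 71], name='Model Year'))

# Write to a throwaway location rather than next to the notebook.
path = os.path.join(tempfile.mkdtemp(), 'pivot-example.csv')
table.to_csv(path)

# The index name becomes the first header field, so it can be restored by name.
restored = pandas.read_csv(path, index_col='Model Year')
print(restored)
```

The header row written by to_csv starts with the index name, so Excel sees Model Year as an ordinary first column.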

In one of my counterexamples, I remind the audience to make colors consistent. Make a plot with inconsistent colors by plotting the countries in a different order.


In [10]:
canvas = toyplot.Canvas('4in', '2.6in')

axes = canvas.cartesian(bounds=(41,-1,6,-43),
                        xlabel = 'Model Year',
                        ylabel = 'Average MPG')

for column in ['Europe', 'Japan', 'USA']:
    series = average_mpg_per_year[column]
    x = series.index + 1900
    y = numpy.array(series)
    axes.plot(x, y)
    axes.text(x[-1], y[-1], column,
              style={"text-anchor":"start",
                     "-toyplot-anchor-shift":"2px"})

# It's usually best to make the y-axis 0-based.
axes.y.domain.min = 0

# Toyplot is sometimes inaccurate in judging the width of labels.
axes.x.domain.max = 1984

# The labels can make for odd tick placement.
# Place them manually
axes.x.ticks.locator = \
    toyplot.locator.Explicit([1970,1974,1978,1982])


[Figure: line chart of Average MPG (0 to 40) versus Model Year (1970 to 1982), with lines labeled Europe, Japan, and USA.]
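
The colors differ from the earlier chart only because the series are plotted in a different order; presumably each plot call takes the next color from the default palette. One way to guard against that, sketched below, is to tie each country to a fixed color and look it up by name. The hex values are arbitrary placeholders, colors_for is a hypothetical helper, and passing the result through a color argument to axes.plot is an assumption to check against the Toyplot documentation.

```python
# Hypothetical fixed palette: placeholder hex values, not Toyplot defaults.
country_colors = {'USA': '#1b9e77', 'Europe': '#d95f02', 'Japan': '#7570b3'}

def colors_for(columns, palette=country_colors):
    """Return one color per column, independent of plotting order."""
    return [palette[column] for column in columns]

first_order = ['USA', 'Europe', 'Japan']
second_order = ['Europe', 'Japan', 'USA']

# Pair each country with its color under both orderings.
first_assignment = dict(zip(first_order, colors_for(first_order)))
second_assignment = dict(zip(second_order, colors_for(second_order)))

# Each country keeps its color no matter where it appears in the loop.
print(first_assignment == second_assignment)
```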

In [11]:
toyplot.pdf.render(canvas, 'MultiSeries_Inconsistent_Colors.pdf')
toyplot.svg.render(canvas, 'MultiSeries_Inconsistent_Colors.svg')
toyplot.png.render(canvas, 'MultiSeries_Inconsistent_Colors.png', scale=5)

I make the point that it is a bad idea to clutter up the canvas with non-data items like grid lines. Create a counterexample that has lots of distracting lines.


In [12]:
canvas = toyplot.Canvas('4in', '2.6in')

axes = canvas.cartesian(bounds=(41,-1,6,-43),
                        xlabel = 'Model Year',
                        ylabel = 'Average MPG')
    
# Create some grid lines. (Not a great idea.)
axes.hlines(range(0,41,5), color='black')
axes.vlines(range(1970,1983), color='black')

for column in country_map:
    series = average_mpg_per_year[column]
    x = series.index + 1900
    y = numpy.array(series)
    axes.plot(x, y)
    axes.text(x[-1], y[-1], column,
              style={"text-anchor":"start",
                     "-toyplot-anchor-shift":"2px"})

# It's usually best to make the y-axis 0-based.
axes.y.domain.min = 0

# Toyplot is sometimes inaccurate in judging the width of labels.
axes.x.domain.max = 1984

# The labels can make for odd tick placement.
# Place them manually
axes.x.ticks.locator = \
    toyplot.locator.Explicit([1970,1974,1978,1982])


[Figure: line chart of Average MPG (0 to 40) versus Model Year (1970 to 1982), with lines labeled USA, Europe, and Japan.]

In [13]:
toyplot.pdf.render(canvas, 'MultiSeries_Grid_Dark.pdf')
toyplot.svg.render(canvas, 'MultiSeries_Grid_Dark.svg')
toyplot.png.render(canvas, 'MultiSeries_Grid_Dark.png', scale=5)

If you really want gridlines, you should make them very subtle so they don't interfere with the actual data.


In [14]:
canvas = toyplot.Canvas('4in', '2.6in')

axes = canvas.cartesian(bounds=(41,-1,6,-43),
                        xlabel = 'Model Year',
                        ylabel = 'Average MPG')
    
# Create subtle grid lines in a light color.
axes.hlines(range(0,41,5), color='lightgray')
axes.vlines(range(1970,1983), color='lightgray')

for column in country_map:
    series = average_mpg_per_year[column]
    x = series.index + 1900
    y = numpy.array(series)
    axes.plot(x, y)
    axes.text(x[-1], y[-1], column,
              style={"text-anchor":"start",
                     "-toyplot-anchor-shift":"2px"})

# It's usually best to make the y-axis 0-based.
axes.y.domain.min = 0

axes.x.domain.max = 1984

# The labels can make for odd tick placement.
# Place them manually
axes.x.ticks.locator = \
    toyplot.locator.Explicit([1970,1974,1978,1982])


[Figure: line chart of Average MPG (0 to 40) versus Model Year (1970 to 1982), with lines labeled USA, Europe, and Japan.]

In [15]:
toyplot.pdf.render(canvas, 'MultiSeries_Grid_Light.pdf')
toyplot.svg.render(canvas, 'MultiSeries_Grid_Light.svg')
toyplot.png.render(canvas, 'MultiSeries_Grid_Light.png', scale=5)

Frankly, vertical gridlines are usually not all that necessary, and removing them reduces clutter. It also helps not to go overboard on horizontal lines.
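
To put rough numbers on the clutter, the sketch below counts the grid lines produced by the positions used in the dense grid cells and by the sparser horizontal-only version. The variable names are made up for the example.

```python
# Positions used for the dense grid: every 5 MPG and every model year.
dense_horizontal = list(range(0, 41, 5))
dense_vertical = list(range(1970, 1983))

# Sparser alternative: every 10 MPG and no vertical lines at all.
sparse_horizontal = list(range(0, 41, 10))

dense_count = len(dense_horizontal) + len(dense_vertical)
sparse_count = len(sparse_horizontal)

print('dense grid lines: {0}'.format(dense_count))
print('sparse grid lines: {0}'.format(sparse_count))
```

Going from 22 lines to 5 leaves far less ink competing with the three data series.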


In [16]:
canvas = toyplot.Canvas('4in', '2.6in')

axes = canvas.cartesian(bounds=(41,-1,6,-43),
                        xlabel = 'Model Year',
                        ylabel = 'Average MPG')
    
# Keep only a few light horizontal grid lines.
axes.hlines(range(0,41,10), color='lightgray')

for column in country_map:
    series = average_mpg_per_year[column]
    x = series.index + 1900
    y = numpy.array(series)
    axes.plot(x, y)
    axes.text(x[-1], y[-1], column,
              style={"text-anchor":"start",
                     "-toyplot-anchor-shift":"2px"})

# It's usually best to make the y-axis 0-based.
axes.y.domain.min = 0

axes.x.domain.max = 1984

# The labels can make for odd tick placement.
# Place them manually
axes.x.ticks.locator = \
    toyplot.locator.Explicit([1970,1974,1978,1982])


[Figure: line chart of Average MPG (0 to 40) versus Model Year (1970 to 1982), with lines labeled USA, Europe, and Japan.]

In [17]:
toyplot.pdf.render(canvas, 'MultiSeries_Grid_Light_Fewer.pdf')
toyplot.svg.render(canvas, 'MultiSeries_Grid_Light_Fewer.svg')
toyplot.png.render(canvas, 'MultiSeries_Grid_Light_Fewer.png', scale=5)

I personally find grid lines a bit overrated. Don't be afraid to leave them out entirely, as in the first example.

Another pet peeve of mine is legends. I hate them. They are stupid and only exist because those who make plots are too lazy to place labels well (and because that is hard). But if you use a legend, at least make sure the order of its entries matches the order of the data.


In [18]:
canvas = toyplot.Canvas('4in', '2.6in')

axes = canvas.cartesian(bounds=(41,-11,6,-43),
                        xlabel = 'Model Year',
                        ylabel = 'Average MPG')

marks = {}
for column in country_map:
    series = average_mpg_per_year[column]
    x = series.index + 1900
    y = numpy.array(series)
    marks[column] = axes.plot(x, y)

# It's usually best to make the y-axis 0-based.
axes.y.domain.min = 0

# The labels can make for odd tick placement.
# Place them manually
axes.x.ticks.locator = \
    toyplot.locator.Explicit([1970,1974,1978,1982])
    
canvas.legend([('USA', marks['USA']),
               ('Europe', marks['Europe']),
               ('Japan', marks['Japan'])],
              rect=('-1in', '-1.25in', '1in', '0.75in'))


Out[18]:
<toyplot.mark.Legend at 0x116941c50>
[Figure: line chart of Average MPG (0 to 40) versus Model Year (1970 to 1982), with a legend listing USA, Europe, and Japan.]

In [19]:
toyplot.pdf.render(canvas, 'Legend_Backward.pdf')
toyplot.svg.render(canvas, 'Legend_Backward.svg')
toyplot.png.render(canvas, 'Legend_Backward.png', scale=5)

Do it again, but at least order the legend correctly.


In [20]:
canvas = toyplot.Canvas('4in', '2.6in')

axes = canvas.cartesian(bounds=(41,-11,6,-43),
                        xlabel = 'Model Year',
                        ylabel = 'Average MPG')

marks = {}
for column in country_map:
    series = average_mpg_per_year[column]
    x = series.index + 1900
    y = numpy.array(series)
    marks[column] = axes.plot(x, y)

# It's usually best to make the y-axis 0-based.
axes.y.domain.min = 0

# The labels can make for odd tick placement.
# Place them manually
axes.x.ticks.locator = \
    toyplot.locator.Explicit([1970,1974,1978,1982])
    
canvas.legend([('Europe', marks['Europe']),
               ('Japan', marks['Japan']),
               ('USA', marks['USA'])],
              rect=('-1in', '-1.25in', '1in', '0.75in'))


Out[20]:
<toyplot.mark.Legend at 0x116830b90>
[Figure: line chart of Average MPG (0 to 40) versus Model Year (1970 to 1982), with a legend listing Europe, Japan, and USA.]
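
The legend order used in this cell (Europe, Japan, USA) matches the top-to-bottom order of the lines at the right edge of the chart. A sketch of how that order could be derived instead of typed by hand, assuming the goal is to follow the final (1982) values from the pivot table in Out[5]:

```python
import pandas

# Final-year (1982) averages copied from the pivot table output.
final_values = pandas.Series({'USA': 29.45,
                              'Europe': 40.0,
                              'Japan': 34.888889})

# Highest line first, so the legend reads top to bottom like the chart.
legend_order = list(final_values.sort_values(ascending=False).index)

print(legend_order)
```

With the real table, the same Series could probably be taken from the last row, for example average_mpg_per_year.iloc[-1].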

In [21]:
toyplot.pdf.render(canvas, 'Legend_OK.pdf')
toyplot.svg.render(canvas, 'Legend_OK.svg')
toyplot.png.render(canvas, 'Legend_OK.png', scale=5)
