In [1]:
from __future__ import print_function

When analyzing data, I usually use the following three modules. I use pandas for data management, filtering, grouping, and processing. I use numpy for basic array math. I use toyplot for rendering the charts.


In [2]:
import pandas
import numpy
import toyplot
import toyplot.pdf
import toyplot.png
import toyplot.svg

print('Pandas version:  ', pandas.__version__)
print('Numpy version:   ', numpy.__version__)
print('Toyplot version: ', toyplot.__version__)


Pandas version:   0.19.2
Numpy version:    1.12.0
Toyplot version:  0.14.0-dev

Load in the "auto" dataset. This is a fun collection of data on cars manufactured between 1970 and 1982. The source for this data can be found at https://archive.ics.uci.edu/ml/datasets/Auto+MPG.

The data are stored in a text file of whitespace-separated columns. We use the pandas.read_table() function to parse the file and load it into a pandas DataFrame. The file does not contain a header row, so we need to specify the names of the columns manually.


In [3]:
column_names = ['MPG',
                'Cylinders',
                'Displacement',
                'Horsepower',
                'Weight',
                'Acceleration',
                'Model Year',
                'Origin',
                'Car Name']
data = pandas.read_table('auto-mpg.data',
                         delim_whitespace=True,
                         names=column_names,
                         index_col=False)
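
As a minimal sketch of the same parsing step without the file on disk, the snippet below reads two illustrative rows laid out like auto-mpg.data from an in-memory buffer. The rows and the name sample_data are assumptions made for this example. It uses pandas.read_csv with sep=r'\s+', which splits on runs of whitespace in the same way and may be the safer spelling on recent pandas releases, where delim_whitespace is deprecated.

```python
import io
import pandas

# Illustrative rows in the layout of auto-mpg.data: whitespace-separated
# columns, a quoted car name, and '?' marking a missing horsepower value.
sample = io.StringIO(
    '18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"\n'
    '25.0 4 98.00 ?     2046. 19.0 71 1 "ford pinto"\n'
)

column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin', 'Car Name']

# sep=r'\s+' splits on runs of whitespace; a quoted name stays in one field.
sample_data = pandas.read_csv(sample,
                              sep=r'\s+',
                              names=column_names,
                              index_col=False)

# Horsepower is read as text rather than numbers because of the '?' entry.
print(sample_data[['Horsepower', 'Car Name']])
```

Because one entry is '?', the whole Horsepower column comes back as strings, which is why the plotting cells later filter on that marker.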

For this analysis I am going to group data by the car maker. The make is not directly stored in the data, but all the names start with the make, so extract the first word in that column.


In [4]:
data['Make'] = data['Car Name'].str.split().str.get(0)
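
To see what the chained string accessors do, here is a small sketch on a stand-alone Series. The three car names are illustrative values chosen for this example, not a claim about specific rows in the file.

```python
import pandas

# Illustrative car names; the make is always the first word.
names = pandas.Series(['chevrolet chevelle malibu', 'vw rabbit', 'ford pinto'])

# .str.split() turns each name into a list of words,
# and .str.get(0) picks the first word of each list.
makes = names.str.split().str.get(0)

print(makes.tolist())
```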

The data has some inconsistencies with the make strings (misspellings or alternate spellings). Do some simple fixes.


In [5]:
# Use label-based .loc indexing; the older .ix indexer is deprecated and
# no longer exists in current pandas releases.
data.loc[data['Make'] == 'chevroelt', 'Make'] = 'chevrolet'
data.loc[data['Make'] == 'chevy', 'Make'] = 'chevrolet'
data.loc[data['Make'] == 'maxda', 'Make'] = 'mazda'
data.loc[data['Make'] == 'mercedes-benz', 'Make'] = 'mercedes'
data.loc[data['Make'] == 'vokswagen', 'Make'] = 'volkswagen'
data.loc[data['Make'] == 'vw', 'Make'] = 'volkswagen'
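
The one-assignment-per-misspelling fixes above could also be written as a single lookup table. The sketch below is an alternative, not what this notebook does: it passes a dictionary to Series.replace, which swaps whole values that match a key and leaves everything else alone. The names make_fixes, makes, and cleaned are made up for this example.

```python
import pandas

# Misspelling or alternate spelling -> canonical make.
make_fixes = {
    'chevroelt': 'chevrolet',
    'chevy': 'chevrolet',
    'maxda': 'mazda',
    'mercedes-benz': 'mercedes',
    'vokswagen': 'volkswagen',
    'vw': 'volkswagen',
}

# Illustrative makes, some of which need fixing.
makes = pandas.Series(['chevy', 'maxda', 'vw', 'ford'])

# Values that appear as keys are replaced; 'ford' passes through unchanged.
cleaned = makes.replace(make_fixes)

print(cleaned.tolist())
```

Keeping the fixes in one dictionary makes it easy to add another spelling later without adding another line of indexing code.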

Use toyplot to plot horsepower against weight. We should expect a general trend of higher horsepower with higher weight, with some outliers (such as sports cars).

We are using this plot to demonstrate coloring. First, do a simple coloring by country of origin with the default color map. This should give reasonable colors.


In [6]:
canvas = toyplot.Canvas('4in', '2.6in')

axes = canvas.cartesian(bounds=(41,-3,3,-44),
                        xlabel = 'Weight',
                        ylabel = 'Horsepower')

colormap = toyplot.color.CategoricalMap()

# Note that this data has some invalid measurements for Horsepower. Thus, we need
# to filter those rows out. That is what the [data['Horsepower'] != '?'] is for

axes.scatterplot(data['Weight'][data['Horsepower'] != '?'],
                 data['Horsepower'][data['Horsepower'] != '?'],
                 color=(numpy.array(data['Origin'][data['Horsepower'] != '?'])-1,colormap))

# It's usually best to make the y-axis 0-based.
axes.y.domain.min = 0

# Add some labels
axes.text(4700, 125, 'USA')
axes.text(2100, 28, 'Europe')
axes.text(2820, 145, 'Japan')


Out[6]:
<toyplot.mark.Text at 0x110e41d10>
[Figure: scatter plot of Horsepower (0 to 200) against Weight (2000 to 5000), colored by origin, with the text labels USA, Europe, and Japan.]
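
The plotting cell above repeats the comparison against '?' three times. A possible alternative, sketched here on a tiny made-up frame, is to convert the column to numbers once with pandas.to_numeric and drop the rows that fail to convert. The frame contents and the names frame, valid, and color_index are assumptions for this example.

```python
import numpy
import pandas

# Illustrative rows: Horsepower is text and uses '?' for a missing value.
frame = pandas.DataFrame({
    'Weight': [3504.0, 2046.0, 2130.0],
    'Horsepower': ['130.0', '?', '95.00'],
    'Origin': [1, 1, 3],
})

# errors='coerce' turns anything that is not a number (here '?') into NaN.
frame['Horsepower'] = pandas.to_numeric(frame['Horsepower'], errors='coerce')

# Keep only the rows with a usable horsepower measurement.
valid = frame.dropna(subset=['Horsepower'])

# Origin codes start at 1, so subtract 1 to get 0-based color indices.
color_index = numpy.array(valid['Origin']) - 1

print(valid)
print(color_index)
```

With a filtered frame like valid, each argument to the scatter plot can be a plain column lookup instead of a repeated boolean mask.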

In [7]:
toyplot.pdf.render(canvas, 'Colors.pdf')
toyplot.svg.render(canvas, 'Colors.svg')
toyplot.png.render(canvas, 'Colors.png', scale=5)

Repeat the plot, this time colored by make. That produces far too many colors to tell apart. To make matters worse, also choose a bad color palette.

To color by make, we actually need to convert the strings to numbers that toyplot can look up in a linear map. Create that map and make a column of make indices.


In [8]:
unique_makes = data['Make'].unique()
make_index_map = pandas.Series(index=unique_makes,
                               data=range(0, len(unique_makes)))
data['Make Index'] = numpy.array(make_index_map[data['Make']])
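
For comparison, pandas also offers pandas.factorize, which builds the same kind of string-to-integer encoding in one call. This is a sketch of an alternative to the lookup Series above, run on an illustrative list of makes.

```python
import pandas

# Illustrative makes with a repeated value.
makes = pandas.Series(['ford', 'mazda', 'ford', 'volkswagen'])

# codes holds one integer per row; uniques lists the distinct values
# in order of first appearance, so codes index into uniques.
codes, uniques = pandas.factorize(makes)

print(list(codes))
print(list(uniques))
```

The codes are numbered in order of first appearance, the same order that unique() returns, so both approaches should assign each make the same index.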

I am also going to demonstrate a bad set of colors. Toyplot actually cares about good colors, so I have to jump through a few hoops to load up a bad color map.


In [9]:
bad_color_palette = toyplot.color.Palette(
    ['#FF0000', '#FFFF00', '#00FF00',
     '#00FFFF', '#0000FF'])
bad_colormap = toyplot.color.LinearMap(bad_color_palette)
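
To get a feel for what a linear map over this palette produces, the following toyplot-free sketch interpolates between the five palette colors in plain RGB. It is only an illustration under that assumption: toyplot's LinearMap may interpolate differently, and the helper names hex_to_rgb and linear_color are made up here.

```python
def hex_to_rgb(code):
    """Convert a '#RRGGBB' string to a tuple of three integers."""
    code = code.lstrip('#')
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))


def linear_color(palette, t):
    """Return the RGB color at position t in [0, 1] along the palette."""
    stops = [hex_to_rgb(c) for c in palette]
    position = t * (len(stops) - 1)
    # Clamp so that t == 1 falls in the last segment rather than past it.
    lower = min(int(position), len(stops) - 2)
    fraction = position - lower
    return tuple(
        int(round(a + (b - a) * fraction))
        for a, b in zip(stops[lower], stops[lower + 1])
    )


rainbow = ['#FF0000', '#FFFF00', '#00FF00', '#00FFFF', '#0000FF']

# Sample the palette at its five stops.
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(t, linear_color(rainbow, t))
```

Every stop is a fully saturated primary or secondary color, and with dozens of makes spread along the ramp, neighboring makes receive nearly identical hues.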

In [10]:
canvas = toyplot.Canvas('4in', '2.6in')

axes = canvas.cartesian(bounds=(41,-3,3,-44),
                        xlabel = 'Weight (lb)',
                        ylabel = 'Horsepower')

# Note that this data has some invalid measurements for Horsepower. Thus, we need
# to filter those rows out. That is what the [data['Horsepower'] != '?'] is for

axes.scatterplot(data['Weight'][data['Horsepower'] != '?'],
                 data['Horsepower'][data['Horsepower'] != '?'],
                 color=(numpy.array(data['Make Index'][data['Horsepower'] != '?']),
                        bad_colormap))

# It's usually best to make the y-axis 0-based.
axes.y.domain.min = 0


[Figure: scatter plot of Horsepower (0 to 200) against Weight (lb) (2000 to 5000), colored by make using the bad color palette.]

In [11]:
toyplot.pdf.render(canvas, 'Colors_Bad.pdf')
toyplot.svg.render(canvas, 'Colors_Bad.svg')
toyplot.png.render(canvas, 'Colors_Bad.png', scale=5)
