In [1]:
from __future__ import print_function

When analyzing data, I usually use the following three modules. I use pandas for data management, filtering, grouping, and processing. I use numpy for basic array math. I use toyplot for rendering the charts.


In [2]:
import pandas
import numpy
import toyplot
import toyplot.pdf
import toyplot.png
import toyplot.svg

print('Pandas version:  ', pandas.__version__)
print('Numpy version:   ', numpy.__version__)
print('Toyplot version: ', toyplot.__version__)


Pandas version:   0.19.2
Numpy version:    1.12.0
Toyplot version:  0.14.0-dev

Load in the "auto" dataset. This is a fun collection of data on cars manufactured between 1970 and 1982. The source for this data can be found at https://archive.ics.uci.edu/ml/datasets/Auto+MPG.

The data are stored in a text file containing whitespace-separated columns. We use the pandas.read_table() function to parse the file and load it into a pandas DataFrame. The file does not contain a header row, so we need to specify the names of the columns manually.


In [3]:
column_names = ['MPG',
                'Cylinders',
                'Displacement',
                'Horsepower',
                'Weight',
                'Acceleration',
                'Model Year',
                'Origin',
                'Car Name']
data = pandas.read_table('auto-mpg.data',
                         delim_whitespace=True,
                         names=column_names,
                         index_col=False)
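
As a minimal sketch of the parsing step described above, the cell below reads two illustrative rows written in the same header-less, whitespace-delimited layout. The rows in sample_text are a hypothetical stand-in for the real file, and the na_values='?' argument is an optional extra that this notebook does not use: it asks pandas to treat the '?' placeholder as missing, so the Horsepower column comes back numeric.

```python
import io
import pandas

# Hypothetical stand-in for the real file: two rows in the same
# whitespace-delimited layout, with no header row.
sample_text = '''\
18.0   8   307.0   130.0   3504.   12.0   70   1   "chevrolet chevelle malibu"
25.0   4   98.00   ?       2046.   19.0   71   1   "ford pinto"
'''

column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin', 'Car Name']

# sep=r'\s+' splits on runs of whitespace; quoted car names stay in one field.
# na_values='?' (an assumption, not part of the notebook) turns '?' into NaN.
sample = pandas.read_csv(io.StringIO(sample_text),
                         sep=r'\s+',
                         names=column_names,
                         na_values='?')

print(sample.dtypes)
print(sample)
```

With the placeholder mapped to NaN at load time, later cells could drop missing rows with dropna() instead of comparing against '?'.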

Use toyplot to plot the measurements of horsepower vs weight. We should expect horsepower to generally increase with weight, with some outliers (such as sports cars).


In [4]:
canvas = toyplot.Canvas('4in', '2.6in')

axes = canvas.cartesian(bounds=(41,-3,3,-44),
                        xlabel = 'Weight (lb)',
                        ylabel = 'Horsepower')

# Note that this data has some invalid measurements for Horsepower, recorded
# as '?'. We need to filter those rows out, which is what the
# [data['Horsepower'] != '?'] mask is for.

axes.scatterplot(data['Weight'][data['Horsepower'] != '?'],
                 data['Horsepower'][data['Horsepower'] != '?'])

# It's usually best to make the y-axis 0-based.
axes.y.domain.min = 0


[Figure: scatter plot of Horsepower (y-axis) against Weight (lb) (x-axis)]

In [5]:
toyplot.pdf.render(canvas, 'XY_Scatterplot.pdf')
toyplot.svg.render(canvas, 'XY_Scatterplot.svg')
toyplot.png.render(canvas, 'XY_Scatterplot.png', scale=5)
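
The scatter plot cell above repeats the data['Horsepower'] != '?' filter for each column. One possible way to tidy this, sketched here under the assumption that Horsepower was loaded as strings, is to build the boolean mask once and reuse it. The frame below and its values are hypothetical, made up only to keep the example self-contained.

```python
import pandas

# Hypothetical stand-in for the loaded data: Horsepower holds strings because
# of the '?' placeholder used for invalid measurements.
frame = pandas.DataFrame({'Weight': [3504.0, 2046.0, 2130.0],
                          'Horsepower': ['130.0', '?', '95.0']})

# Build the mask once, then apply it to every column that will be plotted.
valid = frame['Horsepower'] != '?'
weight = frame.loc[valid, 'Weight']
horsepower = frame.loc[valid, 'Horsepower'].astype(float)

print(weight.tolist())
print(horsepower.tolist())
```

Converting the filtered column with astype(float) makes the numeric type explicit rather than leaving the conversion to the plotting library.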

Repeat the plot, but with a trend line. This is a bad example because ordering the cars by size is more arbitrary than ordering them by, for example, the year they were manufactured. It makes for a misleading line.


In [6]:
sorted_data = data.sort_values('Weight')

canvas = toyplot.Canvas('4in', '2.6in')

axes = canvas.cartesian(bounds=(41,-3,3,-44),
                        xlabel = 'Weight (lb)',
                        ylabel = 'Horsepower')

axes.plot(sorted_data['Weight'][sorted_data['Horsepower'] != '?'],
          sorted_data['Horsepower'][sorted_data['Horsepower'] != '?'])

# It's usually best to make the y-axis 0-based.
axes.y.domain.min = 0


[Figure: line plot of Horsepower (y-axis) against Weight (lb) (x-axis), with points connected in order of weight]

In [7]:
toyplot.pdf.render(canvas, 'XY_Scatterplot_Trend_Bad.pdf')
toyplot.svg.render(canvas, 'XY_Scatterplot_Trend_Bad.svg')
toyplot.png.render(canvas, 'XY_Scatterplot_Trend_Bad.png', scale=5)
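
For contrast with the connected line above, one common alternative (not something this notebook does) is to summarize the relationship with a least-squares straight line. The sketch below uses numpy.polyfit; the helper name fit_linear_trend and the sample values are hypothetical and only illustrate the idea.

```python
import numpy

def fit_linear_trend(x, y):
    """Return (slope, intercept) of the least-squares line through the points."""
    slope, intercept = numpy.polyfit(numpy.asarray(x, dtype=float),
                                     numpy.asarray(y, dtype=float), 1)
    return slope, intercept

# Hypothetical measurements, not taken from the auto dataset.
weights = numpy.array([2000.0, 2500.0, 3000.0, 3500.0, 4000.0, 4500.0])
horsepower = numpy.array([70.0, 95.0, 88.0, 150.0, 130.0, 190.0])

slope, intercept = fit_linear_trend(weights, horsepower)

# Two end points are enough to draw a straight line across the data range.
trend_x = numpy.array([weights.min(), weights.max()])
trend_y = slope * trend_x + intercept

print(slope, intercept)
```

The two end points in trend_x and trend_y could then be passed to axes.plot() on top of a scatter plot, which shows the overall direction without implying that neighboring cars are connected.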
