In [1]:
from __future__ import print_function
When analyzing data, I usually use the following three modules. I use pandas for data management, filtering, grouping, and processing. I use numpy for basic array math. I use toyplot for rendering the charts.
In [2]:
import pandas
import numpy
import toyplot
import toyplot.pdf
import toyplot.png
import toyplot.svg
print('Pandas version: ', pandas.__version__)
print('Numpy version: ', numpy.__version__)
print('Toyplot version: ', toyplot.__version__)
Load in the "auto" dataset. This is a fun collection of data on cars manufactured between 1970 and 1982. The source for this data can be found at https://archive.ics.uci.edu/ml/datasets/Auto+MPG.
The data are stored in a whitespace-delimited text file. We use the pandas.read_table() function to parse the file and load it into a pandas DataFrame. The file does not contain a header row, so we need to specify the names of the columns manually.
In [3]:
column_names = ['MPG',
                'Cylinders',
                'Displacement',
                'Horsepower',
                'Weight',
                'Acceleration',
                'Model Year',
                'Origin',
                'Car Name']
# sep=r'\s+' splits on runs of whitespace. It replaces delim_whitespace=True,
# which is deprecated as of pandas 2.2.
data = pandas.read_table('auto-mpg.data',
                         sep=r'\s+',
                         names=column_names,
                         index_col=False)
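Because the file has no header row and records missing horsepower values as '?', it can help to check how pandas parsed each column. The following is a minimal sketch, not part of the original analysis. It parses two illustrative rows, laid out like auto-mpg.data, from an in-memory string, and passes na_values='?' to pandas.read_csv() so that '?' becomes NaN and Horsepower is read as a number. The sample rows and the names sample_text and sample are assumptions made for illustration.

```python
import io
import pandas

column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin', 'Car Name']

# Two illustrative rows in the layout of auto-mpg.data (assumed sample values).
sample_text = '''18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"
25.0 4 98.00 ? 2046. 19.0 71 1 "ford pinto"
'''

# na_values='?' tells pandas to treat '?' as a missing value (NaN).
sample = pandas.read_csv(io.StringIO(sample_text),
                         sep=r'\s+',
                         names=column_names,
                         na_values='?')

# Inspect the parsed column types and count the missing measurements.
print(sample.dtypes)
print(sample['Horsepower'].isna().sum(), 'row(s) with missing horsepower')
```

With this option the Horsepower column should come out numeric, so later filtering can use isna() or dropna() instead of comparing against the string '?'.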
Use toyplot to plot the measurements of horsepower vs. weight. We should expect horsepower to generally increase with weight, with some outliers (such as sports cars).
In [4]:
canvas = toyplot.Canvas('4in', '2.6in')
axes = canvas.cartesian(bounds=(41,-3,3,-44),
                        xlabel = 'Weight (lb)',
                        ylabel = 'Horsepower')
# Note that this data has some invalid measurements for Horsepower, recorded
# as '?'. Thus, we need to filter those rows out with a boolean mask.
valid = data['Horsepower'] != '?'
axes.scatterplot(data['Weight'][valid],
                 data['Horsepower'][valid])
# It's usually best to make the y-axis 0-based.
axes.y.domain.min = 0
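The '?' entries also mean that pandas stores the Horsepower column as text rather than as numbers. As an alternative to the string comparison above, one could convert the column with pandas.to_numeric(errors='coerce'), which turns unparseable entries into NaN, and then drop those rows. Here is a minimal sketch on a small made-up frame. The values and the names example, horsepower, and cleaned are hypothetical.

```python
import pandas

# Hypothetical stand-in for the loaded data; '?' marks a missing measurement.
example = pandas.DataFrame({
    'Weight': [3504.0, 2046.0, 2372.0],
    'Horsepower': ['130.0', '?', '95.0'],
})

# Convert text to numbers; anything unparseable (such as '?') becomes NaN.
horsepower = pandas.to_numeric(example['Horsepower'], errors='coerce')

# Keep only the rows that have a valid horsepower measurement.
cleaned = example.assign(Horsepower=horsepower).dropna(subset=['Horsepower'])

print(cleaned)
```

The cleaned frame holds numeric horsepower values, so both columns can be plotted without repeating the filter.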
In [5]:
toyplot.pdf.render(canvas, 'XY_Scatterplot.pdf')
toyplot.svg.render(canvas, 'XY_Scatterplot.svg')
toyplot.png.render(canvas, 'XY_Scatterplot.png', scale=5)
Repeat the plot, but with a trend line. This is a bad example because the ordering of the cars by size is more arbitrary than, for example, the year they were manufactured. The result is a misleading line.
In [6]:
sorted_data = data.sort_values('Weight')
# As before, skip the rows where Horsepower is recorded as '?'.
valid = sorted_data['Horsepower'] != '?'
canvas = toyplot.Canvas('4in', '2.6in')
axes = canvas.cartesian(bounds=(41,-3,3,-44),
                        xlabel = 'Weight (lb)',
                        ylabel = 'Horsepower')
axes.plot(sorted_data['Weight'][valid],
          sorted_data['Horsepower'][valid])
# It's usually best to make the y-axis 0-based.
axes.y.domain.min = 0
In [7]:
toyplot.pdf.render(canvas, 'XY_Scatterplot_Trend_Bad.pdf')
toyplot.svg.render(canvas, 'XY_Scatterplot_Trend_Bad.svg')
toyplot.png.render(canvas, 'XY_Scatterplot_Trend_Bad.png', scale=5)
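If a trend line is wanted, one alternative to connecting the sorted points is to fit a straight line by least squares and draw that instead. The sketch below is not the approach used in this notebook. It uses numpy.polyfit() on made-up weight and horsepower values, and the helper fit_line is a hypothetical name introduced for illustration.

```python
import numpy

def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through the points."""
    slope, intercept = numpy.polyfit(x, y, deg=1)
    return slope, intercept

# Made-up measurements for illustration only.
weight = numpy.array([2000.0, 2500.0, 3000.0, 3500.0, 4000.0])
horsepower = numpy.array([70.0, 90.0, 105.0, 130.0, 150.0])

slope, intercept = fit_line(weight, horsepower)

# Evaluate the fitted line at the two ends of the weight range; these two
# points are enough to draw the line on a plot.
trend_x = numpy.array([weight.min(), weight.max()])
trend_y = slope * trend_x + intercept

print('slope: %.4f hp/lb, intercept: %.2f hp' % (slope, intercept))
```

The two end points in trend_x and trend_y could then be passed to axes.plot() on the same axes as the scatterplot.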