In [1]:

    
from __future__ import print_function

When analyzing data, I usually use the following three modules: pandas for data management, filtering, grouping, and processing; numpy for basic array math; and toyplot for rendering the charts.



In [2]:

    
import pandas
import numpy
import toyplot
import toyplot.pdf
import toyplot.png
import toyplot.svg

print('Pandas version:  ', pandas.__version__)
print('Numpy version:   ', numpy.__version__)
print('Toyplot version: ', toyplot.__version__)

Pandas version:   0.19.2
Numpy version:    1.12.0
Toyplot version:  0.14.0-dev

Load in the "auto" dataset. This is a fun collection of data on cars manufactured between 1970 and 1982. The source for this data can be found at https://archive.ics.uci.edu/ml/datasets/Auto+MPG.

The data are stored in a text file as whitespace-separated columns. We use the pandas.read_table() function to parse the file and load it into a pandas DataFrame. The file does not contain a header row, so we need to specify the column names manually.



In [3]:

    
column_names = ['MPG',
                'Cylinders',
                'Displacement',
                'Horsepower',
                'Weight',
                'Acceleration',
                'Model Year',
                'Origin Index',
                'Car Name']
data = pandas.read_table('auto-mpg.data',
                         delim_whitespace=True,
                         names=column_names,
                         index_col=False)
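
As a minimal sketch of the same parsing step, the cell below reads two made-up rows (not taken from auto-mpg.data) from an in-memory string. It assumes sep=r'\s+' as the whitespace separator, since newer pandas releases deprecate the delim_whitespace argument; sample_text and sample are names introduced here for illustration only.

```python
import io
import pandas

# Two made-up rows laid out like the columns described above (illustrative only).
sample_text = (
    '20.0 4 100.0 90.0 2500. 15.0 70 1 "sample car one"\n'
    '30.0 4 90.0 70.0 2000. 16.0 71 3 "sample car two"\n'
)

column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin Index', 'Car Name']

# Passing names= tells pandas there is no header row in the input.
# sep=r'\s+' splits on runs of whitespace; quoted names stay in one field.
sample = pandas.read_csv(io.StringIO(sample_text),
                         sep=r'\s+',
                         names=column_names)

print(sample[['MPG', 'Model Year', 'Car Name']])
```

The quoted car names contain spaces, which is why a plain split on whitespace would not be enough here.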

The Origin Index column indicates the car's country of origin. It takes one of three numeric values, 1, 2, or 3, which indicate USA, Europe, or Japan, respectively. Add an Origin column that holds the country name as a string.



In [4]:

    
country_map = pandas.Series(index=[1,2,3],
                            data=['USA', 'Europe', 'Japan'])
data['Origin'] = numpy.array(country_map[data['Origin Index']])
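
An alternative way to do the same lookup, shown here on a small made-up frame rather than the real data, is Series.map with a plain dictionary. The names toy and country_names are hypothetical and exist only for this sketch.

```python
import pandas

# Hypothetical lookup table: numeric origin code -> country name.
country_names = {1: 'USA', 2: 'Europe', 3: 'Japan'}

# A tiny made-up frame standing in for the real data.
toy = pandas.DataFrame({'Origin Index': [1, 3, 2, 1]})

# map() looks up each code in the dictionary, preserving row order.
toy['Origin'] = toy['Origin Index'].map(country_names)

print(toy)
```

Codes missing from the dictionary would come back as NaN, which can make bad input easier to spot.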

In this plot we show the trend in average miles per gallon (MPG) across successive model years, separated by country of origin. This period saw a significant increase in MPG driven by the U.S. fuel crisis. We can use the pivot_table feature of pandas to compute these averages from the data. (Excel and other spreadsheets have similar functionality.)



In [5]:

    
average_mpg_per_year = data.pivot_table(index='Model Year',
                                        columns='Origin',
                                        values='MPG',
                                        aggfunc='mean')
average_mpg_per_year
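
To make the pivot concrete, here is a small sketch on made-up numbers (not the real dataset). It builds the same kind of year-by-origin table of means and checks it against an equivalent groupby followed by unstack; toy, by_pivot, and by_group are names introduced for illustration.

```python
import pandas

# Made-up rows: two USA cars and one Japanese car in 70, one of each in 71.
toy = pandas.DataFrame({
    'Model Year': [70, 70, 70, 71, 71],
    'Origin': ['USA', 'USA', 'Japan', 'USA', 'Japan'],
    'MPG': [15.0, 17.0, 24.0, 20.0, 30.0],
})

# One row per model year, one column per origin, each cell the mean MPG.
by_pivot = toy.pivot_table(index='Model Year',
                           columns='Origin',
                           values='MPG',
                           aggfunc='mean')

# The same table built by grouping, averaging, and moving Origin to columns.
by_group = toy.groupby(['Model Year', 'Origin'])['MPG'].mean().unstack('Origin')

print(by_pivot)
```

In this toy table the two USA cars from model year 70 average to 16.0, and the groupby version holds the same values.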

Use toyplot to plot the MPG of every car in the dataset, organized by model year and colored by origin.



In [6]:

    
canvas = toyplot.Canvas('4in', '2.6in')

axes = canvas.cartesian(bounds=(41,-1,6,-43),
                        xlabel = 'Model Year',
                        ylabel = 'MPG')

colormap = toyplot.color.CategoricalMap()

axes.scatterplot(data['Model Year'] + 1900 + 0.2*(data['Origin Index']-2),
                 data['MPG'],
                 size=4,
                 opacity=0.75,
                 color=(numpy.array(data['Origin Index'])-1,colormap))

for country in country_map:
    series = average_mpg_per_year[country]
    x = series.index[-1] + 1900
    y = numpy.array(series)[-1]
    axes.text(x, y, country,
              style={"text-anchor":"start",
                     "-toyplot-anchor-shift":"15px"})

# It's usually best to make the y-axis 0-based.
axes.y.domain.min = 0

# Toyplot is sometimes inaccurate in judging the width of labels.
axes.x.domain.max = 1984.2

# The labels can make for odd tick placement.
# Place them manually
axes.x.ticks.locator = \
    toyplot.locator.Explicit([1970,1974,1978,1982])
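
The x coordinate passed to scatterplot above converts the two-digit model year to a calendar year and then nudges each point by origin, so the three countries sit side by side within a year instead of overlapping. The sketch below reproduces only that arithmetic with numpy on made-up values; model_year and origin_index are hypothetical stand-ins for the DataFrame columns.

```python
import numpy

# Made-up values: three cars from model year 70 (one per origin) and one from 82.
model_year = numpy.array([70, 70, 70, 82])
origin_index = numpy.array([1, 2, 3, 1])

# Origin 1 shifts left by 0.2, origin 2 stays centered, origin 3 shifts right by 0.2.
x = model_year + 1900 + 0.2 * (origin_index - 2)

print(x)
```

Because the largest shift is 0.2 of a year, points from neighboring model years remain visually separate.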




In [7]:

    
toyplot.pdf.render(canvas, 'Detail_MultiSeries.pdf')
toyplot.svg.render(canvas, 'Detail_MultiSeries.svg')
toyplot.png.render(canvas, 'Detail_MultiSeries.png', scale=5)

Now use toyplot to plot the same data again, adding a trend line of the average MPG per model year for each country of origin.



In [8]:

    
canvas = toyplot.Canvas('4in', '2.6in')

axes = canvas.cartesian(bounds=(41,-1,6,-43),
                        xlabel = 'Model Year',
                        ylabel = 'MPG')

colormap = toyplot.color.CategoricalMap()

axes.scatterplot(data['Model Year'] + 1900 + 0.2*(data['Origin Index']-2),
                 data['MPG'],
                 size=4,
                 opacity=1.0,
                 color=(numpy.array(data['Origin Index'])-1,colormap))

for column in country_map:
    series = average_mpg_per_year[column]
    x = series.index + 1900
    y = numpy.array(series)
    axes.plot(x, y, opacity=0.5)
    axes.text(x[-1], y[-1], column,
              style={"text-anchor":"start",
                     "-toyplot-anchor-shift":"10px"})

# It's usually best to make the y-axis 0-based.
axes.y.domain.min = 0

# Toyplot is sometimes inaccurate in judging the width of labels.
axes.x.domain.max = 1984.2

# The labels can make for odd tick placement.
# Place them manually
axes.x.ticks.locator = \
    toyplot.locator.Explicit([1970,1974,1978,1982])




In [9]:

    
toyplot.pdf.render(canvas, 'Detail_MultiSeries_Trend.pdf')
toyplot.svg.render(canvas, 'Detail_MultiSeries_Trend.svg')
toyplot.png.render(canvas, 'Detail_MultiSeries_Trend.png', scale=5)



Average MPG by model year and origin (the average_mpg_per_year table from In [5]):

Origin         Europe      Japan        USA
Model Year
70          25.200000  25.500000  15.272727
71          28.750000  29.500000  18.100000
72          22.000000  24.200000  16.277778
73          24.000000  20.000000  15.034483
74          27.000000  29.333333  18.333333
75          24.500000  27.500000  17.550000
76          24.250000  28.000000  19.431818
77          29.250000  27.416667  20.722222
78          24.950000  29.687500  21.772727
79          30.450000  32.950000  23.478261
80          37.288889  35.400000  25.914286
81          31.575000  32.958333  27.530769
82          40.000000  34.888889  29.450000