Introduction to data analysis using machine learning

01. The Dataset

by David Taylor, www.prooffreader.com (blog) www.dtdata.io (hire me!)

For links to more material including a slideshow explaining all this stuff in further detail, please see the front page of this GitHub repo.

This is notebook 1 of 8. The next notebook is: [02. Clustering with K-Means]

Quicklinks: [01] [02] [03] [04] [05] [06] [07] [08]


The dataset is invented. I started with the well-known Wines dataset from http://archive.ics.uci.edu/ml/datasets/Wine, discarded most of its features, altered others, and invented one from scratch.

The dataset is now called fruit. It allows us to compare apples to oranges! (Also apples to pears, since that's the French expression).

The columns are:

  • fruit_id: 1-3, numeric id for:
  • fruit_name: orange, pear and apple, corresponding to fruit_id == 1, 2, and 3, respectively
  • color_id: 1-6, numeric id for:
  • color_name: blue, brown, green, orange, red, yellow, corresponding to color_id == 1-6, respectively.
  • elongatedness: 0-1, continuous. A concept borrowed from the famous seeds dataset (which uses the inverse, compactness). If one were to take a two-dimensional image of the fruit (presumably in a random orientation) and fit the smallest ellipse that encloses it, elongatedness is the length of the ellipse's long axis divided by the length of its short axis, minus 1. An infinitely long line has infinite elongatedness; a perfect circle (or square, for that matter) has an elongatedness of zero. (See the short sketch after this list.)
  • weight: in grams
  • sweetness: in totally fictional units; I just took the (unit-unspecified) values for proline from Wines and fudged them a bit.
  • acidity: same note as sweetness, except the original column was OD280/OD315.
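
To make the definition concrete, here is a minimal sketch of the elongatedness formula; the axis lengths below are made-up illustrative numbers, not values from the dataset:

def elongatedness(long_axis, short_axis):
    """Long axis of the bounding ellipse divided by its short axis, minus 1."""
    return long_axis / short_axis - 1

print(elongatedness(1.0, 1.0))  # a perfect circle -> 0.0
print(elongatedness(1.5, 1.0))  # a moderately elongated silhouette -> 0.5
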
Waitaminute! Blue fruit?

I added some noise to this dataset by pretending the color names were assigned manually, and that a certain number of the people who performed the task had various sorts of color-blindness.

Note that, like the Wines dataset, this one makes it easy to get near-perfect (though, unlike Wines, not totally perfect) classification. I thought the signal-to-noise ratio should be kept relatively high (but, again, not perfect) for beginners.
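
For the curious, here is a hypothetical sketch of how that sort of label noise could be injected (it assumes the dataframe df loaded in the next section; the confusion map and the 10% rate are my own illustrative assumptions, not the actual values used to build fruit.csv):

import numpy as np

rng = np.random.RandomState(42)
# assumed confusions for a few 'color-blind' annotators (illustrative only)
confusable = {'red': 'brown', 'green': 'brown', 'orange': 'yellow'}
noisy = df.color_name.copy()
flip = rng.rand(len(df)) < 0.10  # pretend ~10% of labels came from such annotators
noisy.loc[flip] = noisy.loc[flip].map(lambda c: confusable.get(c, c))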

1. Import libraries and datafile


In [1]:
from __future__ import (absolute_import, division,
                        print_function, unicode_literals)
# I only use Python 3.4.x+, hopefully the above statement will make this notebook
# work in Python 2.7.x

import sys
print(sys.version)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.ensemble import RandomForestClassifier
df = pd.read_csv('fruit.csv')

fruitnames = {1: 'Orange', 2: 'Pear', 3: 'Apple'}
colors = {1: '#e09028', 2: '#55aa33', 3: '#cc3333'}
fruitlist = ['Orange', 'Pear', 'Apple']
# It's a trifle inelegant to use both a list and a dict,
# but fruitlist is zero-indexed and fruitnames is one-indexed.

# df.sort() was removed in pandas 0.20; sort_values() is the modern equivalent
df.sort_values(['sweetness', 'acidity', 'weight', 'elongatedness'], inplace=True)
df.reset_index(drop=True, inplace=True)
df.tail(10)


3.4.2 |Anaconda 2.1.0 (64-bit)| (default, Jan  9 2015, 10:32:40) [MSC v.1600 64 bit (AMD64)]
Out[1]:
     fruit_id fruit_name  color_id color_name  elongatedness  weight  sweetness  acidity
169         1     orange         4     orange           0.08     144       3.58     1290
170         1     orange         5        red           0.11     182       3.58     1295
171         1     orange         4     orange           0.11     144       3.59     1035
172         1     orange         4     orange           0.09     143       3.63     1015
173         2       pear         6     yellow           0.47     123       3.64      380
174         2       pear         6     yellow           0.56     126       3.69      465
175         1     orange         5        red           0.11     189       3.71      780
176         1     orange         4     orange           0.19     144       3.82      845
177         1     orange         5        red           0.09     191       3.92     1065
178         1     orange         2      brown           0.15     152       4.00     1035

2. Explore the dataset

Count the instance labels:


In [2]:
# count the instance labels
for fruit in df.fruit_id.unique():
    print("{} instances of fruit #{}, {}".format(len(df[df.fruit_id == fruit]),
                                                    fruit, fruitnames[fruit]))


49 instances of fruit #3, Apple
71 instances of fruit #2, Pear
59 instances of fruit #1, Orange
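
As an aside, pandas has a built-in shortcut for this sort of label count:

df.fruit_name.value_counts()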

Describe the four numeric features:


In [3]:
df[['elongatedness', 'weight', 'sweetness', 'acidity']].describe()


Out[3]:
       elongatedness      weight   sweetness      acidity
count     179.000000  179.000000  179.000000   179.000000
mean        0.296369  144.340782    2.606034   745.849162
std         0.161922   19.280632    0.712020   314.332206
min         0.020000  105.000000    1.270000   278.000000
25%         0.150000  129.000000    1.925000   501.000000
50%         0.280000  143.000000    2.780000   672.000000
75%         0.430000  156.000000    3.170000   985.000000
max         0.690000  198.000000    4.000000  1680.000000

View a crosstab of colors and fruit. See introductory note if you're confused about the blue pears.


In [4]:
pd.crosstab(df.fruit_name, df.color_name)


Out[4]:
color_name  blue  brown  green  orange  red  yellow
fruit_name
apple          3      1     15       0   16      14
orange         0      8      1      37   13       0
pear           2     12      9       3    2      43
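
If proportions are easier to read than raw counts (e.g., what fraction of pears were labeled yellow), newer versions of pandas (0.18.1+) accept a normalize argument; this one-liner is an optional aside, not part of the original notebook:

pd.crosstab(df.fruit_name, df.color_name, normalize='index').round(2)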

View all pairwise plots of features in a scatterplot matrix:


In [5]:
# pd.scatter_matrix() moved to pd.plotting.scatter_matrix() in pandas 0.20
_ = pd.plotting.scatter_matrix(df, figsize=(14, 14), diagonal='kde', alpha=0.6,
                               color=[colors[x] for x in df.fruit_id])


From the above, it appears sweetness and acidity should make a good pair of features for clustering:


In [6]:
df.plot(kind='scatter', x='sweetness', y='acidity', color='#228888', s=92, alpha=0.3)


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0xa7eb3c8>

... but not too good. Also, the labels do not perfectly correspond to the clusters (see how there are green dots inside the red and orange regions?), which makes this a good dataset for demonstrating classification.


In [7]:
for i in range(3):
    plt.scatter(df[df.fruit_id == i+1].sweetness, df[df.fruit_id == i+1].acidity, 
                s=44, c=[colors[x] for x in list(df[df.fruit_id == i+1].fruit_id)], 
                alpha=0.5, label=fruitnames[i+1])
plt.xlabel('Sweetness')
plt.ylabel('Acidity')
plt.legend()
plt.show()
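
As a teaser for the classification material later in this series, here is a minimal sketch that finally puts the RandomForestClassifier imported at the top to work on just these two features. The hyperparameters are arbitrary, and scoring on the training data is optimistic; proper train/test evaluation comes in a later notebook:

X = df[['sweetness', 'acidity']]
y = df.fruit_id
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print('Training accuracy: {:.2f}'.format(clf.score(X, y)))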


To see the other two continuous numeric variables, let's plot Elongatedness vs. Weight.


In [8]:
for i in range(3):
    plt.scatter(df[df.fruit_id == i+1].weight, df[df.fruit_id == i+1].elongatedness, 
                s=44, c=[colors[x] for x in list(df[df.fruit_id == i+1].fruit_id)], 
                alpha=0.5, label=fruitnames[i+1])
plt.xlabel('Weight')
plt.ylabel('Elongatedness')
plt.legend()
plt.show()


