What is a dataset?

A dataset is a collection of information (or data) that can be used by a computer. A dataset typically has some number of examples, where each example has features associated with it. Some datasets also include labels, each of which identifies a piece of information about an example that is of particular interest.

What is an example?

An example is a single element of a dataset, typically a row (similar to a row in a table). Multiple examples are used to generalize trends about the dataset as a whole. When predicting the list price of a house, each house would be considered a single example.

Examples are often referred to with the letter $x$.

What is a feature?

A feature is a measurable characteristic that describes an example in a dataset. Features make up the information that a computer can use to learn and make predictions. If your examples are houses, your features might be the square footage, the number of bedrooms, or the number of bathrooms. Some features are more useful than others. When predicting the list price of a house, the number of bedrooms is a useful feature while the number of floorboards is not, even though both describe the house.

A feature is sometimes written as a single element of an example, $x_i$.

What is a label?

A label identifies a piece of information about an example that is of particular interest. In machine learning, the label is the information we want the computer to learn to predict. In our housing example, the label would be the list price of the house.

Labels can be continuous (e.g. price, length, width) or categorical (e.g. color). They are typically denoted by the letter $y$.
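
To make the notation concrete, here is a minimal sketch of a tiny housing dataset stored as NumPy arrays. The three houses, their feature values, and their prices are made up for illustration; only the layout (rows as examples, columns as features, one label per example) follows the definitions above.

```python
import numpy as np

# Hypothetical housing data: each row is one example (a house),
# and each column is one feature.
feature_names = ['square footage', 'bedrooms', 'bathrooms']
X = np.array([
    [1400, 3, 2],   # example 0
    [2100, 4, 3],   # example 1
    [850, 2, 1],    # example 2
])

# One label per example: the (made-up) list price in dollars.
y = np.array([240000, 365000, 150000])

x = X[1]        # a single example: all features of house 1
x_i = X[1, 1]   # a single feature of that example: its number of bedrooms

print('Dataset shape (examples, features):', X.shape)
print('Example x:', x)
print('Feature x_i (' + feature_names[1] + '):', x_i)
print('Label y for this example:', y[1])
```

Storing the features as a two-dimensional array and the labels as a separate one-dimensional array is the same layout the Iris dataset uses.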

The Iris Dataset

Here, we use the Iris dataset, available through scikit-learn. Scikit-learn's explanation of the dataset is here.

This dataset contains information on three species of iris flowers (Setosa, Versicolour, and Virginica).

Images: Iris Setosa, Iris Versicolour, and Iris Virginica.

Each example has four features (or measurements): sepal length, sepal width, petal length, and petal width. All measurements are in cm.

Figure: petal and sepal of a primrose plant (from Wikipedia).

Examples

The dataset consists of 150 examples, 50 from each species of iris.

Features

The features are the columns of the dataset. In order from left to right (column indices 0-3) they are: sepal length, sepal width, petal length, and petal width.

Our goal

The goal, for this dataset, is to train a computer to predict the species of a new iris plant, given only the measured length and width of its sepal and petal.

Setup

Tell matplotlib to display figures in the notebook. Then import numpy (for numerical data), pyplot (for plotting figures), ListedColormap (for plot colors), datasets (for loading the Iris data from scikit-learn), and Rectangle (for drawing legend patches).

Also create the color map used to color the plotted data, and "labelList", a list of colored rectangles to use in plot legends.


In [ ]:
# Print figures in the notebook
%matplotlib inline 

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets # Import datasets from scikit-learn

# Import patch for drawing rectangles in the legend
from matplotlib.patches import Rectangle

# Create color maps
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

# Create a legend for the colors, using rectangles for the corresponding colormap colors
labelList = []
for color in cmap_bold.colors:
    labelList.append(Rectangle((0, 0), 1, 1, fc=color))

Import the dataset

Import the dataset and store it in a variable called iris. This object behaves like a Python dictionary, with keys that include: ['DESCR', 'target_names', 'target', 'data', 'feature_names']

The features are stored in iris.data, where each row is an example from a single flower and each column is a single feature. The feature names are stored in iris.feature_names. Labels are stored as the numbers 0, 1, or 2 in iris.target, and the names of these labels are in iris.target_names.


In [ ]:
# Import some data to play with
iris = datasets.load_iris()

# List the data keys
print('Keys: ' + str(iris.keys()))
print('Label names: ' + str(iris.target_names))
print('Feature names: ' + str(iris.feature_names))
print('')

# Store the labels (y), label names, features (X), and feature names
y = iris.target       # Labels are stored in y as numbers
labelNames = iris.target_names # Species names corresponding to labels 0, 1, and 2
X = iris.data
featureNames = iris.feature_names

# Show the first five examples (rows 0 through 4)
print(X[:5, :])
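
As a quick check of the structure described above, the following sketch reloads the dataset and confirms the number of examples, the number of features, and how many examples carry each label. It uses only load_iris() and NumPy's bincount(); the variable names are chosen for illustration.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Rows are examples, columns are features
n_examples, n_features = X.shape
print('Examples: ' + str(n_examples) + ', features: ' + str(n_features))

# Count how many examples carry each label (0, 1, or 2)
counts = np.bincount(y)
for name, count in zip(iris.target_names, counts):
    print(str(name) + ': ' + str(count) + ' examples')

# Column 2 holds the petal length of every example
petal_length = X[:, 2]
print('Mean petal length (cm): ' + str(petal_length.mean()))
```

Indexing a column with X[:, k] selects one feature across all examples, while X[k] selects all features of one example.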

Visualizing the data

Visualizing the data can help us better understand the data and make use of it. The following block of code will create a plot of sepal length (x-axis) vs sepal width (y-axis). The colors of the datapoints correspond to the labeled species of iris for that example.

After plotting, look at the data. What do you notice about the way it is arranged?


In [ ]:
# Plot the data

# Sepal length and width
X_sepal = X[:,:2]
# Get the minimum and maximum values with an additional 0.5 border
x_min, x_max = X_sepal[:, 0].min() - .5, X_sepal[:, 0].max() + .5
y_min, y_max = X_sepal[:, 1].min() - .5, X_sepal[:, 1].max() + .5

plt.figure(figsize=(8, 6))

# Plot all examples, colored by their species label
plt.scatter(X_sepal[:, 0], X_sepal[:, 1], c=y, cmap=cmap_bold)
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')
plt.title('Sepal width vs length')

# Set the plot limits
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)

plt.legend(labelList, labelNames)

plt.show()

Make your own plot

Below, try making your own plots. First, modify the previous code to create a similar plot, showing the petal width vs the petal length. You can start by copying and pasting the previous block of code to the cell below, and modifying it to work.

How is the data arranged differently? Do you think these additional features would be helpful in determining which species of iris a new plant belongs to?

What about plotting other feature combinations, like petal length vs sepal length?

Once you've plotted the data several different ways, think about how you would predict the species of a new iris plant, given only the length and width of its sepals and petals.


In [ ]:
# Put your code here!

# Plot the data

# Petal length and width
X_petal = X[:,2:]
# Get the minimum and maximum values with an additional 0.5 border
x_min, x_max = X_petal[:, 0].min() - .5, X_petal[:, 0].max() + .5
y_min, y_max = X_petal[:, 1].min() - .5, X_petal[:, 1].max() + .5

plt.figure(figsize=(8, 6))

# Plot all examples, colored by their species label
plt.scatter(X_petal[:, 0], X_petal[:, 1], c=y, cmap=cmap_bold)
plt.xlabel('Petal length (cm)')
plt.ylabel('Petal width (cm)')
plt.title('Petal width vs length')

# Set the plot limits
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)

plt.legend(labelList, labelNames)

plt.show()
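
To try other feature combinations without copying the whole cell each time, one option is to wrap the plotting steps in a small helper. The function below is a sketch, not part of the original notebook: the name plot_feature_pair and its arguments are hypothetical, and it plots each species in its own scatter call so the legend is built from the plotted data rather than from separate rectangles.

```python
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target


def plot_feature_pair(i, j):
    """Scatter feature column i (x-axis) against column j (y-axis), colored by species."""
    fig, ax = plt.subplots(figsize=(8, 6))
    colors = ['#FF0000', '#00FF00', '#0000FF']
    # Plot one species at a time so each gets its own legend entry
    for label, name in enumerate(iris.target_names):
        mask = (y == label)
        ax.scatter(X[mask, i], X[mask, j], color=colors[label], label=str(name))
    ax.set_xlabel(iris.feature_names[i])
    ax.set_ylabel(iris.feature_names[j])
    ax.set_title(iris.feature_names[j] + ' vs ' + iris.feature_names[i])
    ax.legend()
    return ax


# Petal length (column 2) on the x-axis, sepal length (column 0) on the y-axis
ax = plot_feature_pair(2, 0)
```

In a notebook with inline figures enabled, the figure appears when the cell finishes; elsewhere, call plt.show() afterwards. Any pair of column indices from 0 to 3 can be passed in.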

Training and Testing Sets

To evaluate how well a classifier predicts labels for data it has not seen before, we need to divide our dataset into training and testing sets.

Training Set

A portion of the data, usually a majority, used to train a machine learning classifier. These are the examples that the computer will learn from in order to predict data labels.

Testing Set

A portion of the data, smaller than the training set (usually about 30%), used to test the accuracy of the machine learning classifier. The computer does not "see" this data while learning, but tries to guess its labels afterwards. We can then measure the accuracy of our method as the fraction of test examples it labeled correctly.
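
The accuracy calculation can be sketched in a few lines. The true labels and predictions below are invented for illustration and do not come from any trained classifier.

```python
import numpy as np

# Hypothetical true labels and classifier guesses for 10 test examples
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 0, 1])
y_pred = np.array([0, 0, 1, 2, 1, 2, 2, 1, 0, 1])

# Accuracy = (number of correct predictions) / (number of test examples)
n_correct = np.sum(y_true == y_pred)
accuracy = n_correct / len(y_true)

print('Correct predictions: ' + str(n_correct) + ' out of ' + str(len(y_true)))
print('Accuracy: ' + str(accuracy))
```

Here the guesses match the true labels for 8 of the 10 examples, so the accuracy is 0.8.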

Creating training and testing sets

Below, we create a training and testing set from the iris dataset using the train_test_split() function.


In [ ]:
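# train_test_split lives in sklearn.model_selection
# (the older sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)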
print('Original dataset size: ' + str(X.shape))
print('Training dataset size: ' + str(X_train.shape))
print('Test dataset size: ' + str(X_test.shape))
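
Because train_test_split() shuffles the data randomly, the split above changes on every run and may not contain the same number of each species. If a repeatable, balanced split is wanted, one possible variation is sketched below. It imports train_test_split from sklearn.model_selection, where it lives in current scikit-learn releases; the random_state value of 0 is an arbitrary choice, and stratify=y is an addition not used in the original cell.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# random_state fixes the shuffle so the split is the same on every run;
# stratify=y keeps the species proportions equal in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

print('Training dataset size: ' + str(X_train.shape))
print('Test dataset size: ' + str(X_test.shape))
print('Training examples per species: ' + str(np.bincount(y_train)))
print('Test examples per species: ' + str(np.bincount(y_test)))
```

With 150 examples and test_size=0.3, this gives 105 training examples and 45 test examples, and stratification places 15 flowers of each species in the test set.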

More information on different methods for creating training and testing sets is available on scikit-learn's cross-validation page.

