What is a dataset?

A dataset is a collection of information (or data) that can be used by a computer. A dataset typically has some number of examples, where each example has features associated with it. Some datasets also include labels, which is an identifying piece of information that is of interest.

What is an example?

An example is a single element of a dataset, typically a row (similar to a row in a table). Multiple examples are used to generalize trends about the dataset as a whole. When predicting the list price of a house, each house would be considered a single example.

Examples are often referred to with the letter $x$.

What is a feature?

A feature is a measurable characteristic that describes an example in a dataset. Features make up the information that a computer can use to learn and make predictions. If your examples are houses, your features might be: the square footage, the number of bedrooms, or the number of bathrooms. Some features are more useful than others. When predicting the list price of a house the number of bedrooms is a useful feature while the color of the walls is not, even though they both describe the house.

Features are sometimes specified as a single element of an example, $x_i$

What is a label?

A label identifies a piece of information about an example that is of particular interest. In machine learning, the label is the information we want the computer to learn to predict. In our housing example, the label would be the list price of the house.

Labels can be continuous (e.g. price, length, width) or they can be a category label (e.g. color, species of plant/animal). They are typically specified by the letter $y$.

The Iris Dataset

Here, we use the Iris dataset, available through scikit-learn. Scikit-learn's explanation of the dataset is here.

This dataset contains information on three species of iris flowers: Setosa, Versicolour, and Virginica.

Iris Setosa source Iris Versicolour source Iris Virginica source

Each example has four features (or measurements): sepal length, sepal width, petal length, and petal width. All measurements are in cm.

Petal and sepal of a primrose plant. From wikipedia

Examples

The datasets consists of 150 examples, 50 examples from each species of iris.

Features

The features are the columns of the dataset. In order from left to right (or 0-3) they are: sepal length, sepal width, petal length, and petal width.

Target

The target value is the species of Iris. The three three species are Setosa, Versicolour, and Virginica.

Our goal

The goal, for this dataset, is to train a computer to predict the species of a new iris plant, given only the measured length and width of its sepal and petal.

Setup

Tell matplotlib to print figures in the notebook. Then import numpy (for numerical data), pyplot (for plotting figures) ListedColormap (for plotting colors), and datasets (to download the iris dataset from scikit-learn).

Also create the color maps to use to color the plotted data, and "labelList", which is a list of colored rectangles to use in plotted legends


In [ ]:
# Print figures in the notebook
%matplotlib inline 

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets # Import datasets from scikit-learn

# Import patch for drawing rectangles in the legend
from matplotlib.patches import Rectangle

# Create color maps
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

# Create a legend for the colors, using rectangles for the corresponding colormap colors
labelList = []
for color in cmap_bold.colors:
    labelList.append(Rectangle((0, 0), 1, 1, fc=color))

Import the dataset

Import the dataset and store it to a variable called iris. This dataset is similar to a python dictionary, with the keys: ['DESCR', 'target_names', 'target', 'data', 'feature_names']

The data features are stored in iris.data, where each row is an example from a single flower, and each column is a single feature. The feature names are stored in iris.feature_names. Labels are stored as the numbers 0, 1, or 2 in iris.target, and the names of these labels are in iris.target_names.


In [ ]:
# Import some data to play with
iris = datasets.load_iris()

# List the data keys
print('Keys: ' + str(iris.keys()))
print('Label names: ' + str(iris.target_names))
print('Feature names: ' + str(iris.feature_names))
print('')

# Store the labels (y), label names, features (X), and feature names
y = iris.target       # Labels are stored in y as numbers
labelNames = iris.target_names # Species names corresponding to labels 0, 1, and 2
X = iris.data
featureNames = iris.feature_names

# Show the first five examples
print(X[1:5,:])

Visualizing the data

Visualizing the data can help us better understand the data and make use of it. The following block of code will create a plot of sepal length (x-axis) vs sepal width (y-axis). The colors of the datapoints correspond to the labeled species of iris for that example.

After plotting, look at the data. What do you notice about the way it is arranged?


In [ ]:
# Plot the data

# Sepal length and width
X_sepal = X[:,:2]
# Get the minimum and maximum values with an additional 0.5 border
x_min, x_max = X_sepal[:, 0].min() - .5, X_sepal[:, 0].max() + .5
y_min, y_max = X_sepal[:, 1].min() - .5, X_sepal[:, 1].max() + .5

plt.figure(figsize=(8, 6))

# Plot the training points
plt.scatter(X_sepal[:, 0], X_sepal[:, 1], c=y, cmap=cmap_bold)
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')
plt.title('Sepal width vs length')

# Set the plot limits
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)

plt.legend(labelList, labelNames)

plt.show()

Make your own plot

Below, try making your own plots. First, modify the previous code to create a similar plot, showing the petal width vs the petal length. You can start by copying and pasting the previous block of code to the cell below, and modifying it to work.

How is the data arranged differently? Do you think these additional features would be helpful in determining to which species of iris a new plant should be categorized?

What about plotting other feature combinations, like petal length vs sepal length?

Once you've plotted the data several different ways, think about how you would predict the species of a new iris plant, given only the length and width of its sepals and petals.


In [ ]:
# Put your code here!

Training and Testing Sets

In order to evaluate our data properly, we need to divide our dataset into training and testing sets.

  • Training Set - Portion of the data used to train a machine learning algorithm. These are the examples that the computer will learn from in order to try to predict data labels.
  • Testing Set - Portion of the data (usually 10-30%) not used in training, used to evaluate performance. The computer does not "see" this data while learning, but tries to guess the data labels. We can then determine the accuracy of our method by determining how many examples it got correct.
  • Validation Set - (Optional) A third section of data used for parameter tuning or classifier selection. When selecting among many classifiers, or when a classifier parameter must be adjusted (tuned), a this data is used like a test set to select the best parameter value(s). The final performance is then evaluated on the remaining, previously unused, testing set.

Creating training and testing sets

Below, we create a training and testing set from the iris dataset using using the train_test_split() function.


In [ ]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

print('Original dataset size: ' + str(X.shape))
print('Training dataset size: ' + str(X_train.shape))
print('Test dataset size: ' + str(X_test.shape))

Create validation set using crossvalidation

Crossvalidation allows us to use as much of our data as possible for training without training on our test data. We use it to split our training set into training and validation sets.

  • Divide data into multiple equal sections (called folds)
  • Hold one fold out for validation and train on the other folds
  • Repeat using each fold as validation

The KFold() function returns an iterable with pairs of indices for training and testing data.


In [ ]:
from sklearn.model_selection import KFold

# Older versions of scikit learn used n_folds instead of n_splits
kf = KFold(n_splits=5)
for trainInd, valInd in kf.split(X_train):
    X_tr = X_train[trainInd,:]
    y_tr = y_train[trainInd]
    X_val = X_train[valInd,:]
    y_val = y_train[valInd]
    print("%s %s" % (X_tr.shape, X_val.shape))

More information on different methods for creating training and testing sets is available at scikit-learn's crossvalidation page.


In [ ]: