A dataset is a collection of information (or data) that can be used by a computer. A dataset typically contains some number of examples, each of which has features associated with it. Some datasets also include labels, which identify a piece of information of particular interest.
An example is a single element of a dataset, typically a row (similar to a row in a table). Looking across many examples lets us identify trends in the dataset as a whole. When predicting the list price of a house, each house would be considered a single example.
Examples are often referred to with the letter $x$.
A feature is a measurable characteristic that describes an example in a dataset. Features make up the information that a computer can use to learn and make predictions. If your examples are houses, your features might be the square footage, the number of bedrooms, or the number of bathrooms. Some features are more useful than others: when predicting the list price of a house, the number of bedrooms is a useful feature while the color of the walls is not, even though both describe the house.
A single feature of an example is sometimes written as $x_i$.
A label identifies a piece of information about an example that is of particular interest. In machine learning, the label is the information we want the computer to learn to predict. In our housing example, the label would be the list price of the house.
Labels can be continuous (e.g. price, length, width) or categorical (e.g. color, species of plant or animal). They are typically denoted by the letter $y$.
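To make this notation concrete, here is a minimal sketch of a toy housing dataset; all of the names and values below are invented purely for illustration. Each row of the feature matrix is one example, each column is one feature, and the label array holds the list prices we would want to predict.
In [ ]:
import numpy as np

# A toy housing dataset (all values invented for illustration).
# Each row is one example; the columns are the features:
# square footage, number of bedrooms, number of bathrooms.
X_toy = np.array([[2100, 3, 2],
                  [1600, 2, 1],
                  [2850, 4, 3]])

# The labels (y) are the list prices we want to learn to predict
y_toy = np.array([399000, 289000, 540000])

print(X_toy[0])     # a single example x: the first house
print(X_toy[0, 1])  # a single feature x_i: that house's bedroom count
print(y_toy[0])     # the label y for the first example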
Here, we use the Diabetes dataset, available through scikit-learn. This dataset contains measurements from individual patients along with a measure of their diabetes disease progression.
The dataset consists of 442 examples, each representing an individual diabetes patient.
The dataset contains 10 features: age, sex, body mass index, average blood pressure, and six blood serum measurements.
The target is a quantitative measure of disease progression after one year.
The goal, for this dataset, is to train a computer to predict the progression of diabetes after one year.
In [ ]:
# Print figures in the notebook
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets # Import datasets from scikit-learn
import matplotlib.cm as cm
from matplotlib.colors import Normalize
Import the dataset and store it in a variable called diabetes. This dataset is similar to a Python dictionary, with the keys: ['DESCR', 'target', 'data', 'feature_names'].
The data features are stored in diabetes.data, where each row is an example from a single patient, and each column is a single feature. The feature names are stored in diabetes.feature_names. Target values are stored in diabetes.target.
In [ ]:
# Import some data to play with
diabetes = datasets.load_diabetes()
# List the data keys
print('Keys: ' + str(diabetes.keys()))
print('Feature names: ' + str(diabetes.feature_names))
print('')
# Store the labels (y), features (X), and feature names
y = diabetes.target # Labels are stored in y as numbers
X = diabetes.data
featureNames = diabetes.feature_names
# Show the first five examples
X[:5,:]
Visualizing the data can help us better understand the data and make use of it. The following block of code will create a plot of serum measurement 1 (x-axis) vs serum measurement 6 (y-axis). The level of diabetes progression has been mapped to fit in the [0,1] range and is shown as a color scale.
In [ ]:
norm = Normalize(vmin=y.min(), vmax=y.max()) # need to normalize target to [0,1] range for use with colormap
plt.scatter(X[:, 4], X[:, 9], c=norm(y), cmap=cm.bone_r)
plt.colorbar()
plt.xlabel('Serum Measurement 1 (s1)')
plt.ylabel('Serum Measurement 6 (s6)')
plt.show()
In [ ]:
# Put your code here!
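For instance, one could compare body mass index against average blood pressure (feature columns 2 and 3 in this dataset), again coloring each point by the normalized target; a minimal sketch, reusing the norm object defined above:
In [ ]:
# Body mass index (column 2) vs average blood pressure (column 3),
# colored by normalized disease progression
plt.scatter(X[:, 2], X[:, 3], c=norm(y), cmap=cm.bone_r)
plt.colorbar()
plt.xlabel('Body Mass Index (bmi)')
plt.ylabel('Average Blood Pressure (bp)')
plt.show()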
In order to evaluate our models properly, we need to divide our dataset into training and testing sets.
Below, we create training and testing sets from the diabetes dataset using the train_test_split() function.
In [ ]:
from sklearn.model_selection import train_test_split
# Reserve 30% of the examples for the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print('Original dataset size: ' + str(X.shape))
print('Training dataset size: ' + str(X_train.shape))
print('Test dataset size: ' + str(X_test.shape))
Cross-validation allows us to use as much of our data as possible for training without ever touching our test data. We use it to split our training set into training and validation sets.
The KFold() class generates the splits; its split() method returns an iterable of (train, validation) index pairs.
In [ ]:
from sklearn.model_selection import KFold

# Older versions of scikit-learn used n_folds instead of n_splits
kf = KFold(n_splits=5)

for trainInd, valInd in kf.split(X_train):
    X_tr = X_train[trainInd, :]
    y_tr = y_train[trainInd]
    X_val = X_train[valInd, :]
    y_val = y_train[valInd]
    print("%s %s" % (X_tr.shape, X_val.shape))
More information on different methods for creating training and testing sets is available on scikit-learn's cross-validation page.
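For instance, ShuffleSplit draws several independent random train/validation splits rather than partitioning the data into disjoint folds the way KFold does; a minimal sketch (the number of splits and the validation fraction below are arbitrary choices):
In [ ]:
from sklearn.model_selection import ShuffleSplit

# Draw 5 independent random splits, holding out 20% of the
# training data for validation each time
ss = ShuffleSplit(n_splits=5, test_size=0.2)

for trainInd, valInd in ss.split(X_train):
    print("%s %s" % (X_train[trainInd, :].shape, X_train[valInd, :].shape))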