Most machine learning algorithms implemented in scikit-learn expect data to be stored in a
two-dimensional array or matrix. The arrays can be
either numpy arrays or, in some cases, scipy.sparse matrices.
The expected shape of the array is [n_samples x n_features].
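As a quick sketch of that layout (the values here are made up), a tiny feature matrix with 3 samples and 2 features looks like this:

```python
import numpy as np

# 3 samples (rows) x 2 features (columns) - the shape sklearn expects
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [4.7, 3.2]])
print(X.shape)  # (3, 2) -> n_samples=3, n_features=2
```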
Today we are going to use the `iris` dataset which comes with sklearn. It's fairly small as we'll see shortly.
TIP: Commonly, machine learning algorithms will require your data to be standardized, normalized, or even regularized and otherwise preprocessed. In `sklearn` the data must also take on a certain structure, as discussed above.
QUICK QUESTION: In an [n_samples x n_features] data array, what do you think the rows and columns represent?
In [ ]:
from sklearn.datasets import load_iris
iris = load_iris()
print(type(iris.data))
print(type(iris.target))
In [ ]:
import pandas as pd
import numpy as np
%matplotlib inline
In [ ]:
import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
pd.DataFrame({'feature name': iris.feature_names})
In [ ]:
import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
pd.DataFrame({'target name': iris.target_names})
TIP: all datasets included with `sklearn` have at least `feature_names` and sometimes `target_names`
Shape and representation
In [ ]:
import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
# How many data points (rows) x how many features (columns)
print(iris.data.shape)
print(iris.target.shape)
Sneak a peek at the data (a reminder of your pandas dataframe methods)
In [ ]:
# convert to pandas df (adding real column names)
iris.df = pd.DataFrame(iris.data,
columns = iris.feature_names)
# first few rows
iris.df.head()
Describe the dataset with some summary statistics
In [ ]:
# summary stats
iris.df.describe()
We got lucky with the iris dataset: it has no missing values, it's already in numpy arrays, and it has the correct shape for sklearn. However, we could still try standardization and/or normalization. (Later, in the transforms section, we will show one-hot encoding, another preprocessing step.)

What you might have to do before using a learner in `sklearn`:
- Convert non-numeric values to numeric (e.g. with pandas)
- Features should end up in a numpy.ndarray (hence numeric) and labels in a list
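A minimal sketch of that kind of wrangling (the column names and values here are hypothetical): map a non-numeric column to integer codes with pandas, then pull out the numpy array for the features and a plain list for the labels.

```python
import pandas as pd

# hypothetical raw data with one non-numeric column
df = pd.DataFrame({'length': [5.1, 4.9, 4.7],
                   'color': ['red', 'blue', 'red']})

# map the categorical column to integer codes
df['color'] = df['color'].map({'red': 0, 'blue': 1})

X = df.values        # features as a numpy.ndarray (all numeric now)
y = [0, 1, 0]        # labels as a simple list
print(X.shape)       # (3, 2)
```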
Data options:
If you use your own data or "real-world" data you will likely have to do some data wrangling and need to leverage pandas for some data manipulation.
FYI: you'll commonly see the data or feature set (the ML term for data without its labels) represented as a capital X, and the targets or labels (if we have them) represented as a lowercase y. This is because the data is a 2D array (a list of lists) and the targets are a 1D array (a simple list).
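We can confirm that convention directly on iris: X is two-dimensional, y is one-dimensional.

```python
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

print(X.ndim, X.shape)  # 2 (150, 4) -> 2D feature matrix
print(y.ndim, y.shape)  # 1 (150,)   -> 1D label vector
```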
In [ ]:
# Standardization aka scaling
from sklearn import preprocessing, datasets
# make sure we have iris loaded
iris = datasets.load_iris()
X, y = iris.data, iris.target
# standardize features to zero mean and unit variance
X_scaled = preprocessing.scale(X)
# how does it look now
pd.DataFrame(X_scaled).head()
In [ ]:
# let's just confirm our standardization worked (mean is 0 w/ unit variance)
pd.DataFrame(X_scaled).describe()
# also could:
#print(X_scaled.mean(axis = 0))
#print(X_scaled.std(axis = 0))
PRO TIP: To save our standardization and reapply later (say to the test set or some new data), create a transformer object like so:
scaler = preprocessing.StandardScaler().fit(X_train)
# apply to a new dataset (e.g. test set):
scaler.transform(X_test)
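Here is a minimal end-to-end sketch of that pattern; note that train_test_split is used here just to create the X_train/X_test referred to above.

```python
from sklearn import preprocessing, datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# fit the scaler on the training set only...
scaler = preprocessing.StandardScaler().fit(X_train)

# ...then apply the same transform to both sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.mean(axis=0))  # each feature mean is ~0
```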
In [ ]:
# Normalization (scaling each sample to unit norm)
from sklearn import preprocessing, datasets
# make sure we have iris loaded
iris = datasets.load_iris()
X, y = iris.data, iris.target
# normalize each sample (row) to unit L1 norm
X_norm = preprocessing.normalize(X, norm='l1')
# how does it look now
pd.DataFrame(X_norm).tail()
In [ ]:
# summary stats of the normalized data
pd.DataFrame(X_norm).describe()
# cumulative sum of normalized and original data:
#print(pd.DataFrame(X_norm.cumsum().reshape(X.shape)).tail())
#print(pd.DataFrame(X).cumsum().tail())
# unit norm (convert to unit vectors) - all row sums should be 1 now
X_norm.sum(axis = 1)
PRO TIP: To save our normalization (like standardization above) and reapply later (say to the test set or some new data), create a transformer object like so:
normalizer = preprocessing.Normalizer().fit(X_train)
# apply to a new dataset (e.g. test set):
normalizer.transform(X_test)
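And the same sketch for the normalizer, again using train_test_split to create the hypothetical X_train/X_test. With norm='l1' and non-negative data, each transformed row sums to 1:

```python
import numpy as np
from sklearn import preprocessing, datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Normalizer is stateless, but the fit/transform pattern still applies
normalizer = preprocessing.Normalizer(norm='l1').fit(X_train)
X_test_norm = normalizer.transform(X_test)

print(np.allclose(X_test_norm.sum(axis=1), 1.0))  # True: L1 row sums are 1
```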
Created by a Microsoft Employee.
The MIT License (MIT)
Copyright (c) 2016 Micheleen Harris