Visualize and Explore

The Dataset - Fisher's Irises

Most machine learning algorithms implemented in scikit-learn expect data to be stored in a two-dimensional array or matrix. The arrays can be either numpy arrays or, in some cases, scipy.sparse matrices. The size of the array is expected to be n_samples x n_features.

  • n_samples: The number of samples: each sample is an item to process (e.g. classify). A sample can be a document, a picture, a sound, a video, an astronomical object, a row in a database or CSV file, or whatever you can describe with a fixed set of quantitative traits.
  • n_features: The number of features or distinct traits that can be used to describe each item in a quantitative manner. Features are generally real-valued, but may be boolean or discrete-valued in some cases.

    The number of features must be fixed in advance. However, it can be very high-dimensional (e.g. millions of features), with most of them being zero for a given sample. This is a case where `scipy.sparse` matrices can be useful: they are much more memory-efficient than numpy arrays.

    If there are labels or targets, they need to be stored in one-dimensional arrays or lists.
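
Here's what that layout looks like on a tiny made-up dataset (a minimal sketch; the numbers are invented purely for illustration):

In [ ]:
import numpy as np
from scipy import sparse

# 3 samples x 2 features -> a 2D numpy array
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [4.7, 3.2]])
print(X.shape)   # (3, 2) = (n_samples, n_features)

# labels/targets live in a 1D array (or list), one entry per sample
y = np.array([0, 0, 1])
print(y.shape)   # (3,)

# if X were mostly zeros, a scipy.sparse matrix would store it compactly
X_sparse = sparse.csr_matrix(X)
print(X_sparse.shape)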

Today we are going to use the `iris` dataset which comes with sklearn. It's fairly small as we'll see shortly.

Remember our ML TIP: Ask sharp questions.
e.g. Which of the three given classes is this flower (pictured below) closest to?

(Image of the flower in question; links out to source)

Labels (species names/classes):

(Image of the species labels; links out to source)

TIP: Commonly, machine learning algorithms will require your data to be standardized, normalized, or even regularized and otherwise preprocessed. In `sklearn` the data must also take on a certain structure, as discussed above.

QUICK QUESTION:

  1. What do you expect this data set to be if you are trying to recognize an iris species?
  2. For our [n_samples x n_features] data array, what do you think
     • the samples are?
     • the features are?

In [ ]:
from sklearn.datasets import load_iris

iris = load_iris()

print(type(iris.data))
print(type(iris.target))

Let's Dive In!


In [ ]:
import pandas as pd
import numpy as np

%matplotlib inline

Features (aka columns in data)


In [ ]:
import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()

pd.DataFrame({'feature name': iris.feature_names})

Targets (aka labels)


In [ ]:
import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()

pd.DataFrame({'target name': iris.target_names})

sklearn TIP: the included datasets generally have at least feature_names, and the classification datasets also have target_names

Get to know the data - visualize and explore

  • Features (columns/measurements) come from this diagram (links out to source on kaggle):
  • Shape
  • Peek at data
  • Summaries

Shape and representation


In [ ]:
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# How many data points (rows) x how many features (columns)
print(iris.data.shape)
print(iris.target.shape)

Sneak a peek at data (a reminder of your pandas dataframe methods)


In [ ]:
# convert to a pandas DataFrame (adding real column names)
iris_df = pd.DataFrame(iris.data, 
                       columns = iris.feature_names)


# first few rows
iris_df.head()

Describe the dataset with some summary statistics


In [ ]:
# summary stats
iris_df.describe()

  • We don't have to do much with the iris dataset. It has no missing values, it's already in numpy arrays, and it has the correct shape for sklearn. However, we could try standardization and/or normalization. (Later, in the transforms section, we will show one-hot encoding, a preprocessing step.)

Preprocessing (Bonus Material)

What you might have to do before using a learner in `sklearn`:

  1. Non-numerics transformed to numeric (tip: use the applymap() method from pandas; see the sketch after this list)
  2. Fill in missing values
  3. Standardization
  4. Normalization
  5. Encoding categorical features (e.g. one-hot encoding or dummy variables)

Features should end up in a numeric numpy.ndarray and labels in a 1D array or list.
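
Here is a minimal sketch of steps 1, 2 and 5 on a made-up toy DataFrame (the column names, values, and mapping below are invented for illustration, not taken from the iris data):

In [ ]:
import numpy as np
import pandas as pd

# toy frame: one text column, one numeric column with a missing value
df = pd.DataFrame({'species': ['setosa', 'virginica', 'setosa'],
                   'petal length (cm)': [1.4, np.nan, 1.3]})

# 1. non-numerics -> numeric (applymap applies a function elementwise)
mapping = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
df_numeric = df.applymap(lambda v: mapping.get(v, v))

# 2. fill in missing values, e.g. with the column mean
df_numeric = df_numeric.fillna(df_numeric.mean())

# 5. or, one-hot encode the categorical column directly (dummy variables)
pd.get_dummies(df['species'], prefix='species')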

Data options:

If you use your own data or "real-world" data you will likely have to do some data wrangling and need to leverage pandas for some data manipulation.

Standardization - center and scale our data so each feature has zero mean and unit variance, like a standard Gaussian's (commonly needed for sklearn learners)

FYI: you'll commonly see the data or feature set (the ML word for data without its labels) represented as a capital X, and the targets or labels (if we have them) represented as a lowercase y. This is because the data is a 2D array or list of lists and the targets are a 1D array or simple list.


In [ ]:
# Standardization aka scaling
import pandas as pd
from sklearn import preprocessing, datasets

# make sure we have iris loaded
iris = datasets.load_iris()

X, y = iris.data, iris.target

# rescale each feature to zero mean and unit variance
X_scaled = preprocessing.scale(X)

# how does it look now
pd.DataFrame(X_scaled).head()

In [ ]:
# let's just confirm our standardization worked (mean is 0 w/ unit variance)
pd.DataFrame(X_scaled).describe()

# also could:
#print(X_scaled.mean(axis = 0))
#print(X_scaled.std(axis = 0))

PRO TIP: To save our standardization and reapply later (say to the test set or some new data), create a transformer object like so:

scaler = preprocessing.StandardScaler().fit(X_train)
# apply to a new dataset (e.g. test set):
scaler.transform(X_test)
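
In the snippet above, X_train and X_test are assumed to come from a train/test split; here's a quick sketch of how they might be produced:

In [ ]:
from sklearn import preprocessing, datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# hold out a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# learn the means/standard deviations from the training data only...
scaler = preprocessing.StandardScaler().fit(X_train)

# ...then reapply the exact same transform to the test set
X_test_scaled = scaler.transform(X_test)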

Normalization - scaling samples individually to have unit norm

  • This type of scaling is really important if downstream transformations or learners examine the similarity of pairs of samples (see the sklearn docs here for more)
  • A basic intro to normalization and the unit vector can be found here

In [ ]:
# Normalization aka scaling each sample to unit norm
import pandas as pd
from sklearn import preprocessing, datasets

# make sure we have iris loaded
iris = datasets.load_iris()

X, y = iris.data, iris.target

# rescale each sample (row) to have unit L1 norm
X_norm = preprocessing.normalize(X, norm='l1')

# how does it look now
pd.DataFrame(X_norm).tail()

In [ ]:
# let's just confirm our normalization worked (each row should now have unit L1 norm)
pd.DataFrame(X_norm).describe()

# cumulative sum of normalized and original data:
#print(pd.DataFrame(X_norm.cumsum().reshape(X.shape)).tail())
#print(pd.DataFrame(X).cumsum().tail())

# unit norm (convert to unit vectors) - all row sums should be 1 now
X_norm.sum(axis = 1)

PRO TIP: To save our normalization (like standardization above) and reapply later (say to the test set or some new data), create a transformer object like so:

normalizer = preprocessing.Normalizer().fit(X_train)
# apply to a new dataset (e.g. test set):
normalizer.transform(X_test)

Created by a Microsoft Employee.

The MIT License (MIT)
Copyright (c) 2016 Micheleen Harris