This is a Python-based introduction to machine learning, covering only the basics and intended for a half-day session.
Some reasons for using Python are given at https://www.quora.com/Why-is-Python-a-language-of-choice-for-data-scientists. Also, in May 2017, Python overtook R as the most popular language among the 2900 voters in KDnuggets' annual poll.
Machine learning is the process of iteratively learning a model from data. The model can then be used to make predictions about previously unseen data (supervised learning). By providing the algorithm with many observations from a domain, it infers the model automatically.
One example is the classification of email into the classes spam and not-spam. The observation here is the email itself together with metadata about it.
A more detailed explanation is found at https://github.com/amueller/scipy_2015_sklearn_tutorial/blob/master/notebooks/01.1%20Introduction%20to%20Machine%20Learning.ipynb
Machine learning infographic from Microsoft Azure: Which questions can machine learning answer?
Machine learning can be divided into three types of learning problems (by learning style):
This tutorial will be on supervised learning.
Useful Python Data Science packages included with Anaconda:
Check out https://github.com/rasbt/watermark (the %watermark magic), which makes reproducible analyses easier.
In [ ]:
# Support for type hints
import typing
import numpy as np
import pandas as pd # Data Analysis Library: high-performance, easy-to-use data structures and data analysis tools
# Data visualization
import matplotlib.pyplot as plt
import seaborn as sb # Statistical data visualization
# Display plots inside notebook
%matplotlib inline
# To make inline plots not blurry on retina Macbook
%config InlineBackend.figure_format = 'retina'
# Machine learning sample datasets and algorithms
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
In [ ]:
from matplotlib.colors import ListedColormap
# Create color maps for 3-class classification problem, as with iris
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
cmap_same = ListedColormap(['k', 'k', 'k'])
def plot_iris_knn(classify=True):
    iris = load_iris()
    X = iris.data[:, 2:]  # we only take the last two features (petal length/width);
                          # we could avoid this slicing by using a two-dimensional dataset
    y = iris.target

    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X, y)

    x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
    y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    if classify:
        plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot also the training points
    if classify:
        plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
    else:
        plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_same)
    plt.xlabel('petal length (cm)')
    plt.ylabel('petal width (cm)')
    plt.title("Score = {:.2f}".format(knn.score(X, y)))
    plt.axis('tight')
    plt.show()
In [ ]:
# Set classify to False to show how the dataset looks without labels (e.g. for clustering with unsupervised learning)
plot_iris_knn(classify=True)
There are several different workflows to choose from, but the Cross Industry Standard Process for Data Mining (CRISP-DM) remains one of the most widely used, according to a KDnuggets poll from 2014. The workflows mostly differ in how they divide up the steps. References for CRISP-DM:
Other workflows:
The following classification example with the Iris flowers is based on the excellent (and much more detailed) notebook https://github.com/rhiever/Data-Analysis-and-Machine-Learning-Projects/blob/master/example-data-science-notebook/Example%20Machine%20Learning%20Notebook.ipynb.
A similar tutorial is also given at machinelearningmastery.com.
In [ ]:
# Load dataset
iris_dataset = load_iris()
# Describe (provided description in dataset)
print(iris_dataset.DESCR)
In [ ]:
# Loading data into pandas Dataframe
print(iris_dataset.target_names)
iris_data = pd.DataFrame(iris_dataset.data, columns=iris_dataset.feature_names)
# Convert 0,1,2 in target column to iris_dataset.target_names
iris_data['class'] = iris_dataset.target
iris_data['class'] = iris_data['class'].apply(lambda i: iris_dataset.target_names[i])
iris_data.head()
In [ ]:
# Enter your solution here
In [ ]:
# Summary statistics about dataset
iris_data.describe()
In [ ]:
# Summary statistics for each class, but limiting to only petal length/width
iris_data.groupby('class')[['petal length (cm)', 'petal width (cm)']].describe(percentiles=[])
Note:
In [ ]:
iris_data.plot(figsize=(10,8))
The x-axis is the flower (row) number; rows 0-49 are one class of flowers, and so on. From the plot it seems that petal length/width are the most useful features for classifying the flowers.
In [ ]:
# Plot pairwise relationships in dataset
# Note: we can only plot in 2D (or 3D), so that's a visualization restriction even when we work with higher-dimensional data.
sb.pairplot(iris_data, hue='class')
The diagonal displays the distribution of each feature; the off-diagonal plots display the pairwise relationships (correlations) between the features.
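The numeric counterpart of the off-diagonal plots is the correlation matrix, which pandas can compute directly (a small aside):
In [ ]:
# Pairwise correlations between the numeric features (dropping the non-numeric class column)
iris_data.drop('class', axis=1).corr()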
In [ ]:
# setosa has an outlier on sepal width
print(iris_data.columns.values)
# Create histogram for column 'sepal width (cm)' and rows that have class=='setosa'.
iris_data[iris_data['class'] == 'setosa']['sepal width (cm)'].hist()
In [ ]:
plt.figure(figsize=(10, 10))
# Making boxplots of each of the features, split on class
for column_index, column in enumerate(iris_data.columns):
    if column == 'class':
        continue
    plt.subplot(2, 2, column_index + 1)
    # Boxplot
    sb.boxplot(x='class', y=column, data=iris_data)
    # Violinplot is like boxplot but box is scaled according to density of the data
    #sb.violinplot(x='class', y=column, data=iris_data)
In [ ]:
# Example of constructing new features.
iris_data['petal area'] = iris_data['petal length (cm)'] * iris_data['petal width (cm)']
iris_data[['petal area']].plot()
iris_data.groupby('class')['petal area'].describe(percentiles=[])
iris_data.drop(['petal area'], inplace=True, axis=1)
A greedy learning algorithm:
Gini impurity is a measure of misclassification (a small sketch follows below this list).
The impurity measure defines how well the classes are separated.
In general, the impurity measure should satisfy:
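To make this concrete, here is a minimal sketch using the standard Gini formula (the gini_impurity helper is only for illustration): the Gini impurity of a node with class proportions p_k is 1 - Σ p_k², which is 0 for a pure node and grows as the node gets more mixed.
In [ ]:
# Minimal sketch: Gini impurity of a node, given the number of samples of each class in it
def gini_impurity(class_counts):
    proportions = np.array(class_counts) / np.sum(class_counts)
    return 1.0 - np.sum(proportions ** 2)

print(gini_impurity([50, 0, 0]))    # pure node -> 0.0
print(gini_impurity([25, 25, 0]))   # two classes, evenly mixed -> 0.5
print(gini_impurity([50, 50, 50]))  # three classes, evenly mixed -> ~0.67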
In [ ]:
# 1. Using Decision Tree as modeling technique:
decision_tree_classifier = DecisionTreeClassifier()
In [ ]:
# 2. Generate test design
# Load data in input (X) and output (y) variables
y = iris_data['class'].values
X = iris_data.drop('class', axis=1).values
print("Shape of input: %s" % str(X.shape))
print("Shape of output: %s" % str(y.shape))
from sklearn.model_selection import train_test_split
# split data into (75%) training and (25%) test set.
# Fixing random state for demonstration purpose
(X_train, X_test, y_train, y_test) = train_test_split(X, y, train_size=0.75, random_state=1)
print("Shape of training data: %s" % str(X_train.shape))
print("Shape of test data: %s" % str(X_test.shape))
In [ ]:
# 3. Build model
# Train the classifier on the training set
decision_tree_classifier.fit(X_train, y_train)
In [ ]:
# Visualize the decision tree
from IPython.display import Image
from sklearn import tree
import pydotplus
def visualize_decision_tree(decision_tree_classifier):
    dot_data = tree.export_graphviz(decision_tree_classifier, out_file=None,
                                    feature_names=iris_dataset.feature_names,
                                    class_names=iris_dataset.target_names,
                                    filled=True, rounded=True,
                                    special_characters=True)
    graph = pydotplus.graph_from_dot_data(dot_data)
    return Image(graph.create_png())
visualize_decision_tree(decision_tree_classifier)
TODO: There are different ways of measuring performance. What is a good way? It depends on the problem. See https://en.wikipedia.org/wiki/Confusion_matrix
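As one example of such a measure, the confusion matrix of the decision tree on the held-out test set can be computed with scikit-learn (a small sketch that jumps slightly ahead; test-set predictions are discussed below):
In [ ]:
# Confusion matrix for the classifier trained above, evaluated on the test set.
# Rows are true classes, columns are predicted classes.
from sklearn.metrics import confusion_matrix
print(iris_dataset.target_names)
print(confusion_matrix(y_test, decision_tree_classifier.predict(X_test),
                       labels=iris_dataset.target_names))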
In [ ]:
# What is the result on the training set?
decision_tree_classifier.score(X_train, y_train)
100 % correct on training set! Is that good?
In [ ]:
# Predict class on whole test set
y_pred = decision_tree_classifier.predict(X_test)
# Compare prediction of first element in test set with real class
print("Predicted class first element: {}".format(y_pred[0]))
# Real class:
print("Real class first element: {}".format(y_test[0]))
#print(y_pred)
#print(y_test)
correct_predictions = np.sum(y_pred == y_test)
print("Count number of predictions equal to real class: {:d}".format(correct_predictions))
total_predictions = y_pred.shape[0]
print("Total number of predictions: {:d}".format(total_predictions))
accuracy = correct_predictions / total_predictions
print("Accuracy {}/{}={:2f}".format(correct_predictions, total_predictions, accuracy))
In [ ]:
# Validate the classifier on the testing set using classification accuracy
decision_tree_classifier.score(X_test, y_test)
Is that good? Depends on the sampling in train/test-split, business problem, ...
In [ ]:
# Check how accuracy varies with different samples (random_state is not fixed)
model_accuracies = []
for repetition in range(1000):
    (X_train, X_test, y_train, y_test) = train_test_split(X, y, train_size=0.75)
    decision_tree_classifier = DecisionTreeClassifier()
    decision_tree_classifier.fit(X_train, y_train)
    classifier_accuracy = decision_tree_classifier.score(X_test, y_test)
    model_accuracies.append(classifier_accuracy)

ax = sb.distplot(model_accuracies)
ax.set_title('Average score: {:.2f} ({:.3f})'.format(np.mean(model_accuracies), np.std(model_accuracies)))
There are big variations in how our model performs for different subsets of the training data.
It does not generalize well to new data (X_test) => overfitting.
This problem is the main reason that most data scientists perform k-fold cross-validation on their models: Split the original data set into k subsets, use one of the subsets as the testing set, and the rest of the subsets are used as the training set. This process is then repeated k times such that each subset is used as the testing set exactly once. 10-fold cross-validation is the most common choice, so let's use that here. Performing 10-fold cross-validation on our data set looks something like this:
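Before using scikit-learn's helpers below, here is a minimal sketch of doing this by hand with KFold (variable names are illustrative):
In [ ]:
# Minimal sketch of 10-fold cross-validation "by hand" with KFold
from sklearn.model_selection import KFold

fold_scores = []
kf = KFold(n_splits=10, shuffle=True, random_state=1)
for train_index, test_index in kf.split(X):
    clf = DecisionTreeClassifier()
    clf.fit(X[train_index], y[train_index])
    fold_scores.append(clf.score(X[test_index], y[test_index]))
print("Fold accuracies:", ["{:.2f}".format(s) for s in fold_scores])
print("Mean accuracy: {:.2f}".format(np.mean(fold_scores)))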
In [ ]:
import numpy as np
from sklearn.model_selection import StratifiedKFold
def plot_cv(cv, n_samples):
    masks = []
    for train, test in cv:
        mask = np.zeros(n_samples, dtype=bool)
        mask[test] = 1
        masks.append(mask)

    plt.figure(figsize=(15, 15))
    plt.imshow(masks, interpolation='none')
    plt.ylabel('Fold')
    plt.xlabel('Row #')

skf = StratifiedKFold(n_splits=10)
plot_cv(skf.split(X, y), len(y))
Stratified k-fold cross-validation: stratified means that the class proportions are kept the same across all of the folds, so each fold remains a representative subset of our data set (e.g., so we don't end up with 100% Iris setosa entries in one of the folds).
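As a quick sanity check, we can verify that each test fold contains all three classes in roughly equal proportions (a small sketch):
In [ ]:
# Class counts in each test fold produced by StratifiedKFold
for fold, (train_index, test_index) in enumerate(StratifiedKFold(n_splits=10).split(X, y)):
    classes, counts = np.unique(y[test_index], return_counts=True)
    print("Fold {}: {}".format(fold, dict(zip(classes, counts))))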
In [ ]:
from sklearn.model_selection import cross_val_score
decision_tree_classifier = DecisionTreeClassifier()
# cross_val_score returns a list of the scores, which we can visualize
# to get a reasonable estimate of our classifier's performance
cv_scores = cross_val_score(decision_tree_classifier, X, y, cv=10)
sb.distplot(cv_scores)
plt.title('Average score: {:.2f} ({:.3f})'.format(np.mean(cv_scores), np.std(cv_scores)))
In [ ]:
# max_depth=1 gives low score
decision_tree_classifier = DecisionTreeClassifier(max_depth=1)
cv_scores = cross_val_score(decision_tree_classifier, X, y, cv=10)
sb.distplot(cv_scores, kde=False)
plt.title('Average score: {:.2f}'.format(np.mean(cv_scores)))
One common, systematic way to find the best parameters for a given model and data set is Grid Search.
In [ ]:
from sklearn.model_selection import GridSearchCV
decision_tree_classifier = DecisionTreeClassifier()
# Which parameters can be searched
print(decision_tree_classifier.get_params())
parameter_grid = {'max_depth': [1, 2, 3, 4, 5],
'max_features': [1, 2, 3, 4]}
grid_search = GridSearchCV(decision_tree_classifier,
param_grid=parameter_grid,
cv=10)
grid_search.fit(X, y)
print('Best score: {:.2f}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
In [ ]:
grid_visualization = grid_search.cv_results_['mean_test_score']
depth_count = len(parameter_grid['max_depth']) # 5
feature_count = len(parameter_grid['max_features']) # 4
grid_visualization = np.array(grid_visualization)
grid_visualization.shape = (depth_count, feature_count)
sb.heatmap(grid_visualization, cmap='Blues')
plt.xticks(np.arange(feature_count) + 0.5, grid_search.param_grid['max_features'])
plt.yticks(np.arange(depth_count) + 0.5, grid_search.param_grid['max_depth'][::-1])
plt.xlabel('max_features')
plt.ylabel('max_depth')
In [ ]:
decision_tree_classifier = DecisionTreeClassifier()
parameter_grid = {'criterion': ['gini', 'entropy'],
'splitter': ['best', 'random'],
'max_depth': [1, 2, 3, 4, 5],
'max_features': [1, 2, 3, 4]}
grid_search = GridSearchCV(decision_tree_classifier,
param_grid=parameter_grid,
cv=10)
grid_search.fit(X, y)
print('Best score: {:.2f}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
In [ ]:
best_decision_tree_classifier = grid_search.best_estimator_
best_decision_tree_classifier
In [ ]:
visualize_decision_tree(best_decision_tree_classifier)
A powerful idea: Combining weak learners to make a strong one.
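To make the idea concrete before turning to Random Forests, here is a toy sketch (not exactly how RandomForestClassifier works internally): train several small, randomized trees on bootstrap samples of the training data and let them vote on each test sample. The exact accuracies will vary from run to run.
In [ ]:
# Toy sketch of bagging: bootstrap samples + majority vote over many weak trees
n_trees = 25
all_predictions = []
for i in range(n_trees):
    # Bootstrap sample: draw len(X_train) rows with replacement
    sample_idx = np.random.choice(len(X_train), size=len(X_train), replace=True)
    weak_tree = DecisionTreeClassifier(max_depth=2, max_features=2)
    weak_tree.fit(X_train[sample_idx], y_train[sample_idx])
    all_predictions.append(weak_tree.predict(X_test))
votes = np.array(all_predictions)  # shape: (n_trees, number of test samples)
majority = np.array([pd.Series(votes[:, j]).mode()[0] for j in range(votes.shape[1])])
print("Single weak tree accuracy: {:.2f}".format(np.mean(votes[0] == y_test)))
print("Majority-vote accuracy:    {:.2f}".format(np.mean(majority == y_test)))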
In [ ]:
from sklearn.ensemble import RandomForestClassifier
random_forest_classifier = RandomForestClassifier()
parameter_grid = {'n_estimators': [5, 10, 25, 50],
'criterion': ['gini', 'entropy'],
'max_features': [1, 2, 3, 4],
'warm_start': [True, False]}
grid_search = GridSearchCV(random_forest_classifier,
param_grid=parameter_grid,
cv=10)
grid_search.fit(X, y)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
grid_search.best_estimator_
In [ ]:
# Based on https://github.com/jakevdp/PMLC-2014/blob/master/fig_code/linear_regression.py
max_x = 30
np.random.seed(0)
def generate_linear_data(n: int) -> typing.Tuple[np.ndarray, np.ndarray]:
    a = 0.5
    b = 1.0
    # generate n random values in the range [0, max_x)
    x = np.sort(max_x * np.random.random(n))
    # y = a*x + b with noise
    noise = np.random.normal(size=x.shape)
    y = a * x + b + noise
    return x, y
def create_linear_model(x: np.ndarray, y: np.ndarray) -> LinearRegression:
    # create a linear regression model
    ## linear regression attempts to draw the straight line that
    ## best minimizes the residual sum of squares between
    ## the model and the observed responses in the dataset
    model = LinearRegression()
    # scikit-learn expects 2D input, so reshape x from (n,) to (n, 1)
    X = x.reshape(-1, 1)
    model.fit(X, y)
    # From http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score:
    # Score for LinearRegression: the coefficient R^2 is defined as (1 - u/v),
    # where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and
    # v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum().
    # The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse).
    print("Goodness-of-fit on samples: {:.2f} (1.0 is perfect fit)".format(model.score(X, y)))
    # The *-operator unpacks an argument list (https://docs.python.org/3/tutorial/controlflow.html#unpacking-argument-lists)
    print("Model parameters: a = {:.2f}, b = {:.2f}".format(*model.coef_, model.intercept_))
    return model
def plot_results(x_samples: np.ndarray, y_samples: np.ndarray, model):
    # 100 values with equal distance from 0 to max_x
    x_pred = np.linspace(0, max_x, 100)
    # predict y from the data
    y_pred = model.predict(x_pred.reshape(-1, 1))

    ax = plt.axes()
    # Scatter plot samples
    ax.scatter(x_samples, y_samples, label='Samples')
    # Plot prediction model
    ax.plot(x_pred, y_pred, label='Model')

    # Plot error lines
    label = False
    for x, y in zip(x_samples, y_samples):
        # predict expects a 2D array, so wrap the single x value
        y_model = model.predict(np.array([[x]]))[0]
        # plot vertical error line from model to sample
        # 2 x,y pairs: (x_sample, y_model) and (x_sample, y_sample)
        if not label:  # only print label on first error line
            ax.plot([x, x], [y_model, y], 'r', label='Error')
            label = True
        else:
            ax.plot([x, x], [y_model, y], 'r')

    # Plotting specifics
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.legend(loc='best')
    ax.axis('tight')
In [ ]:
# Try with larger sample size than 20, e.g. 1000, and see that model parameters are closer to original line: a=0.5, b=1.0
# Try a few runs with sample size 2. Explain the Goodness-of-fit and model parameters.
x, y = generate_linear_data(20)
model = create_linear_model(x, y)
plot_results(x, y, model)
plt.show()
The line is fitted to the samples/points by minimizing the total (squared) error.
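As a quick sanity check (a sketch using only np.polyfit), the same least-squares line can be computed in closed form and should match the scikit-learn model above:
In [ ]:
# Closed-form least-squares fit of a degree-1 polynomial; should agree with the model above
a_fit, b_fit = np.polyfit(x, y, deg=1)
print("np.polyfit:   a = {:.2f}, b = {:.2f}".format(a_fit, b_fit))
print("scikit-learn: a = {:.2f}, b = {:.2f}".format(model.coef_[0], model.intercept_))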
In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
n_samples = 30
max_x = 1
plot_degrees = [1, 4, 15]
degrees = list(range(1,16))
MSE = [] # list of mean squared error for given degree index
true_fun = lambda x: np.cos(1.5 * np.pi * x)
X_array = np.sort(np.random.rand(n_samples))
noise = np.random.normal(size=X_array.shape, scale=0.1)
y = true_fun(X_array) + noise
# Changing shape from (n_samples, ) to (n_samples, 1).
X = X_array.reshape(-1,1)
plt.figure(figsize=(14, 5))
plot_idx = 0
for i in range(len(degrees)):
    polynomial_features = PolynomialFeatures(degree=degrees[i],
                                             include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
    pipeline.fit(X, y)

    # Evaluate the models using crossvalidation
    scores = cross_val_score(pipeline, X, y,
                             scoring="neg_mean_squared_error", cv=10)

    # Training error decreases with increasing model complexity (increasing degree of fitted polynomial)
    y_pred = pipeline.predict(X)
    y_true = y
    MSE.append(mean_squared_error(y_true, y_pred))
    print("Degree: {} -- MSE {:.4f} -- CV score {:.4f}".format(degrees[i], MSE[i], -scores.mean()))

    if degrees[i] in plot_degrees:
        ax = plt.subplot(1, len(plot_degrees), plot_idx + 1)
        plt.setp(ax, xticks=(), yticks=())
        X_test = np.linspace(0, max_x, 100)
        plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
        plt.plot(X_test, true_fun(X_test), label="True function")
        plt.scatter(X.reshape(-1), y, label="Samples")
        plt.xlabel("x")
        plt.ylabel("y")
        plt.xlim((0, max_x))
        plt.ylim((-2, 2))
        plt.legend(loc="best")
        plt.title("Degree {}\nMSE = {:.3f} (+/- {:.2e})".format(
            plot_degrees[plot_idx], -scores.mean(), scores.std()))
        plot_idx += 1

plt.show()
ax_train_error = plt.axes()
ax_train_error.plot(degrees, MSE)
plt.title('Training error vs complexity')
Task: find a model of the correct complexity for the given data set.
In [ ]:
# See http://www.neural.cz/dataset-exploration-boston-house-pricing.html
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
# Loading Boston House Prices dataset
# This dataset is already cleaned (no missing values)
boston_dataset = load_boston()
# Print data set Characteristics
print(boston_dataset.DESCR)
# A dataframe in Pandas is similar to a table
df = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
# target here is price (MEDV), but note that we could have chosen another one!
df['target'] = boston_dataset.target
A model is an assumption about how the world works. What could be a reasonable model for predicting house price?
In [ ]:
#print(df.describe(percentiles=[]).T)
In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('default')
df.boxplot()
In [ ]:
# Pandas styling the dataframe
def highlight_max(s):
    '''
    Highlight the maximum in a Series yellow.
    '''
    is_max = s == s.max()
    return ['background-color: yellow' if v else '' for v in is_max]

# Coloring the target column green. For more colors: http://www.w3schools.com/cssref/css_colors.asp
df.head(10).style.apply(lambda s: ['background-color: darkseagreen']*s.size, axis=0, subset=['target'])
In [ ]:
from pandas.plotting import scatter_matrix
scatter_matrix(df, alpha=0.2, figsize=(10, 10), diagonal='kde')
In [ ]:
import seaborn as sns # conventional alias
#sns.set(style="white")
# Compute the correlation matrix
corr = df.corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
#cmap = sns.diverging_palette(220, 10, as_cmap=True)
# By calling heatmap twice, all correlations with absolute value above 0.6 are highlighted
sns.heatmap(corr, mask=mask, annot=True)
sns.heatmap(corr, mask= abs(corr) < 0.6, cbar=False, annot=True)
#annot_kws={"weight": "bold"})
# Draw the heatmap with the mask and correct aspect ratio
#sns.heatmap(corr, cmap=cmap, vmax=.3,
# square=True, xticklabels=5, yticklabels=5,
# linewidths=.5, cbar_kws={"shrink": .5}, ax=ax)
In [ ]:
# Look further at the correlation between RM, LSTAT and house price (MEDV)
sns.jointplot(df['RM'], df['target'])
sns.jointplot(df['LSTAT'], df['target'])
In [ ]:
# Which of the features exhibit a linear correlation with the target?
corr.iloc[-1].apply(abs).sort_values(ascending=True)
In [ ]:
#X = df[['RM', 'LSTAT']]
X = df[['LSTAT']]
y = df['target']
X.boxplot()
In [ ]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Split data into (80%) training and (20%) test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Assuming a linear relationship between the chosen feature(s) and house price
model = LinearRegression()
# Train model
model.fit(X_train, y_train)
In [ ]:
def show_results(model, X_test, y_test):
    # Predict y-values from model:
    y_pred = model.predict(X_test)
    # The coefficients
    #print('Coefficients: \n', model.coef_)
    # The mean squared error
    print("Mean squared error: %.2f" % np.mean((y_pred - y_test) ** 2))
    # Explained variance score: 1 is perfect prediction
    print('Variance score: %.2f' % model.score(X_test, y_test))

    # If there are several features, plot against only one of them
    if X_test.shape[1] > 1:
        # Getting the second column
        X_linear = X_test.iloc[:, 1]
    else:
        X_linear = X_test
    # Plot outputs
    n = y_pred.shape[0]
    plt.scatter(X_linear, y_test, color='black')
    X_model = np.linspace(0, X_test.values.max(), n)
    plt.plot(X_model, model.predict(X_model.reshape(-1, 1)), color='blue', linewidth=3)
    plt.show()
In [ ]:
show_results(model, X_test, y_test)
Linear fit does not seem ideal. Let's try quadratic!
In [ ]:
# Code based on http://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
# Input
# X: one feature
# y: target values
# degree: highest order term in polynomial feature
def train_poly(X, y, degree=2):
    # It is still called linear regression, since it is a linear combination of polynomial terms.
    linear_regression = LinearRegression()
    polynomial_features = PolynomialFeatures(degree=degree)
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
    pipeline.fit(X, y)
    # Evaluate the models using crossvalidation
    scores = cross_val_score(pipeline, X, y,
                             scoring="neg_mean_squared_error", cv=10)
    print("Degree {}\nMSE = {:.2e} (+/- {:.2e})".format(
        degree, -scores.mean(), scores.std()))
    return pipeline
In [ ]:
model_poly = train_poly(X_train, y_train)
show_results(model_poly, X_test, y_test)
In [ ]:
show_results(train_poly(X_train, y_train, degree=5), X_test, y_test)
In [ ]: