In [2]:
import pandas as pd 
import numpy as np
from os import listdir
from IPython.display import Image

Introduction to Machine learning

Shir Meir Lador

Data scientist - the sexiest job of the 21st century (Harvard Business Review 2012)

The “data scientist” (2008) - "It's a high-ranking professional with the training and curiosity to make discoveries in the world of big data."

https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

Outline

  • What is machine learning?
  • Example of machine learning in practice
  • History of machine learning
  • Supervised and unsupervised learning
  • Classification and regression
  • Generalization
  • Data - train and test
  • Overfitting
  • Linear and logistic regression
  • Decision trees and Random forests
  • Bias variance tradeoff
  • SVM
  • Example data - visualization of the decision boundaries of different models
  • Bonus: Pro tips

What is machine learning?

Machine learning is the subfield of computer science that gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959).

The study and construction of algorithms that can learn from and make predictions on data (through building a model from sample inputs).

Example of machine learning in practice

Approving loans automatically using machine learning models based on (a minimal sketch follows the list):

  • Features based on the user's bank account
  • Features based on the user's credit score
  • Features based on the user's appearances on the web
  • ...
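To make this concrete, here is a minimal sketch (not from the slides; the feature names and values below are invented for illustration) of how such features could feed a classifier:

In [ ]:
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical loan-approval data: each row is an applicant, "approved" is the label.
loans = pd.DataFrame({
    "avg_account_balance": [1200, 300, 5400, 150],   # bank-account feature
    "credit_score":        [680, 540, 760, 500],     # credit-scoring feature
    "num_web_mentions":    [3, 0, 12, 1],            # web-appearance feature
    "approved":            [1, 0, 1, 0],             # label: 1 = loan approved
})
features = loans.drop(columns="approved")
model = LogisticRegression().fit(features, loans["approved"])
print(model.predict_proba(features)[:, 1])           # estimated approval probabilities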

History of machine learning

1950 — Alan Turing proposes the “Turing Test” to determine whether a computer has real intelligence. To pass the test, a computer must be able to fool a human into believing it is also human.

1952 — First computer learning program (Arthur Samuel) - Checkers game. The IBM computer improved at the game the more it played, studying which moves made up winning strategies and incorporating those moves into its program.

1957 — First neural network for computers - the perceptron (Frank Rosenblatt), designed to "simulate" the thought processes of the human brain.

1967 — The “nearest neighbor” algorithm was written - basic pattern recognition. It could be used to map a route for a traveling salesman, starting at a random city and ensuring all cities are visited during a short tour.

1997 — IBM’s Deep Blue beats the world champion at chess.

2006 — Geoffrey Hinton coins the term “deep learning” to explain new algorithms that let computers “see” and distinguish objects and text in images and videos.

Supervised and unsupervised learning

Supervised - labels (the desired outputs) are given for each training example


Unsupervised - no labels; the algorithm must find structure in the data on its own
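A minimal sketch of the practical difference (the data here is invented): a supervised learner is fit on features together with labels, while an unsupervised learner sees only the features.

In [ ]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])   # features
y = np.array([0, 0, 1, 1])                       # labels (only the supervised model uses these)

clf = LogisticRegression().fit(X, y)                        # supervised: features + labels
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)   # unsupervised: features only
print(clf.predict(X), clusters)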

Overfitting

Overfitting occurs when a model begins to "memorize" the training data rather than "learning" to generalize from the underlying trend

When a statistical model describes random error or noise instead of the underlying relationship.

Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.

Overfitting

A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data.
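As a small illustration (the numbers are invented), fitting polynomials of increasing degree to a few noisy points shows the effect: a high-degree polynomial drives the training error toward zero, while its held-out error typically ends up worse than that of a simple linear fit.

In [ ]:
import numpy as np

rng = np.random.RandomState(0)
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(scale=0.2, size=10)   # the true relationship is linear
x_test = np.linspace(0.05, 0.95, 10)
y_test = 2 * x_test + rng.normal(scale=0.2, size=10)

for degree in (1, 8):
    coefs = np.polyfit(x_train, y_train, degree)          # fit a polynomial of this degree
    train_err = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(degree, round(train_err, 4), round(test_err, 4))  # compare train vs. held-out error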

Linear regression - I

A linear approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables X.

Linear regression - II

OLS - The Ordinary Least Squares method minimizes the sum of squared errors.
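As a minimal numeric sketch (the data is invented), the OLS coefficients are exactly the $\beta$ that minimizes $\sum_i (y_i - X_i\beta)^2$, which np.linalg.lstsq computes directly:

In [ ]:
import numpy as np

X = np.array([[1.0, 1], [1, 2], [1, 3], [1, 4]])   # column of ones = intercept term
y = np.array([2.1, 3.9, 6.2, 7.8])                 # roughly y = 2x
beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # argmin over beta of ||y - X beta||^2
print(beta)                                        # ~[0.15, 1.94]: intercept and slope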

Logistic Regression - I

A linear approach to classification.

Logistic regression estimates the probability that a binary (0/1) dependent variable equals 1.

Logistic Regression - II

The logistic (inverse-logit) function transforms the linear regression output into a probability.
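Concretely (a standard formulation rather than the slide's figure), the linear combination $z = \beta_0 + \beta^\top x$ is mapped to a probability by $p(y=1 \mid x) = \frac{1}{1+e^{-z}}$, so the log-odds $\log\frac{p}{1-p}$ are linear in the features.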

Decision tree

Idea - series of binary questions
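A minimal sketch (the feature names and values are invented) showing that a fitted tree really is a sequence of yes/no threshold questions:

In [ ]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([[25, 1000], [40, 500], [35, 8000], [50, 9000]])   # e.g. [age, balance]
y = np.array([0, 0, 1, 1])
tree_clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree_clf, feature_names=["age", "balance"]))  # prints the learned if/else questions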

Decision tree - training

Decision tree - How to split?
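The slide's figures are not reproduced here; as a sketch of one common splitting criterion (Gini impurity, the scikit-learn default - assumed here, not stated on the slide), a split is chosen so that the resulting child nodes are as pure as possible:

In [ ]:
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions (0 = perfectly pure node)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

parent = np.array([0, 0, 0, 1, 1, 1])
left, right = np.array([0, 0, 0]), np.array([1, 1, 1])            # a perfect split
weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
print(gini(parent), weighted)                                     # 0.5 before the split, 0.0 after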

Decision tree - prediction

Decision tree vs. Logistic regression

Which model should I choose?

  1. What kind of decision boundary makes more sense in my problem?

  2. How complex is the relationship between my variables and target?

  3. Are there interactions between my features?

  4. How many features and samples do I have?

  5. Try both models and run cross-validation - this helps you find out which one is more likely to have the better generalization error (see the sketch below).
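A minimal cross-validation sketch (the dataset and model settings are illustrative, not from the slides):

In [ ]:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("decision tree", DecisionTreeClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=5)          # 5-fold cross-validation accuracy
    print(name, scores.mean())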

Error of a model - Bias Variance Tradeoff
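For reference (a standard decomposition, stated here since the slide figures are not reproduced), the expected squared error of a model at a point splits into three parts: $E[(y-\hat{f}(x))^2] = (E[\hat{f}(x)] - f(x))^2 + \mathrm{Var}(\hat{f}(x)) + \sigma^2_{noise}$, i.e. bias$^2$ + variance + irreducible noise. Simple models tend to have high bias and low variance; very flexible models the opposite.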

Bias Variance Tradeoff - II

Bias Variance Tradeoff - III

Random forest (Ensemble learning method)

Average the predictions of many decision trees, each trained on a different bootstrap sample of the data, with a random subset of the features considered at each split

  • Ensemble methods reduce the prediction variance --> Reduce the prediction error (and decrease overfitting).
  • Reminder - $var(x)=\sigma^2$, $\space\space\space$ $var(\frac {1}{n} \Sigma(x_i) ) = \frac {1}{n^2} var(\Sigma(x_i)) = \frac {n}{n^2} {\sigma^2}=\frac {\sigma^2}{n}$
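A small numeric check of that reminder (note it assumes the individual predictions are independent; real trees are only partially de-correlated, so the practical reduction is smaller):

In [ ]:
import numpy as np

rng = np.random.RandomState(0)
sigma, n_estimators = 1.0, 25
single = rng.normal(scale=sigma, size=100000)                                   # one noisy prediction
averaged = rng.normal(scale=sigma, size=(100000, n_estimators)).mean(axis=1)    # average of 25 independent ones
print(single.var(), averaged.var())                                             # ~1.0 vs. ~1/25 = 0.04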

Random forest

Random forest

Linear SVM

Find the hyperplane that maximizes the margin.

The larger the margin, the lower the generalization error of the classifier.
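In the linearly separable case this is the standard hard-margin formulation (not written out on the slide): minimize $\frac{1}{2}\|w\|^2$ subject to $y_i(w^\top x_i + b) \ge 1$ for all $i$; since the margin width is $\frac{2}{\|w\|}$, minimizing $\|w\|$ is the same as maximizing the margin.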

Non-linear SVM (using Kernels)

Kernel trick - implicitly mapping the inputs into high-dimensional feature spaces.
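As a sketch (the data is synthetic and the RBF kernel is assumed, since the slide does not name one): on points that are only separable by a circle, a linear SVM struggles while a kernelized SVM separates them without ever computing the high-dimensional mapping explicitly.

In [ ]:
import numpy as np
from sklearn import svm

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)   # label = inside the circle

for kernel in ("linear", "rbf"):
    clf = svm.SVC(kernel=kernel).fit(X, y)
    print(kernel, clf.score(X, y))                     # training accuracy: linear is much lower than rbf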

SVM - Kernel trick

Deep Learning - a glimpse

Example Data


In [14]:
%matplotlib inline
import numpy as np
import pylab as pl
from sklearn import svm
from sklearn.naive_bayes import GaussianNB
from sklearn import linear_model
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn import tree, ensemble

import pandas as pd
from matplotlib import pyplot as plt
def plot_results_with_hyperplane(clf, clf_name, df, plt_nmbr):
    x_min, x_max = df.x.min() - .5, df.x.max() + .5
    y_min, y_max = df.y.min() - .5, df.y.max() + .5
    # step between points. i.e. [0, 0.02, 0.04, ...]
    step = .02
    # to plot the boundary, we're going to create a matrix of every possible point
    # then label each point as a wolf or cow using our classifier
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step), np.arange(y_min, y_max, step))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    # this gets our predictions back into a matrix
    Z = Z.reshape(xx.shape)
    # create a subplot (we're going to have more than 1 plot on a given image)
    pl.subplot(2, 3, plt_nmbr)
    # plot the boundaries
    pl.pcolormesh(xx, yy, Z, cmap=pl.cm.Paired)
    # plot the wolves and cows
    for animal in df.animal.unique():
        pl.scatter(df[df.animal==animal].x,
                   df[df.animal==animal].y,
                   marker=animal, s=70,
                   label="cows" if animal=="x" else "wolves",
                   color='black')
    pl.title(clf_name, fontsize=20)

data = open("cows_and_wolves.txt").read()
data = [row.split('\t') for row in data.strip().split('\n')]

animals = []
for y, row in enumerate(data):
    for x, item in enumerate(row):
        # x's are cows, o's are wolves
        if item in ['o', 'x']:
            animals.append([x, y, item])

df = pd.DataFrame(animals, columns=["x", "y", "animal"])
df['animal_type'] = df.animal.apply(lambda x: 0 if x=="x" else 1)

# train using the x and y position coordinates
train_cols = ["x", "y"]

clfs = {
    "SVM": svm.SVC(),
    "Logistic Regression" : linear_model.LogisticRegression(),
    "Decision Tree": tree.DecisionTreeClassifier(),
    "Random Forest": ensemble.RandomForestClassifier(random_state=0),
    "K-Nearest Neighbors Classifier": KNeighborsClassifier(n_neighbors=3), 
    "Gaussian Naive Bayes": GaussianNB(),
}

In [15]:
plt.figure(figsize=(20,12))
plt_nmbr = 1
for clf_name, clf in clfs.items():
    clf.fit(df[train_cols], df.animal_type)
    plot_results_with_hyperplane(clf, clf_name, df, plt_nmbr)
    plt_nmbr += 1
pl.show()


Pro tips

Questions?