Textbook examples

A fairy-tale kind of situation:

  1. The data has been processed and is ready for analysis
  2. The learning objective is clearly defined
  3. A problem on which kNN works
  4. Something you simply feel good working with

Classification example

Breast Cancer Wisconsin (Diagnostic) Data Set https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)


In [ ]:
from sklearn.datasets import load_breast_cancer
ourdata = load_breast_cancer()
print(ourdata.DESCR)

In [ ]:
print(ourdata.data.shape)
ourdata.data

In [ ]:
ourdata.target

In [ ]:
ourdata.target.shape

In [ ]:
ourdata.target_names

Data preparation

Split the data into training set and test set


In [ ]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(ourdata.data, ourdata.target, test_size=0.3)
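Note that `train_test_split` shuffles the data randomly, so every run produces a different split. A small sketch (the `random_state` argument is not used in the cell above, but fixing it makes results reproducible across runs):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
# the same seed yields the same split every time
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42)
print((X_tr1 == X_tr2).all())  # True
```

Without `random_state`, your confusion matrices and F1 scores below will vary slightly from run to run.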

In [ ]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [ ]:
import numpy as np
# class balance of the training labels
# (scipy.stats.itemfreq is deprecated and has been removed from SciPy)
np.unique(y_train, return_counts=True)

Let's try a simple algorithm: kNN


In [ ]:
from sklearn.neighbors import KNeighborsClassifier
hx_knn = KNeighborsClassifier()

In [ ]:
hx_knn.fit(X_train, y_train)

In [ ]:
hx_knn.predict(X_train)
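Raw predictions are hard to eyeball. A quick sanity check (a sketch, assuming the default `n_neighbors=5` and a fixed split) is the classifier's `score` method, which reports mean accuracy:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0)

knn = KNeighborsClassifier()  # default n_neighbors=5
knn.fit(X_train, y_train)
print("train accuracy:", knn.score(X_train, y_train))
print("test accuracy:", knn.score(X_test, y_test))
```

Accuracy alone can be misleading on imbalanced classes, which is why the evaluation below uses the confusion matrix and F1 score instead.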

Training set performance evaluation

Confusion matrix

https://en.wikipedia.org/wiki/Confusion_matrix


In [ ]:
from sklearn.metrics import confusion_matrix, f1_score

print(confusion_matrix(y_train, hx_knn.predict(X_train)))
print(f1_score(y_train, hx_knn.predict(X_train)))
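For a binary problem, the 2x2 matrix can also be unpacked into individual counts. A small self-contained sketch (the toy labels here are made up for illustration; `ravel` flattens in scikit-learn's row order tn, fp, fn, tp):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])

# rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 1 1 2
```

These four counts are exactly what the F1 score is computed from: F1 = 2·tp / (2·tp + fp + fn).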

Moment of truth: test set performance evaluation


In [ ]:
print(confusion_matrix(y_test, hx_knn.predict(X_test)))
print(f1_score(y_test, hx_knn.predict(X_test)))

Classical analysis: Logistic Regression Classifier


In [ ]:
from sklearn.linear_model import LogisticRegression
hx_log = LogisticRegression()  # may emit a convergence warning on this data; raising max_iter helps if so

In [ ]:
hx_log.fit(X_train, y_train)

In [ ]:
hx_log.predict(X_train)

In [ ]:
# Training set evaluation
print(confusion_matrix(y_train, hx_log.predict(X_train)))
print(f1_score(y_train, hx_log.predict(X_train)))

In [ ]:
# test set evaluation
print(confusion_matrix(y_test, hx_log.predict(X_test)))
print(f1_score(y_test, hx_log.predict(X_test)))

Your turn: Task 1

Suppose there is a learning algorithm called Naive Bayes, whose API is documented at

http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB

Create a hx_nb, fit the data, and evaluate the training set and test set performance.
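Try it yourself first. One possible shape of the answer (a sketch only; it mirrors the variable names and evaluation pattern of the earlier cells, with a fixed `random_state` added for reproducibility):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, f1_score

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0)

hx_nb = GaussianNB()
hx_nb.fit(X_train, y_train)

# training set evaluation
print(confusion_matrix(y_train, hx_nb.predict(X_train)))
print(f1_score(y_train, hx_nb.predict(X_train)))

# test set evaluation
print(confusion_matrix(y_test, hx_nb.predict(X_test)))
print(f1_score(y_test, hx_nb.predict(X_test)))
```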

Regression example

Boston house price dataset

https://archive.ics.uci.edu/ml/datasets/Housing


In [ ]:
from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2; requires an older version
bostondata = load_boston()
print(bostondata.target)
print(bostondata.data.shape)

In [ ]:
### Learn more about the dataset

print(bostondata.DESCR)

In [ ]:
# what the first row of data looks like

bostondata.data[0]

In [ ]:
BX_train, BX_test, By_train, By_test = train_test_split(bostondata.data, bostondata.target, test_size=0.3)

Classical algorithm: Linear Regression


In [ ]:
from sklearn.linear_model import LinearRegression
hx_lin = LinearRegression()

In [ ]:
hx_lin.fit(BX_train, By_train)

In [ ]:
hx_lin.predict(BX_train)

Plot a scatter of predicted versus actual values


In [ ]:
import matplotlib.pyplot as plt
plt.scatter(By_train, hx_lin.predict(BX_train))
plt.ylabel("predicted value")
plt.xlabel("actual value")
plt.show()
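Besides eyeballing the scatter, `LinearRegression.score` returns the coefficient of determination R², where 1.0 is a perfect fit. A sketch on synthetic data (`make_regression` is a stand-in here, since `load_boston` is gone from recent scikit-learn):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# synthetic regression data as a stand-in for the Boston dataset
X, y = make_regression(n_samples=400, n_features=13, noise=15.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

lin = LinearRegression().fit(X_train, y_train)
r2 = lin.score(X_test, y_test)  # coefficient of determination, at most 1.0
print(r2)
```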

In [ ]:
### performance evaluation: training set

from sklearn.metrics import mean_squared_error
mean_squared_error(By_train, hx_lin.predict(BX_train))

In [ ]:
### performance evaluation: test set
mean_squared_error(By_test, hx_lin.predict(BX_test))
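Keep in mind that MSE is in squared units of the target; taking the square root gives RMSE back in the target's original units, which is easier to interpret. A sketch on synthetic data (again using `make_regression` as a stand-in, since `load_boston` has been removed from recent scikit-learn):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# synthetic regression data as a stand-in for the Boston dataset
X, y = make_regression(n_samples=500, n_features=13, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

lin = LinearRegression().fit(X_train, y_train)
mse = mean_squared_error(y_test, lin.predict(X_test))
rmse = np.sqrt(mse)  # same units as the target
print(mse, rmse)
```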
