Applying Machine Learning Techniques-Regression

Homepage: https://github.com/tien-le/kaggle-titanic

Updating later ...



In [1]:

    
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
import random

Load Corpus After Preprocessing ...



In [2]:

    
#Training Corpus
trn_corpus_after_preprocessing = pd.read_csv("output/trn_corpus_after_preprocessing.csv")

#Testing Corpus
tst_corpus_after_preprocessing = pd.read_csv("output/tst_corpus_after_preprocessing.csv")



In [3]:

    
#tst_corpus_after_preprocessing[tst_corpus_after_preprocessing["Fare"].isnull()]



In [4]:

    
trn_corpus_after_preprocessing.info()
print("-"*36)
tst_corpus_after_preprocessing.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 890 entries, 0 to 889
Data columns (total 13 columns):
PassengerId          890 non-null int64
Male                 890 non-null int64
Pclass               890 non-null int64
Fare                 890 non-null float64
FarePerPerson        890 non-null float64
Title                890 non-null int64
AgeUsingMeanTitle    890 non-null float64
AgeClass             890 non-null float64
SexClass             890 non-null int64
FamilySize           890 non-null int64
AgeSquared           890 non-null float64
AgeClassSquared      890 non-null float64
Survived             890 non-null int64
dtypes: float64(6), int64(7)
memory usage: 90.5 KB
------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 13 columns):
PassengerId          418 non-null int64
Male                 418 non-null int64
Pclass               418 non-null int64
Fare                 418 non-null float64
FarePerPerson        418 non-null float64
Title                418 non-null int64
AgeUsingMeanTitle    418 non-null float64
AgeClass             418 non-null float64
SexClass             418 non-null int64
FamilySize           418 non-null int64
AgeSquared           418 non-null float64
AgeClassSquared      418 non-null float64
Survived             418 non-null int64
dtypes: float64(6), int64(7)
memory usage: 42.5 KB

Basic & Advanced machine learning tools

Agenda

What is machine learning?
What are the two main categories of machine learning?
What are some examples of machine learning?
How does machine learning "work"?

What is machine learning?

One definition: "Machine learning is the semi-automated extraction of knowledge from data"

Knowledge from data: Starts with a question that might be answerable using data
Automated extraction: A computer provides the insight
Semi-automated: Requires many smart decisions by a human

What are the two main categories of machine learning?

Supervised learning: Making predictions using data

Example: Is a given email "spam" or "ham"?
There is an outcome we are trying to predict

Unsupervised learning: Extracting structure from data

Example: Segment grocery store shoppers into clusters that exhibit similar behaviors
There is no "right answer"

How does machine learning "work"?

High-level steps of supervised learning:

First, train a machine learning model using labeled data
- "Labeled data" has been labeled with the outcome
- "Machine learning model" learns the relationship between the attributes of the data and its outcome
Then, make predictions on new data for which the label is unknown

The primary goal of supervised learning is to build a model that "generalizes": It accurately predicts the future rather than the past!
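The two steps above can be sketched with scikit-learn. This is only a minimal illustration on synthetic data: the dataset comes from `make_classification`, and the names `X_labeled`, `X_new`, `demo_model` and the 75/25 split are assumptions made for the example, not part of this notebook's Titanic pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic labeled data (an assumption for illustration, not the Titanic corpus)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out part of the labeled data to stand in for "new" data
X_labeled, X_new, y_labeled, y_new = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Step 1: train a model on data that carries the outcome
demo_model = DecisionTreeClassifier(random_state=0)
demo_model.fit(X_labeled, y_labeled)

# Step 2: predict the outcome for data the model has not seen
y_new_pred = demo_model.predict(X_new)

# Accuracy on the held-out data estimates how well the model generalizes
held_out_accuracy = demo_model.score(X_new, y_new)
print("held-out accuracy:", held_out_accuracy)
```

Scoring on the held-out part rather than on the training part is what makes the number an estimate of generalization.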

Questions about machine learning

How do I choose which attributes of my data to include in the model?
How do I choose which model to use?
How do I optimize this model for best performance?
How do I ensure that I'm building a model that will generalize to unseen data?
Can I estimate how well my model is likely to perform on unseen data?

Benefits and drawbacks of scikit-learn

Benefits:

Consistent interface to machine learning models
Provides many tuning parameters but with sensible defaults
Exceptional documentation
Rich set of functionality for companion tasks
Active community for development and support

Potential drawbacks:

Harder (than R) to get started with machine learning
Less emphasis (than R) on model interpretability

Types of supervised learning

Classification: Predict a categorical response
Regression: Predict an ordered/continuous response
Note that each value we are predicting is the response (also known as: target, outcome, label, dependent variable)

Model evaluation metrics

Regression problems: Mean Absolute Error, Mean Squared Error, Root Mean Squared Error
Classification problems: Classification accuracy
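As a rough sketch of what these metrics compute, the following uses only the standard library. The true and predicted values are made up for illustration; `sklearn.metrics` provides equivalent functions (`mean_absolute_error`, `mean_squared_error`, `accuracy_score`).

```python
import math

# Hypothetical true and predicted values for a regression problem
y_true = [3.0, -0.5, 2.0, 7.0]
y_hat = [2.5, 0.0, 2.0, 8.0]

errors = [t - p for t, p in zip(y_true, y_hat)]

# Mean Absolute Error: average size of the errors
mae = sum(abs(e) for e in errors) / len(errors)
# Mean Squared Error: average squared error (penalizes large errors more)
mse = sum(e ** 2 for e in errors) / len(errors)
# Root Mean Squared Error: square root of MSE, in the units of the response
rmse = math.sqrt(mse)

# Hypothetical true and predicted classes for a classification problem
labels_true = [0, 1, 1, 0, 1]
labels_pred = [0, 1, 0, 0, 1]

# Classification accuracy: fraction of predictions that match the true class
accuracy = sum(t == p for t, p in zip(labels_true, labels_pred)) / len(labels_true)

print(mae, mse, rmse, accuracy)
```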

Load Corpus



In [5]:

    
trn_corpus_after_preprocessing.columns









    Out[5]:





Index(['PassengerId', 'Male', 'Pclass', 'Fare', 'FarePerPerson', 'Title',
       'AgeUsingMeanTitle', 'AgeClass', 'SexClass', 'FamilySize', 'AgeSquared',
       'AgeClassSquared', 'Survived'],
      dtype='object')



In [6]:

    
list_of_non_preditor_variables = ['Survived','PassengerId']



In [7]:

    
#Method 1
#x_train = trn_corpus_after_preprocessing.ix[:, trn_corpus_after_preprocessing.columns != 'Survived']
#y_train = trn_corpus_after_preprocessing.ix[:,"Survived"]

#Method 2
x_train = trn_corpus_after_preprocessing[trn_corpus_after_preprocessing.columns.difference(list_of_non_preditor_variables)].copy()
y_train = trn_corpus_after_preprocessing['Survived'].copy()
#y_train = trn_corpus_after_preprocessing.iloc[:,-1]
#y_train = trn_corpus_after_preprocessing[trn_corpus_after_preprocessing.columns[-1]]

#x_train



In [8]:

    
#y_train



In [9]:

    
x_train.columns









    Out[9]:





Index(['AgeClass', 'AgeClassSquared', 'AgeSquared', 'AgeUsingMeanTitle',
       'FamilySize', 'Fare', 'FarePerPerson', 'Male', 'Pclass', 'SexClass',
       'Title'],
      dtype='object')



In [10]:

    
# check the types of the features and response
#print(type(x_train))
#print(type(x_test))



In [11]:

    
#Method 1
#x_test = tst_corpus_after_preprocessing.ix[:, trn_corpus_after_preprocessing.columns != 'Survived']
#y_test = tst_corpus_after_preprocessing.ix[:,"Survived"]

#Method 2
x_test = tst_corpus_after_preprocessing[tst_corpus_after_preprocessing.columns.difference(list_of_non_preditor_variables)].copy()
y_test = tst_corpus_after_preprocessing['Survived'].copy()
#y_test = tst_corpus_after_preprocessing.iloc[:,-1]
#y_test = tst_corpus_after_preprocessing[tst_corpus_after_preprocessing.columns[-1]]



In [12]:

    
#x_test



In [13]:

    
#y_test



In [14]:

    
# display the first 5 rows
x_train.head()









    Out[14]:







  
    
      
   AgeClass  AgeClassSquared  AgeSquared  AgeUsingMeanTitle  FamilySize     Fare  FarePerPerson  Male  Pclass  SexClass  Title
0      66.0           4356.0       484.0               22.0           1   7.2500        3.62500     1       3         3      3
1      38.0           1444.0      1444.0               38.0           1  71.2833       35.64165     0       1         0      3
2      78.0           6084.0       676.0               26.0           0   7.9250        7.92500     0       3         0      3
3      35.0           1225.0      1225.0               35.0           1  53.1000       26.55000     0       1         0      3
4     105.0          11025.0      1225.0               35.0           0   8.0500        8.05000     1       3         3      3



In [15]:

    
# display the last 5 rows
x_train.tail()









    Out[15]:







  
    
      
       AgeClass  AgeClassSquared   AgeSquared  AgeUsingMeanTitle  FamilySize    Fare  FarePerPerson  Male  Pclass  SexClass  Title
885  117.000000      13689.00000  1521.000000          39.000000           5  29.125       4.854167     0       3         0      3
886   54.000000       2916.00000   729.000000          27.000000           0  13.000      13.000000     1       2         2      0
887   19.000000        361.00000   361.000000          19.000000           0  30.000      30.000000     0       1         0      3
888   86.061263       7406.54097   822.948997          28.687088           3  23.450       5.862500     0       3         0      3
889   26.000000        676.00000   676.000000          26.000000           0  30.000      30.000000     1       1         1      3



In [16]:

    
# check the shape of the DataFrame (rows, columns)
x_train.shape









    Out[16]:





(890, 11)

What are the features?

AgeClass:
AgeClassSquared:
AgeSquared:
...

What is the response?

Survived: 1-Yes, 0-No

What else do we know?

Because the response variable is discrete, this is a classification problem.
There are 890 observations (represented by the rows), and each observation is a single passenger.

Note that if the response variable is continuous, this is a regression problem.



In [ ]:

Decision Trees Classification



In [17]:

    
from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf = clf.fit(x_train, y_train)



In [18]:

    
#Once trained, we can export the tree in Graphviz format using the export_graphviz exporter.
#Below is an example export of the tree trained on the Titanic training corpus:
with open("output/titanic.dot", 'w') as f:
    tree.export_graphviz(clf, out_file=f)

#Then we can use Graphviz’s dot tool to create a PDF file (or any other supported file type): 
#dot -Tpdf titanic.dot -o titanic.pdf.
import os
os.unlink('output/titanic.dot')

#Alternatively, if we have Python module pydotplus installed, we can generate a PDF file 
#(or any other supported file type) directly in Python:
import pydotplus 
dot_data = tree.export_graphviz(clf, out_file=None) 
graph = pydotplus.graph_from_dot_data(dot_data) 
graph.write_pdf("output/titanic.pdf")









    Out[18]:





True



In [19]:

    
#The export_graphviz exporter also supports a variety of aesthetic options, 
#including coloring nodes by their class (or value for regression) 
#and using explicit variable and class names if desired. 
#IPython notebooks can also render these plots inline using the Image() function:


"""from IPython.display import Image  
dot_data = tree.export_graphviz(clf, out_file=None, 
                         feature_names= list(x_train.columns[1:]), #iris.feature_names,  
                         class_names= ["Survived"], #iris.target_names,  
                         filled=True, rounded=True,  
                         special_characters=True)  
graph = pydotplus.graph_from_dot_data(dot_data)  
Image(graph.create_png())"""









    Out[19]:





'from IPython.display import Image  \ndot_data = tree.export_graphviz(clf, out_file=None, \n                         feature_names= list(x_train.columns[1:]), #iris.feature_names,  \n                         class_names= ["Survived"], #iris.target_names,  \n                         filled=True, rounded=True,  \n                         special_characters=True)  \ngraph = pydotplus.graph_from_dot_data(dot_data)  \nImage(graph.create_png())'



In [20]:

    
print("accuracy score: ", clf.score(x_test,y_test))









    



accuracy score:  0.775119617225

Classification accuracy: percentage of correct predictions



In [21]:

    
#After being fitted, the model can then be used to predict the class of samples:
y_pred_class = clf.predict(x_test);

#Alternatively, the probability of each class can be predicted, 
#which is the fraction of training samples of the same class in a leaf:
clf.predict_proba(x_test);



In [22]:

    
# calculate accuracy
from sklearn import metrics

print(metrics.accuracy_score(y_test, y_pred_class))









    



0.775119617225

Null accuracy: accuracy that could be achieved by always predicting the most frequent class



In [23]:

    
# examine the class distribution of the testing set (using a Pandas Series method)
y_test.value_counts()









    Out[23]:





0    266
1    152
Name: Survived, dtype: int64



In [24]:

    
# calculate the percentage of ones
y_test.mean()









    Out[24]:





0.36363636363636365



In [25]:

    
# calculate the percentage of zeros
1 - y_test.mean()









    Out[25]:





0.63636363636363635



In [26]:

    
# calculate null accuracy (for binary classification problems coded as 0/1)
max(y_test.mean(), 1 - y_test.mean())









    Out[26]:





0.63636363636363635



In [27]:

    
# calculate null accuracy (for multi-class classification problems)
y_test.value_counts().head(1) / len(y_test)









    Out[27]:





0    0.636364
Name: Survived, dtype: float64

Comparing the true and predicted response values



In [28]:

    
# print the first 25 true and predicted responses
from __future__ import print_function
print('True:', y_test.values[0:25])
print('Pred:', y_pred_class[0:25])









    



True: [0 1 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 1 1 0 0 1 0 1]
Pred: [0 0 1 1 1 0 0 0 1 0 0 0 1 0 1 1 0 1 0 0 0 1 1 0 1]

Conclusion:

Classification accuracy is the easiest classification metric to understand
But, it does not tell you the underlying distribution of response values
And, it does not tell you what "types" of errors your classifier is making

Confusion matrix

Table that describes the performance of a classification model



In [29]:

    
# IMPORTANT: first argument is true values, second argument is predicted values
print(metrics.confusion_matrix(y_test, y_pred_class))

Basic terminology

True Positives (TP): we correctly predicted that the passenger survived
True Negatives (TN): we correctly predicted that the passenger did not survive
False Positives (FP): we incorrectly predicted that the passenger survived (a "Type I error")
False Negatives (FN): we incorrectly predicted that the passenger did not survive (a "Type II error")



In [30]:

    
# save confusion matrix and slice into four pieces
confusion = metrics.confusion_matrix(y_test, y_pred_class)
TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]



In [31]:

    
print(TP, TN, FP, FN)









    



110 214 52 42

Metrics computed from a confusion matrix

Classification Accuracy: Overall, how often is the classifier correct?



In [32]:

    
print((TP + TN) / float(TP + TN + FP + FN))
print(metrics.accuracy_score(y_test, y_pred_class))









    



0.775119617225
0.775119617225

Classification Error: Overall, how often is the classifier incorrect?

Also known as "Misclassification Rate"



In [33]:

    
print((FP + FN) / float(TP + TN + FP + FN))
print(1 - metrics.accuracy_score(y_test, y_pred_class))









    



0.224880382775
0.224880382775

Specificity: When the actual value is negative, how often is the prediction correct?

How "specific" (or "selective") is the classifier in predicting positive instances?



In [34]:

    
print(TN / float(TN + FP))









    



0.804511278195

False Positive Rate: When the actual value is negative, how often is the prediction incorrect?



In [35]:

    
print(FP / float(TN + FP))









    



0.195488721805

Precision: When a positive value is predicted, how often is the prediction correct?

How "precise" is the classifier when predicting positive instances?



In [36]:

    
print(TP / float(TP + FP))
print(metrics.precision_score(y_test, y_pred_class))









    



0.679012345679
0.679012345679



In [37]:

    
print("Precision: ", metrics.precision_score(y_test, y_pred_class))
print("Recall: ", metrics.recall_score(y_test, y_pred_class))
print("F1 score: ", metrics.f1_score(y_test, y_pred_class))



Precision:  0.679012345679
Recall:  0.723684210526
F1 score:  0.700636942675

Many other metrics can be computed: F1 score, Matthews correlation coefficient, etc.

Conclusion:

Confusion matrix gives you a more complete picture of how your classifier is performing
Also allows you to compute various classification metrics, and these metrics can guide your model selection

Which metrics should you focus on?

Choice of metric depends on your business objective
Spam filter (positive class is "spam"): Optimize for precision or specificity because false negatives (spam goes to the inbox) are more acceptable than false positives (non-spam is caught by the spam filter)
Fraudulent transaction detector (positive class is "fraud"): Optimize for sensitivity because false positives (normal transactions that are flagged as possible fraud) are more acceptable than false negatives (fraudulent transactions that are not detected)
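Sensitivity (also called recall or true positive rate) is referred to above but not computed by hand earlier. The sketch below derives it, together with the other metrics, from the four counts that the decision tree's confusion matrix produced earlier in this notebook (110, 214, 52, 42); the lowercase variable names are chosen for the example.

```python
# Counts printed earlier for the decision tree classifier: TP, TN, FP, FN
TP, TN, FP, FN = 110, 214, 52, 42

# Sensitivity (recall): when the actual value is positive,
# how often is the prediction correct?
sensitivity = TP / float(TP + FN)

# Specificity: when the actual value is negative,
# how often is the prediction correct?
specificity = TN / float(TN + FP)

# Precision: when a positive value is predicted,
# how often is the prediction correct?
precision = TP / float(TP + FP)

# F1 score: harmonic mean of precision and sensitivity
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(sensitivity, specificity, precision, f1)
```

The sensitivity computed this way agrees with the value reported by `metrics.recall_score` above.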

Support Vector Machine (SVM)

Linear Support Vector Classification.

Similar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.

Ref: http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC



In [38]:

    
from sklearn import svm

model = svm.LinearSVC()

model.fit(x_train, y_train)









    Out[38]:





LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)



In [39]:

    
acc_score = model.score(x_test, y_test)

print("Accuracy score: ", acc_score)









    



Accuracy score:  0.593301435407



In [40]:

    
y_pred_class = model.predict(x_test)



In [41]:

    
from sklearn import metrics



In [42]:

    
confusion_matrix = metrics.confusion_matrix(y_test, y_pred_class)

print(confusion_matrix)

Classifier comparison

http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

A comparison of several classifiers in scikit-learn on synthetic datasets. The point of this example is to illustrate the nature of decision boundaries of different classifiers. This should be taken with a grain of salt, as the intuition conveyed by these examples does not necessarily carry over to real datasets.

Particularly in high-dimensional spaces, data can more easily be separated linearly and the simplicity of classifiers such as naive Bayes and linear SVMs might lead to better generalization than is achieved by other classifiers.

The plots show training points in solid colors and testing points semi-transparent. The lower right shows the classification accuracy on the test set.



In [43]:

    
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from matplotlib.colors import ListedColormap



In [ ]:



In [44]:

    
#classifiers



In [45]:

    
#x_train



In [46]:

    
#sns.pairplot(x_train)



In [47]:

    
x_train_scaled = StandardScaler().fit_transform(x_train)

x_test_scaled = StandardScaler().fit_transform(x_test)
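The cell above fits a second scaler on the test set, so the test data is standardized with its own mean and standard deviation. A common alternative is to compute the statistics on the training data only and reuse them for the test data. The sketch below shows that idea with plain NumPy on made-up arrays (`train` and `test` are hypothetical); with scikit-learn the same effect is usually obtained by calling `fit_transform` on the training set and `transform` on the test set with one `StandardScaler` instance.

```python
import numpy as np

# Made-up feature matrices (an assumption for illustration only)
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Statistics are computed from the training set only
train_mean = train.mean(axis=0)
train_std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Standardize both sets with the training statistics
train_scaled = (train - train_mean) / train_std
test_scaled = (test - train_mean) / train_std

print(train_scaled)
print(test_scaled)
```

With this approach the test set never influences the preprocessing, so the test score is a cleaner estimate of performance on unseen data.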



In [48]:

    
x_train_scaled[0]









    Out[48]:





array([ 0.02743953, -0.187067  , -0.64095047, -0.59652571,  0.05850706,
       -0.50278454, -0.4549534 ,  0.73833521,  0.82816049,  1.10507752,
        0.1608944 ])



In [49]:

    
len(x_train_scaled[0])









    Out[49]:





11



In [50]:

    
df_x_train_scaled = pd.DataFrame(columns=x_train.columns, data=x_train_scaled)



In [51]:

    
df_x_train_scaled.head()









    Out[51]:







  
    
      
    AgeClass  AgeClassSquared  AgeSquared  AgeUsingMeanTitle  FamilySize       Fare  FarePerPerson       Male     Pclass   SexClass     Title
0   0.027440        -0.187067   -0.640950          -0.596526    0.058507  -0.502785      -0.454953   0.738335   0.828160   1.105078  0.160894
1  -0.820101        -0.747159    0.436930           0.633468    0.058507   0.785958       0.438395  -1.354398  -1.564901  -1.175106  0.160894
2   0.390671         0.145295   -0.425374          -0.289027   -0.561389  -0.489199      -0.334972  -1.354398   0.828160  -1.175106  0.160894
3  -0.910909        -0.789281    0.191038           0.402844    0.058507   0.419998       0.184714  -1.354398  -1.564901  -1.175106  0.160894
4   1.207943         1.095643    0.191038           0.402844   -0.561389  -0.486684      -0.331484   0.738335   0.828160   1.105078  0.160894



In [52]:

    
#sns.pairplot(df_x_train_scaled)



In [53]:

    
names = ["Nearest Neighbors", "Linear SVM", "RBF SVM",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "QDA", "Gaussian Process"]

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()
    #, GaussianProcessClassifier(1.0 * RBF(1.0), warm_start=True), # Take too long...
    ]

# iterate over classifiers
for name, model in zip(names, classifiers):
    model.fit(x_train_scaled, y_train)
    acc_score = model.score(x_test_scaled, y_test)
    print(name, " - accuracy score: ", acc_score)
#end for









    



Nearest Neighbors  - accuracy score:  0.777511961722
Linear SVM  - accuracy score:  1.0
RBF SVM  - accuracy score:  0.877990430622
Decision Tree  - accuracy score:  0.937799043062
Random Forest  - accuracy score:  0.856459330144
Neural Net  - accuracy score:  0.911483253589
AdaBoost  - accuracy score:  0.88038277512
Naive Bayes  - accuracy score:  0.827751196172
QDA  - accuracy score:  0.777511961722



Decision Tree Regressor

Ref: http://scikit-learn.org/stable/modules/tree.html

Decision trees can also be applied to regression problems, using the DecisionTreeRegressor class.

As in the classification setting, the fit method will take as argument arrays X and y, only that in this case y is expected to have floating point values instead of integer values:



In [54]:

    
from sklearn import tree


clf = tree.DecisionTreeRegressor()
clf = clf.fit(x_train, y_train)

clf.score(x_test,y_test)









    Out[54]:





0.018129242376775268



In [55]:

    
#clf.predict(x_test)

Random Forests



In [ ]:



In [ ]:

Naive Bayes



In [ ]:

Simple Linear Regression

Recall that Simple Linear Regression is given by the following equation: $y = \alpha + \beta x$

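Our goal is to solve for the values of $\alpha$ and $\beta$ that minimize the cost function.

$$\beta = \frac{cov(x,y)}{var(x)}$$

where $var(x)$ denotes the variance of $x$, a measure of how far a set of values is spread out, and $cov(x,y)$ denotes the covariance of $x$ and $y$, a measure of how much the two variables change together.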

Note that:

Variance is zero if all of the values in the set are equal.
A SMALL variance indicates that the numbers are NEAR the mean of the set
A LARGE variance indicates that the numbers are FAR from the mean of the set

$$var(x) = \frac{\sum\limits_{i=1}^{n}{\left( x_i - \overline{x} \right)^2}}{n-1}$$

$$cov(x,y) = \frac{\sum\limits_{i=1}^{n}{\left( x_i - \overline{x} \right)\left( y_i - \overline{y} \right)}}{n-1}$$

Having solved for $\beta$, we can estimate $\alpha$ using the following formula: $$\alpha = \overline{y} - \beta \overline{x}$$
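The closed-form estimates can be checked numerically. The following is a small sketch on made-up data (the arrays `x` and `y` are assumptions for illustration and are unrelated to the Titanic corpus); it computes the sample variance and covariance with the $n-1$ denominator and derives $\beta$ and $\alpha$ from them.

```python
import numpy as np

# Made-up single-feature data (an assumption for illustration)
x = np.array([6.0, 8.0, 10.0, 14.0, 18.0])
y = np.array([7.0, 9.0, 13.0, 17.5, 18.0])

n = len(x)
x_bar, y_bar = x.mean(), y.mean()

# Sample variance of x and sample covariance of x and y
var_x = np.sum((x - x_bar) ** 2) / (n - 1)
cov_xy = np.sum((x - x_bar) * (y - y_bar)) / (n - 1)

# Slope and intercept of the least-squares line
beta = cov_xy / var_x
alpha = y_bar - beta * x_bar

print("beta = %.4f, alpha = %.4f" % (beta, alpha))
```

The same slope and intercept should come out of `np.polyfit(x, y, 1)`, which is a convenient cross-check.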

Evaluating the Model

Using r-squared, which measures how well the observed values of the response variable are predicted by the model. In the case of simple linear regression, r-squared is equal to the square of Pearson's r. Computed this way, r-squared must be a number between zero and one. Other methods, including the one used by scikit-learn's score method, can return a negative r-squared if the model performs extremely poorly.
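To make the last point concrete, here is a sketch of r-squared computed as one minus the ratio of the residual sum of squares to the total sum of squares, which is the definition scikit-learn documents for the `score` method of its regressors. The helper name `r_squared` and the sample values are hypothetical.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for the given true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_obs = [1.0, 2.0, 3.0, 4.0]

# Predictions close to the observations give a value near one
good = r_squared(y_obs, [1.1, 1.9, 3.2, 3.8])

# Predictions worse than always predicting the mean give a negative value
bad = r_squared(y_obs, [4.0, 3.0, 2.0, 1.0])

print(good, bad)
```

A negative value means the model predicts worse than a constant model that always outputs the mean of the observed responses.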



In [56]:

    
from sklearn.linear_model import LinearRegression



In [57]:

    
model = LinearRegression()

model.fit(x_train, y_train)

r_squared = model.score(x_test, y_test)

print("R-squared: %.4f" %r_squared)









    



R-squared: 0.6787

Multiple Linear Regression

Formally, multiple linear regression is the following model:

$$y = \alpha+\beta_1x_1+\beta_2x_2+...+\beta_nx_n$$

$$Y = X\beta$$

where $Y$ denotes a column vector of the values of the response variable for training, $\beta$ denotes a column vector of the values of the model's parameters, and $X$ is called the design matrix, an $m \times n$ dimensional matrix of the values of the features. The intercept $\alpha$ is absorbed into $\beta$ by including a constant column of ones in $X$.

We can solve for $\beta$ as follows:

$$\beta = \left( X^TX \right)^{-1}X^TY$$

In Python, this can be computed as follows:

from numpy import dot, transpose
from numpy.linalg import inv

# beta = (X^T X)^{-1} X^T Y
beta = dot(inv(dot(transpose(X), X)), dot(transpose(X), Y))
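For a runnable version of this formula, the sketch below builds a small made-up design matrix (the values in `X` and `Y` are assumptions for illustration) with a leading column of ones for the intercept, solves the normal equation, and compares the result with NumPy's least-squares solver.

```python
import numpy as np

# Made-up design matrix: the first column of ones models the intercept
X = np.array([[1.0, 6.0, 2.0],
              [1.0, 8.0, 1.0],
              [1.0, 10.0, 0.0],
              [1.0, 14.0, 2.0],
              [1.0, 18.0, 0.0]])
Y = np.array([7.0, 9.0, 13.0, 17.5, 18.0])

# Normal equation: beta = (X^T X)^{-1} X^T Y
beta_normal = np.linalg.inv(X.T @ X) @ X.T @ Y

# Least-squares solver, which avoids forming the explicit inverse
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(beta_normal)
print(beta_lstsq)
```

Both approaches should agree here; in practice the least-squares solver tends to be the safer choice when $X^TX$ is close to singular.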



In [58]:

    
from sklearn.linear_model import LinearRegression



In [59]:

    
model = LinearRegression()

model.fit(x_train, y_train)

predictions = model.predict(x_test)



In [60]:

    
#for i in range(predictions.size):
#    print("Predicted: %.2f, Target: %.2f" %(predictions[i], y_test[i]))

r_squared = model.score(x_test, y_test)
    
print("R-squared: %.4f" %r_squared)









    



R-squared: 0.6787

Polynomial Regression

Quadratic regression, regression with a second-order polynomial, is given by the following formula:

$$y = \alpha + \beta_1 x + \beta_2 x^2$$
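Quadratic regression is still linear in its parameters: the feature $x$ is expanded into the columns $1$, $x$ and $x^2$, and an ordinary linear model is fitted on those columns. A minimal sketch of the expansion step with scikit-learn's `PolynomialFeatures` follows; the input array `x_single` and the name `featurizer` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two made-up observations of a single feature
x_single = np.array([[2.0], [3.0]])

# degree=2 expands each value x into the columns [1, x, x^2]
featurizer = PolynomialFeatures(degree=2)
x_quadratic = featurizer.fit_transform(x_single)

print(x_quadratic)
```

With several input features, the expansion also includes the pairwise interaction terms, so the number of columns grows quickly.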



In [61]:

    
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures



In [62]:

    
model = LinearRegression()

model.fit(x_train, y_train)









    Out[62]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [63]:

    
xx = np.linspace(0, 26, 100)
#yy = np.linspace(0, 26, 100)

#yy = model.predict(xx.reshape(xx.shape[0],1))

#plt.plot(xx, yy)



In [64]:

    
quadratic_featurizer = PolynomialFeatures(degree=2)

x_train_quadratic = quadratic_featurizer.fit_transform(x_train)
# transform (not fit) the test set so that it gets the same polynomial features
x_test_quadratic = quadratic_featurizer.transform(x_test)



In [65]:

    
x_train.head()









    Out[65]:







  
    
      
   AgeClass  AgeClassSquared  AgeSquared  AgeUsingMeanTitle  FamilySize     Fare  FarePerPerson  Male  Pclass  SexClass  Title
0      66.0           4356.0       484.0               22.0           1   7.2500        3.62500     1       3         3      3
1      38.0           1444.0      1444.0               38.0           1  71.2833       35.64165     0       1         0      3
2      78.0           6084.0       676.0               26.0           0   7.9250        7.92500     0       3         0      3
3      35.0           1225.0      1225.0               35.0           1  53.1000       26.55000     0       1         0      3
4     105.0          11025.0      1225.0               35.0           0   8.0500        8.05000     1       3         3      3



In [66]:

    
model_quadratic = LinearRegression()

model_quadratic.fit(x_train_quadratic, y_train)

#predictions = model_quadratic.predict(x_test_quadratic)

#r_squared = model_quadratic.score(x_test_quadratic, y_test)

#r_squared
    
#print("R-squared: %.4f" %r_squared)









    Out[66]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



Linear Regression 2



In [ ]:

Logistic Regression



In [ ]:

SVM



In [ ]:

KNN (K- Nearest Neighbors)



In [ ]:


