Traditional Classifiers

Classification problems have long been at the heart of applied statistics, but it wasn't until the advent of the computer that their power really began to manifest. Some of the simplest algorithms are Naive Bayes, K-nearest Neighbors, and Logistic Regression. Calling these algorithms "simple" by no means implies that they are ineffective, in fact, due to their simplicity and speed, they are often employed in workflows. In this exercise we'll explore these classifiers and compare the performance against one another.

Naive Bayes

Naive Bayes classifiers have been studied extensively since the 1950s and 1960s, notably in the domain of text classification, being especially useful for email spam detection. They work by assuming a prior probability distribution in each of the classes, and then update each distribution given the training data using Bayes's Theorem given by $${\displaystyle p(C_{k}\mid \mathbf {x} )={\frac {p(C_{k})\ p(\mathbf {x} \mid C_{k})}{p(\mathbf {x} )}}\,}$$ where $C_{k}$ is the kth class, and $x$ is the training data. The "naive" part of the algorithm comes into play with the assumption that each data point is independent of the previous, leading to the nice property that the joint pobability can be computed as the simple product. New data is then classified by implementing a decision rule, typically the probabiliy of the new data point belonging to each class is computed and the largest is chosen.

1 - Head over to the Machine Learning Repository, download the SMS Spam Collection Data Set, put it into a dataframe, process it as a bag-of-words using word counts, and split into training and test sets. Be sure to familiarize yourself with the data before proceeding.



In [1]:

    
import pandas as pd
import numpy as np



In [2]:

    
# read data
spam = pd.read_csv('SMSSpamCollection', sep='\t', header=None, encoding='latin-1')
spam.columns = ['class', 'text']



In [3]:

    
# create bag-of-words counts
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
wordcount = vec.fit_transform(spam.text)
# create X and y matrices
X_spam = wordcount.toarray()
y_spam = np.array(spam['class'] == 'spam', dtype=int)

In the solution: first split then fitting transform to train and applying to test:

(Also, I don't need to transform y in 1s and 0s array.



In [4]:

    
# cv = CountVectorizer()
# X_train = cv.fit_transform(X_train)
# X_test = cv.fit_transform(X_test)



In [5]:

    
# train and test split
from sklearn.model_selection import train_test_split
X_spam_train, X_spam_test, y_spam_train, y_spam_test = train_test_split(X_spam, y_spam, test_size=0.3, random_state = 0)

2 - Which of the available Naive Bayes Classifiers is most appropriate to the data? Choose one and fit it to the data, using the default hyperparameter settings, and report the training and testing accuracies. Comment on your results.

I would use a multinomial classifier, since we are dealing with counts:



In [6]:

    
from sklearn.naive_bayes import MultinomialNB

NB = MultinomialNB()
NB.fit(X_spam_train, y_spam_train)









    Out[6]:





MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)



In [7]:

    
print('Training accuracy: {:.5f}'.format(NB.score(X_spam_train, y_spam_train)))
print('Testing accuracy: {:.5f}'.format(NB.score(X_spam_test, y_spam_test)))









    



Training accuracy: 0.99462
Testing accuracy: 0.98266

Seems a pretty good accuracy, in and out of sample.

3 - Try a few different settings of the alpha parameter, printing the training and testing accuracies. Also try switching the fit_prior parameter on and off. Comment on your results.



In [8]:

    
# alpha can be greater than 1!
NBparams = pd.DataFrame(columns=['alpha', 'fit_prior', 'train_accuracy', 'test_accuracy'])

for alpha in np.arange(0, 1.1, 0.1):
    for fit_prior in [True, False]:
        NB.set_params(alpha=alpha, fit_prior=fit_prior)
        NB.fit(X_spam_train, y_spam_train)
        NBparams = NBparams.append(pd.DataFrame([[alpha,
                                                  fit_prior,
                                                  NB.score(X_spam_train, y_spam_train),
                                                  NB.score(X_spam_test, y_spam_test)]],
                                                columns=['alpha', 'fit_prior', 'train_accuracy', 'test_accuracy']))









    



C:\Users\alessandro.diantonio\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\naive_bayes.py:699: RuntimeWarning: divide by zero encountered in log
  self.feature_log_prob_ = (np.log(smoothed_fc) -
C:\Users\alessandro.diantonio\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\naive_bayes.py:699: RuntimeWarning: divide by zero encountered in log
  self.feature_log_prob_ = (np.log(smoothed_fc) -



In [9]:

    
NBparams.sort_values(by='train_accuracy', ascending=False)









    Out[9]:






  
    
      
      alpha
      fit_prior
      train_accuracy
      test_accuracy
    
  
  
    
      0
      0.1
      True
      0.996923
      0.980861
    
    
      0
      0.2
      True
      0.996667
      0.982057
    
    
      0
      0.3
      True
      0.995385
      0.982656
    
    
      0
      0.4
      True
      0.995385
      0.983254
    
    
      0
      0.5
      True
      0.994615
      0.983852
    
    
      0
      1.0
      True
      0.994615
      0.982656
    
    
      0
      0.6
      True
      0.994359
      0.983254
    
    
      0
      0.7
      True
      0.994103
      0.982656
    
    
      0
      0.9
      True
      0.993846
      0.981459
    
    
      0
      0.8
      True
      0.993846
      0.982057
    
    
      0
      0.1
      False
      0.993333
      0.971890
    
    
      0
      0.2
      False
      0.991282
      0.972488
    
    
      0
      0.3
      False
      0.990769
      0.970096
    
    
      0
      0.4
      False
      0.990513
      0.968301
    
    
      0
      0.5
      False
      0.990000
      0.967703
    
    
      0
      0.6
      False
      0.987692
      0.968301
    
    
      0
      0.8
      False
      0.987692
      0.967703
    
    
      0
      0.7
      False
      0.987179
      0.968301
    
    
      0
      0.9
      False
      0.986667
      0.968301
    
    
      0
      1.0
      False
      0.986410
      0.968301
    
    
      0
      0.0
      False
      0.865128
      0.867823
    
    
      0
      0.0
      True
      0.865128
      0.867823



In [10]:

    
NBparams.sort_values(by='test_accuracy', ascending=False)









    Out[10]:






  
    
      
      alpha
      fit_prior
      train_accuracy
      test_accuracy
    
  
  
    
      0
      0.5
      True
      0.994615
      0.983852
    
    
      0
      0.4
      True
      0.995385
      0.983254
    
    
      0
      0.6
      True
      0.994359
      0.983254
    
    
      0
      0.3
      True
      0.995385
      0.982656
    
    
      0
      1.0
      True
      0.994615
      0.982656
    
    
      0
      0.7
      True
      0.994103
      0.982656
    
    
      0
      0.2
      True
      0.996667
      0.982057
    
    
      0
      0.8
      True
      0.993846
      0.982057
    
    
      0
      0.9
      True
      0.993846
      0.981459
    
    
      0
      0.1
      True
      0.996923
      0.980861
    
    
      0
      0.2
      False
      0.991282
      0.972488
    
    
      0
      0.1
      False
      0.993333
      0.971890
    
    
      0
      0.3
      False
      0.990769
      0.970096
    
    
      0
      1.0
      False
      0.986410
      0.968301
    
    
      0
      0.7
      False
      0.987179
      0.968301
    
    
      0
      0.4
      False
      0.990513
      0.968301
    
    
      0
      0.6
      False
      0.987692
      0.968301
    
    
      0
      0.9
      False
      0.986667
      0.968301
    
    
      0
      0.5
      False
      0.990000
      0.967703
    
    
      0
      0.8
      False
      0.987692
      0.967703
    
    
      0
      0.0
      False
      0.865128
      0.867823
    
    
      0
      0.0
      True
      0.865128
      0.867823

The best models all have the fit prior parameter set to True. Also, we see that the best train accuracy is obtained with small $\alpha$, but that doesn't translate to the best test accuracies that are, in general, obtained with $\alpha$ near to 1. The documentation says that "the smoothing priors $\alpha \ge 0$ accounts for features not present in the learning samples and prevents zero probabilities in further computations"; so I would say that small values of $\alpha$ cause some overfitting of the train set while higher values works better because we have a sparse matrix of features.

That being said, the accuracy is high in general.

We can also observe that the models trained with $\alpha = 0$ are by far the worst ones.

4 - Generate a word cloud for the ham and spam words.



In [11]:

    
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline



In [12]:

    
from wordcloud import WordCloud
# counting ham words
hamvec = CountVectorizer()
hamcount = hamvec.fit_transform(spam.loc[spam['class'] == 'ham', 'text'].values)
# creating a series containing word and relative count
hamcountdict = {}
for word, count in zip(hamvec.get_feature_names(), np.asarray(hamcount.sum(axis=0)).ravel()):
    hamcountdict[word] = count
hamcountseries = pd.Series(hamcountdict)
hamcountseries / hamcountseries.sum()

# counting spam words
spamvec = CountVectorizer()
spamcount = spamvec.fit_transform(spam.loc[spam['class'] == 'spam', 'text'].values)
# creating a series containing word and relative count
spamcountdict = {}
for word, count in zip(spamvec.get_feature_names(), np.asarray(spamcount.sum(axis=0)).ravel()):
    spamcountdict[word] = count
spamcountseries = pd.Series(spamcountdict)
spamcountseries / spamcountseries.sum()

# creating two wordclouds from the frequencies calculated on the series
wordcloudham = WordCloud(background_color='white', max_words=len(hamvec.vocabulary_)).generate_from_frequencies(hamcountseries / hamcountseries.sum())
wordcloudspam = WordCloud(background_color='white', max_words=len(spamvec.vocabulary_)).generate_from_frequencies(spamcountseries / spamcountseries.sum())



In [13]:

    
# # solution:

# import nltk
# from nltk.corpus import stopwordswords

# # Split ham and spam rows
# ham = data[data['class'] == 'ham'].text
# spam = data[data['class'] == 'spam'].text

# # Get counts, removing stopwords
# ham_words = ''
# for row in ham:
#     test = row.lower()
#     tokens = nltk.word_tokenize(text)
#     for words in tokens:
#         ham_words = ham_words + words + ' '

# spam_words = ''
# for row in spam:
#     test = row.lower()
#     tokens = nltk.word_tokenize(text)
#     for words in tokens:
#         spam_words = spam_words + words + ' '

# # Generate a word cloud image
# ham_wordcloud = WordCloud(width=600, height=400).generate(ham_words)
# spam_wordcloud = WordCloud(width=600, height=400).generate(spam_words)



In [14]:

    
size = 20
fig = plt.figure(figsize=(size, size))
ax = plt.axes()
ax.set_title('Ham Words', size=40)
ax.imshow(wordcloudham, interpolation='bilinear')
ax.axis('off');



In [15]:

    
size = 20
fig = plt.figure(figsize=(size, size))
ax = plt.axes()
ax.set_title('Spam Words', size=40)
ax.imshow(wordcloudspam, interpolation='bilinear')
ax.axis('off');

5 - Naive Bayes can also be used for classification of continous data. Generate a set of random data points, $x1$ and $x2$, classifiying each point as being above or below the line f(x) = x, fit a Naive Bayes model, report the accuracy, and plot the decision boundary. Comment on your results.



In [16]:

    
# generate data
np.random.seed(687)
x1 = np.random.rand(100)
x2 = np.random.rand(100)

X_cont = np.vstack((x1, x2)).T
y_cont = np.array((x1 >= x2), dtype=int)



In [17]:

    
from sklearn.naive_bayes import GaussianNB
# fit model
NB_cont = GaussianNB()
NB_cont.fit(X_cont, y_cont)
# get accuracy
NB_cont.score(X_cont, y_cont)









    Out[17]:





0.94999999999999996



In [18]:

    
def plot_decision_boundary(X, y, classifier, resolution):
    # plot decision boundaries
    cmap = plt.cm.get_cmap('viridis')
    markers = ('o', '^', 's', 'x', 'v')
    x1_min, x1_max = X[:, 0].min() - 0.1, X[:, 0].max() + 0.1
    x2_min, x2_max = X[:, 1].min() - 0.1, X[:, 1].max() + 0.1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                          np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())
    # scatter points
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0],
                    y=X[y == cl, 1],
                    alpha=0.6,
                    c=np.array(cmap.colors[idx]),
                    edgecolor='black',
                    marker=markers[idx],
                    label='cl'
                   )



In [19]:

    
plot_decision_boundary(X_cont, y_cont, NB_cont, 0.002)

K-Nearest Neighbor

Another one of the simplest of classificaiton algorithms is the K-Neareast Neighbors. KNN works by simply classifying a datapoint by its proximity to other datapoints of known category. There are a few different evaluation metrics to choose for "distance" as well as different implementations of the algorithm, but the idea is always the same.

1 - Generate a set of 100 random data points according to the function ${\displaystyle f(x) = N(x\;|\;\mu ,\sigma ^{2}) + \epsilon}$ where $\mu = 0$, $\sigma = 1$, and $\epsilon$ is a noise term. Then classify each point as being above or below the curve, fit a KNN classifier, using the default values, and plot the decision boundary and report the accuracy. Comment on your results.



In [20]:

    
from scipy.stats import norm
# generate data
np.random.seed(459)
a = -3
b = 3
# in solution it uses a linspace between -3 and 3
x1_KNN = (b - a)*np.random.rand(100) + a
a = -0.5
b = 0.5
noise = (b - a)*np.random.rand(100) + a
x2_KNN = norm.pdf(x1_KNN) + noise

X_KNN = np.vstack((x1_KNN, x2_KNN)).T
y_KNN = np.array((x2_KNN >= norm.pdf(x1_KNN)), dtype=int)



In [21]:

    
from sklearn.neighbors import KNeighborsClassifier
# fit model
KNN = KNeighborsClassifier()
KNN.fit(X_KNN, y_KNN)
# get accuracy
KNN.score(X_KNN, y_KNN)









    Out[21]:





0.95999999999999996



In [22]:

    
plot_decision_boundary(X_KNN, y_KNN, KNN, 0.002)

2 - Using the data generated in part (1), fit a series of KNN models, adjusting the n_neighbors parameter, and plot the decision boundary and accuracy for each. Comment on your results.



In [23]:

    
n_neighbors = np.arange(1, 11)
for n in n_neighbors:
    # set params
    KNN.set_params(n_neighbors=n)
    # fit
    KNN.fit(X_KNN, y_KNN)
    # get accuracy
    print('KNN with {} neighbors accuracy: {:.5f}'.format(n, KNN.score(X_KNN, y_KNN)))
    # plot boundaries
    fig, ax = plt.subplots(1, 1)
    ax.set_title('KNN with {} neighbors'.format(n))
    plot_decision_boundary(X_KNN, y_KNN, KNN, 0.002)









    



KNN with 1 neighbors accuracy: 1.00000
KNN with 2 neighbors accuracy: 0.96000
KNN with 3 neighbors accuracy: 0.95000
KNN with 4 neighbors accuracy: 0.95000
KNN with 5 neighbors accuracy: 0.96000
KNN with 6 neighbors accuracy: 0.95000
KNN with 7 neighbors accuracy: 0.94000
KNN with 8 neighbors accuracy: 0.91000
KNN with 9 neighbors accuracy: 0.87000
KNN with 10 neighbors accuracy: 0.87000

The best result is for 1 neighbors, but by looking at the decision boundaries we may get the best generalization for 2 neighbors. We also get good results for $n$ between 2 and 7, while performances decrease for 8 or more neighbors.

3 - Head over to the Machine Learning Repository, download the Connectionist Bench (Sonar, Mines vs. Rocks) Data Set, put it into a data frame, and split into training and testing sets. Be sure to familiarize yourself with the data before proceeding.



In [24]:

    
sonar = pd.read_csv('sonar.all-data', header=None)
sonar.columns = ['freq'+str(i) for i in range(60)] + ['class']

print(sonar.shape)
sonar.head()









    



(208, 61)






    Out[24]:






  
    
      
      freq0
      freq1
      freq2
      freq3
      freq4
      freq5
      freq6
      freq7
      freq8
      freq9
      ...
      freq51
      freq52
      freq53
      freq54
      freq55
      freq56
      freq57
      freq58
      freq59
      class
    
  
  
    
      0
      0.0200
      0.0371
      0.0428
      0.0207
      0.0954
      0.0986
      0.1539
      0.1601
      0.3109
      0.2111
      ...
      0.0027
      0.0065
      0.0159
      0.0072
      0.0167
      0.0180
      0.0084
      0.0090
      0.0032
      R
    
    
      1
      0.0453
      0.0523
      0.0843
      0.0689
      0.1183
      0.2583
      0.2156
      0.3481
      0.3337
      0.2872
      ...
      0.0084
      0.0089
      0.0048
      0.0094
      0.0191
      0.0140
      0.0049
      0.0052
      0.0044
      R
    
    
      2
      0.0262
      0.0582
      0.1099
      0.1083
      0.0974
      0.2280
      0.2431
      0.3771
      0.5598
      0.6194
      ...
      0.0232
      0.0166
      0.0095
      0.0180
      0.0244
      0.0316
      0.0164
      0.0095
      0.0078
      R
    
    
      3
      0.0100
      0.0171
      0.0623
      0.0205
      0.0205
      0.0368
      0.1098
      0.1276
      0.0598
      0.1264
      ...
      0.0121
      0.0036
      0.0150
      0.0085
      0.0073
      0.0050
      0.0044
      0.0040
      0.0117
      R
    
    
      4
      0.0762
      0.0666
      0.0481
      0.0394
      0.0590
      0.0649
      0.1209
      0.2467
      0.3564
      0.4459
      ...
      0.0031
      0.0054
      0.0105
      0.0110
      0.0015
      0.0072
      0.0048
      0.0107
      0.0094
      R
    
  

5 rows × 61 columns



In [25]:

    
sonar.describe().T



In [26]:

    
sonar.loc[sonar['class'] == 'R'].describe().T



In [27]:

    
sonar.loc[sonar['class'] == 'M'].describe().T



In [28]:

    
X_sonar = sonar.iloc[:, :-1].values
y_sonar = sonar.iloc[:, -1].values

X_sonar_train, X_sonar_test, y_sonar_train, y_sonar_test = train_test_split(X_sonar, y_sonar, test_size=0.3, random_state = 42)

4 - Fit a KNN classifier using the default settings and report the training and testing accuracies.



In [29]:

    
KNN_sonar = KNeighborsClassifier()
KNN_sonar.fit(X_sonar_train, y_sonar_train)
print('Train accuracy: {:.5f}'.format(KNN_sonar.score(X_sonar_train, y_sonar_train)))
print('Test accuracy: {:.5f}'.format(KNN_sonar.score(X_sonar_test, y_sonar_test)))
# for random_State=42 test greater than train? mmh, strange... but for random_state=0 it's smaller









    



Train accuracy: 0.80690
Test accuracy: 0.84127

5 - Fit a series of KNN classifiers to the data, adjusting the n_neighbors parameter, reporting the training and testing accuracies for each. Comment on your results.



In [29]:

    
n_neighbors = np.arange(1, 11)
for n in n_neighbors:
    # set params
    KNN_sonar.set_params(n_neighbors=n)
    # fit
    KNN_sonar.fit(X_sonar_train, y_sonar_train)
    # get accuracies
    print('KNN with {} neighbors train accuracy: {:.5f}'.format(n, KNN_sonar.score(X_sonar_train, y_sonar_train)))
    print('KNN with {} neighbors test accuracy: {:.5f}'.format(n, KNN_sonar.score(X_sonar_test, y_sonar_test)))









    



KNN with 1 neighbors train accuracy: 1.00000
KNN with 1 neighbors test accuracy: 0.90476
KNN with 2 neighbors train accuracy: 0.86207
KNN with 2 neighbors test accuracy: 0.84127
KNN with 3 neighbors train accuracy: 0.86207
KNN with 3 neighbors test accuracy: 0.88889
KNN with 4 neighbors train accuracy: 0.82759
KNN with 4 neighbors test accuracy: 0.85714
KNN with 5 neighbors train accuracy: 0.80690
KNN with 5 neighbors test accuracy: 0.84127
KNN with 6 neighbors train accuracy: 0.77931
KNN with 6 neighbors test accuracy: 0.73016
KNN with 7 neighbors train accuracy: 0.79310
KNN with 7 neighbors test accuracy: 0.74603
KNN with 8 neighbors train accuracy: 0.74483
KNN with 8 neighbors test accuracy: 0.73016
KNN with 9 neighbors train accuracy: 0.75862
KNN with 9 neighbors test accuracy: 0.73016
KNN with 10 neighbors train accuracy: 0.69655
KNN with 10 neighbors test accuracy: 0.74603

We have this strange phenomenon (to me at least) that for $n$ between 3 and 5 and $n = 10$ the test accuracy is greater than the train one.

As before, the best performance is for 1 neighbor and all the small $n$ give good accuracy scores.

6 - Repeat part (5), but this time adjusting the metric parameter. Comment on your results.



In [30]:

    
n_neighbors = np.arange(1, 11)
for n in n_neighbors:
    # set params
    metric = 'chebyshev'
    KNN_sonar.set_params(n_neighbors=n, metric=metric)
    # fit
    KNN_sonar.fit(X_sonar_train, y_sonar_train)
    # get accuracies
    print('KNN with {} neighbors and {} metric train accuracy: {:.5f}'.format(n,
                                                                              metric,
                                                                              KNN_sonar.score(X_sonar_train, y_sonar_train)))
    print('KNN with {} neighbors and {} metric test accuracy: {:.5f}'.format(n,
                                                                             metric,
                                                                             KNN_sonar.score(X_sonar_test, y_sonar_test)))
    for p in np.arange(1, 6):
        # set params
        metric = 'minkowski'
        KNN_sonar.set_params(metric=metric, p=p)
        # fit
        KNN_sonar.fit(X_sonar_train, y_sonar_train)
        # get accuracies
        print('KNN with {} neighbors and {} {}-metric train accuracy: {:.5f}'.format(n,
                                                                                     metric,
                                                                                     p,
                                                                                     KNN_sonar.score(X_sonar_train, y_sonar_train)))
        print('KNN with {} neighbors and {} {}-metric test accuracy: {:.5f}'.format(n,
                                                                                    metric,
                                                                                    p,
                                                                                    KNN_sonar.score(X_sonar_test, y_sonar_test)))









    



KNN with 1 neighbors and chebyshev metric train accuracy: 1.00000
KNN with 1 neighbors and chebyshev metric test accuracy: 0.84127
KNN with 1 neighbors and minkowski 1-metric train accuracy: 1.00000
KNN with 1 neighbors and minkowski 1-metric test accuracy: 0.90476
KNN with 1 neighbors and minkowski 2-metric train accuracy: 1.00000
KNN with 1 neighbors and minkowski 2-metric test accuracy: 0.90476
KNN with 1 neighbors and minkowski 3-metric train accuracy: 1.00000
KNN with 1 neighbors and minkowski 3-metric test accuracy: 0.90476
KNN with 1 neighbors and minkowski 4-metric train accuracy: 1.00000
KNN with 1 neighbors and minkowski 4-metric test accuracy: 0.92063
KNN with 1 neighbors and minkowski 5-metric train accuracy: 1.00000
KNN with 1 neighbors and minkowski 5-metric test accuracy: 0.88889
KNN with 2 neighbors and chebyshev metric train accuracy: 0.82069
KNN with 2 neighbors and chebyshev metric test accuracy: 0.77778
KNN with 2 neighbors and minkowski 1-metric train accuracy: 0.88966
KNN with 2 neighbors and minkowski 1-metric test accuracy: 0.88889
KNN with 2 neighbors and minkowski 2-metric train accuracy: 0.86207
KNN with 2 neighbors and minkowski 2-metric test accuracy: 0.84127
KNN with 2 neighbors and minkowski 3-metric train accuracy: 0.88276
KNN with 2 neighbors and minkowski 3-metric test accuracy: 0.77778
KNN with 2 neighbors and minkowski 4-metric train accuracy: 0.86207
KNN with 2 neighbors and minkowski 4-metric test accuracy: 0.77778
KNN with 2 neighbors and minkowski 5-metric train accuracy: 0.86207
KNN with 2 neighbors and minkowski 5-metric test accuracy: 0.76190
KNN with 3 neighbors and chebyshev metric train accuracy: 0.86897
KNN with 3 neighbors and chebyshev metric test accuracy: 0.82540
KNN with 3 neighbors and minkowski 1-metric train accuracy: 0.88276
KNN with 3 neighbors and minkowski 1-metric test accuracy: 0.90476
KNN with 3 neighbors and minkowski 2-metric train accuracy: 0.86207
KNN with 3 neighbors and minkowski 2-metric test accuracy: 0.88889
KNN with 3 neighbors and minkowski 3-metric train accuracy: 0.85517
KNN with 3 neighbors and minkowski 3-metric test accuracy: 0.90476
KNN with 3 neighbors and minkowski 4-metric train accuracy: 0.86207
KNN with 3 neighbors and minkowski 4-metric test accuracy: 0.85714
KNN with 3 neighbors and minkowski 5-metric train accuracy: 0.87586
KNN with 3 neighbors and minkowski 5-metric test accuracy: 0.82540
KNN with 4 neighbors and chebyshev metric train accuracy: 0.80000
KNN with 4 neighbors and chebyshev metric test accuracy: 0.73016
KNN with 4 neighbors and minkowski 1-metric train accuracy: 0.83448
KNN with 4 neighbors and minkowski 1-metric test accuracy: 0.87302
KNN with 4 neighbors and minkowski 2-metric train accuracy: 0.82759
KNN with 4 neighbors and minkowski 2-metric test accuracy: 0.85714
KNN with 4 neighbors and minkowski 3-metric train accuracy: 0.79310
KNN with 4 neighbors and minkowski 3-metric test accuracy: 0.84127
KNN with 4 neighbors and minkowski 4-metric train accuracy: 0.79310
KNN with 4 neighbors and minkowski 4-metric test accuracy: 0.80952
KNN with 4 neighbors and minkowski 5-metric train accuracy: 0.79310
KNN with 4 neighbors and minkowski 5-metric test accuracy: 0.74603
KNN with 5 neighbors and chebyshev metric train accuracy: 0.78621
KNN with 5 neighbors and chebyshev metric test accuracy: 0.74603
KNN with 5 neighbors and minkowski 1-metric train accuracy: 0.82069
KNN with 5 neighbors and minkowski 1-metric test accuracy: 0.87302
KNN with 5 neighbors and minkowski 2-metric train accuracy: 0.80690
KNN with 5 neighbors and minkowski 2-metric test accuracy: 0.84127
KNN with 5 neighbors and minkowski 3-metric train accuracy: 0.80000
KNN with 5 neighbors and minkowski 3-metric test accuracy: 0.84127
KNN with 5 neighbors and minkowski 4-metric train accuracy: 0.80690
KNN with 5 neighbors and minkowski 4-metric test accuracy: 0.80952
KNN with 5 neighbors and minkowski 5-metric train accuracy: 0.80690
KNN with 5 neighbors and minkowski 5-metric test accuracy: 0.80952
KNN with 6 neighbors and chebyshev metric train accuracy: 0.77241
KNN with 6 neighbors and chebyshev metric test accuracy: 0.69841
KNN with 6 neighbors and minkowski 1-metric train accuracy: 0.81379
KNN with 6 neighbors and minkowski 1-metric test accuracy: 0.79365
KNN with 6 neighbors and minkowski 2-metric train accuracy: 0.77931
KNN with 6 neighbors and minkowski 2-metric test accuracy: 0.73016
KNN with 6 neighbors and minkowski 3-metric train accuracy: 0.76552
KNN with 6 neighbors and minkowski 3-metric test accuracy: 0.73016
KNN with 6 neighbors and minkowski 4-metric train accuracy: 0.78621
KNN with 6 neighbors and minkowski 4-metric test accuracy: 0.76190
KNN with 6 neighbors and minkowski 5-metric train accuracy: 0.77931
KNN with 6 neighbors and minkowski 5-metric test accuracy: 0.74603
KNN with 7 neighbors and chebyshev metric train accuracy: 0.77931
KNN with 7 neighbors and chebyshev metric test accuracy: 0.68254
KNN with 7 neighbors and minkowski 1-metric train accuracy: 0.83448
KNN with 7 neighbors and minkowski 1-metric test accuracy: 0.79365
KNN with 7 neighbors and minkowski 2-metric train accuracy: 0.79310
KNN with 7 neighbors and minkowski 2-metric test accuracy: 0.74603
KNN with 7 neighbors and minkowski 3-metric train accuracy: 0.76552
KNN with 7 neighbors and minkowski 3-metric test accuracy: 0.71429
KNN with 7 neighbors and minkowski 4-metric train accuracy: 0.77241
KNN with 7 neighbors and minkowski 4-metric test accuracy: 0.73016
KNN with 7 neighbors and minkowski 5-metric train accuracy: 0.77241
KNN with 7 neighbors and minkowski 5-metric test accuracy: 0.73016
KNN with 8 neighbors and chebyshev metric train accuracy: 0.77931
KNN with 8 neighbors and chebyshev metric test accuracy: 0.69841
KNN with 8 neighbors and minkowski 1-metric train accuracy: 0.79310
KNN with 8 neighbors and minkowski 1-metric test accuracy: 0.74603
KNN with 8 neighbors and minkowski 2-metric train accuracy: 0.74483
KNN with 8 neighbors and minkowski 2-metric test accuracy: 0.73016
KNN with 8 neighbors and minkowski 3-metric train accuracy: 0.75172
KNN with 8 neighbors and minkowski 3-metric test accuracy: 0.73016
KNN with 8 neighbors and minkowski 4-metric train accuracy: 0.76552
KNN with 8 neighbors and minkowski 4-metric test accuracy: 0.73016
KNN with 8 neighbors and minkowski 5-metric train accuracy: 0.75172
KNN with 8 neighbors and minkowski 5-metric test accuracy: 0.71429
KNN with 9 neighbors and chebyshev metric train accuracy: 0.74483
KNN with 9 neighbors and chebyshev metric test accuracy: 0.65079
KNN with 9 neighbors and minkowski 1-metric train accuracy: 0.80000
KNN with 9 neighbors and minkowski 1-metric test accuracy: 0.77778
KNN with 9 neighbors and minkowski 2-metric train accuracy: 0.75862
KNN with 9 neighbors and minkowski 2-metric test accuracy: 0.73016
KNN with 9 neighbors and minkowski 3-metric train accuracy: 0.73793
KNN with 9 neighbors and minkowski 3-metric test accuracy: 0.66667
KNN with 9 neighbors and minkowski 4-metric train accuracy: 0.75862
KNN with 9 neighbors and minkowski 4-metric test accuracy: 0.68254
KNN with 9 neighbors and minkowski 5-metric train accuracy: 0.72414
KNN with 9 neighbors and minkowski 5-metric test accuracy: 0.68254
KNN with 10 neighbors and chebyshev metric train accuracy: 0.71724
KNN with 10 neighbors and chebyshev metric test accuracy: 0.66667
KNN with 10 neighbors and minkowski 1-metric train accuracy: 0.76552
KNN with 10 neighbors and minkowski 1-metric test accuracy: 0.74603
KNN with 10 neighbors and minkowski 2-metric train accuracy: 0.69655
KNN with 10 neighbors and minkowski 2-metric test accuracy: 0.74603
KNN with 10 neighbors and minkowski 3-metric train accuracy: 0.68966
KNN with 10 neighbors and minkowski 3-metric test accuracy: 0.68254
KNN with 10 neighbors and minkowski 4-metric train accuracy: 0.72414
KNN with 10 neighbors and minkowski 4-metric test accuracy: 0.69841
KNN with 10 neighbors and minkowski 5-metric train accuracy: 0.71724
KNN with 10 neighbors and minkowski 5-metric test accuracy: 0.71429

The best scores in general are for Minkovski's metrics with small $n$ and $p$ between 1 and 3. The best test accuracy although is for $n = 1$ and $p = 4$.

Logistic Regression

The final classification algorithm we'll work with here is Logistic Regression. As its name implies, this technique makes use of the Logistic Function given by $$\sigma (t)={\frac {e^{t}}{e^{t}+1}}={\frac {1}{1+e^{-t}}}$$ which forms a bit of an S-shaped curve. The intuition behind the this model is that, given a set of predictor variables, we want our response to collapse to a binary output, 0 or 1. As it turns out, this technique, although one of the oldest, tends to provide good results in a variety of problems, and is lightweight.

1 - Create a set of 100 random datapoints separated by the line $f(x) = x + \epsilon$, where $\epsilon$ is a noise term, classifying points as lying above or below the curve. Fit a Logistic Regression model to your data, report the accuracy, and plot the decision boundary. Comment on your results.



In [31]:

    
# generate data
np.random.seed(459)
a = -3
b = 3
# as before in solution linspace is used
x1_lr = (b - a)*np.random.rand(100) + a
a = -0.5
b = 0.5
noise = (b - a)*np.random.rand(100) + a
x2_lr = x1_lr + noise

X_lr = np.vstack((x1_lr, x2_lr)).T
y_lr = np.array((x2_lr >= x1_lr), dtype=int)



In [32]:

    
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_lr, y_lr)
lr.score(X_lr, y_lr)









    Out[32]:





0.93000000000000005



In [33]:

    
plot_decision_boundary(X_lr, y_lr, lr, 0.002)

2 - Repeat part (1), but this time having $f(x) = sin(x) + \epsilon$.



In [34]:

    
# generate data
np.random.seed(459)
a = -3
b = 3
x1_lr = (b - a)*np.random.rand(100) + a
a = -0.5
b = 0.5
noise = (b - a)*np.random.rand(100) + a
x2_lr = np.sin(x1_lr) + noise

X_lr = np.vstack((x1_lr, x2_lr)).T
y_lr = np.array((x2_lr >= np.sin(x1_lr)), dtype=int)



In [35]:

    
lr = LogisticRegression()
lr.fit(X_lr, y_lr)
lr.score(X_lr, y_lr)









    Out[35]:





0.77000000000000002



In [36]:

    
plot_decision_boundary(X_lr, y_lr, lr, 0.002)

3 - Head over to the Machine Learning Repository, download the Chronic_Kidney_Disease Data Set, put it into a data frame, drop all categorical variables, and split into training and testing sets. Be sure to familiarize yourself with the data before proceeding.



In [37]:

    
kidney = pd.read_csv('chronic_kidney_disease_full.arff', header=None)
kidney.columns = ['age','bp','sg','al','su','rbc','pc','pcc','ba','bgr','bu','sc','sod',
                  'pot','hemo','pcv','wc','rc','htn','dm','cad','appet','pe','ane','class']
object_cols = [col for col in kidney.columns if kidney[col].dtype == object and col != 'class']
kidney.drop(object_cols, axis=1, inplace=True)
# solution: .drop(data.dtypes[data.dtypes == 'object'].index[:-1], axis=1)
kidney.dropna(inplace=True)
kidney.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 228 entries, 3 to 399
Data columns (total 12 columns):
age      228 non-null float64
bp       228 non-null float64
sg       228 non-null float64
al       228 non-null float64
su       228 non-null float64
bgr      228 non-null float64
bu       228 non-null float64
sc       228 non-null float64
sod      228 non-null float64
pot      228 non-null float64
hemo     228 non-null float64
class    228 non-null object
dtypes: float64(11), object(1)
memory usage: 23.2+ KB



In [38]:

    
kidney.head()



In [39]:

    
kidney.describe().T



In [40]:

    
kidney.loc[kidney['class'] == 'ckd'].describe().T



In [41]:

    
kidney.loc[kidney['class'] == 'notckd'].describe().T



In [42]:

    
from sklearn.preprocessing import StandardScaler
# create matrices and setting classes to 0, 1
X_kidney = kidney.iloc[:, :-1].values
y_kidney = np.array([1 if cl == 'ckd' else 0 for cl in kidney.iloc[:, -1].values])
# train-test split
X_kidney_train, X_kidney_test, y_kidney_train, y_kidney_test = train_test_split(X_kidney, y_kidney, test_size=0.3, random_state=75)
# scaling
sc = StandardScaler()
sc.fit_transform(X_kidney_train)
X_kidney_train_std = sc.transform(X_kidney_train)
X_kidney_test_std = sc.transform(X_kidney_test)

4 - Fit a Logistic Regression model to the data, report the training and testing accuracies, and comment on your results.



In [43]:

    
lr_kidney = LogisticRegression()
lr_kidney.fit(X_kidney_train_std, y_kidney_train)
print('Train accuracy: {:.5f}'.format(lr_kidney.score(X_kidney_train_std, y_kidney_train)))
print('Test accuracy: {:.5f}'.format(lr_kidney.score(X_kidney_test_std, y_kidney_test)))









    



Train accuracy: 0.99371
Test accuracy: 0.98551

The model does a pretty good job classifying the patients with accuracies over 98.5% both in training and test set.

5 - Fit a series of Logistic Regression models for different values of the C hyperparameter of differing orders of magnitude, reporting the training and testing accuracies of each. Comment on your results.



In [44]:

    
magnitudes = [10**i for i in range(-5, 6)]
for C in magnitudes:
    lr_kidney.set_params(C=C)
    lr_kidney.fit(X_kidney_train_std, y_kidney_train)
    print('Train accuracy for C = {}: {:.5f}'.format(C, lr_kidney.score(X_kidney_train_std, y_kidney_train)))
    print('Test accuracy for C = {}: {:.5f}'.format(C, lr_kidney.score(X_kidney_test_std, y_kidney_test)))









    



Train accuracy for C = 1e-05: 0.96226
Test accuracy for C = 1e-05: 0.94203
Train accuracy for C = 0.0001: 0.96226
Test accuracy for C = 0.0001: 0.94203
Train accuracy for C = 0.001: 0.96226
Test accuracy for C = 0.001: 0.94203
Train accuracy for C = 0.01: 0.96226
Test accuracy for C = 0.01: 0.95652
Train accuracy for C = 0.1: 0.96855
Test accuracy for C = 0.1: 0.97101
Train accuracy for C = 1: 0.99371
Test accuracy for C = 1: 0.98551
Train accuracy for C = 10: 1.00000
Test accuracy for C = 10: 0.98551
Train accuracy for C = 100: 1.00000
Test accuracy for C = 100: 0.98551
Train accuracy for C = 1000: 1.00000
Test accuracy for C = 1000: 0.97101
Train accuracy for C = 10000: 1.00000
Test accuracy for C = 10000: 0.97101
Train accuracy for C = 100000: 1.00000
Test accuracy for C = 100000: 0.97101

We can see that lowering the regularization the model fits perfectly the training set but it doesn't perform that well on the test set. In general the performances are very good for all the parameters, the best choice is $C = 1$, which is also the default.

6 - Experiment with different parameter settings to see if you can improve classification. Be sure to report the training and testing accuracies and comment on your results.



In [45]:

    
cols = ['C', 'class_weight', 'penalty', 'train_score', 'test_score']
lrparams = pd.DataFrame(columns=cols)
magnitudes = [10**i for i in range(-5, 6)]
for C in magnitudes:
    # set params
    lr_kidney.set_params(C=C)
    for class_weight in [None, 'balanced']:
        # set params
        lr_kidney.set_params(class_weight=class_weight)
        for penalty in ['l2', 'l1']:
            # set params
            lr_kidney.set_params(penalty=penalty)
            # fit model
            lr_kidney.fit(X_kidney_train_std, y_kidney_train)
            # get accuracy
            lrparams = lrparams.append(pd.DataFrame([[C,
                                                      class_weight,
                                                      penalty,
                                                      lr_kidney.score(X_kidney_train_std, y_kidney_train),
                                                      lr_kidney.score(X_kidney_test_std, y_kidney_test)]],
                                                    columns=cols))



In [46]:

    
lrparams.sort_values(by='test_score', ascending=False)









    Out[46]:






  
    
      
      C
      class_weight
      penalty
      train_score
      test_score
    
  
  
    
      0
      1.00000
      balanced
      l2
      1.000000
      0.985507
    
    
      0
      10.00000
      None
      l2
      1.000000
      0.985507
    
    
      0
      1.00000
      None
      l2
      0.993711
      0.985507
    
    
      0
      10.00000
      balanced
      l2
      1.000000
      0.985507
    
    
      0
      0.10000
      balanced
      l2
      0.968553
      0.985507
    
    
      0
      100.00000
      None
      l2
      1.000000
      0.985507
    
    
      0
      100.00000
      balanced
      l2
      1.000000
      0.985507
    
    
      0
      10.00000
      balanced
      l1
      1.000000
      0.971014
    
    
      0
      100.00000
      None
      l1
      1.000000
      0.971014
    
    
      0
      100.00000
      balanced
      l1
      1.000000
      0.971014
    
    
      0
      1000.00000
      None
      l2
      1.000000
      0.971014
    
    
      0
      1000.00000
      None
      l1
      1.000000
      0.971014
    
    
      0
      1000.00000
      balanced
      l2
      1.000000
      0.971014
    
    
      0
      1000.00000
      balanced
      l1
      1.000000
      0.971014
    
    
      0
      10000.00000
      None
      l2
      1.000000
      0.971014
    
    
      0
      10000.00000
      None
      l1
      1.000000
      0.971014
    
    
      0
      10000.00000
      balanced
      l2
      1.000000
      0.971014
    
    
      0
      10000.00000
      balanced
      l1
      1.000000
      0.971014
    
    
      0
      100000.00000
      None
      l2
      1.000000
      0.971014
    
    
      0
      100000.00000
      None
      l1
      1.000000
      0.971014
    
    
      0
      100000.00000
      balanced
      l2
      1.000000
      0.971014
    
    
      0
      10.00000
      None
      l1
      1.000000
      0.971014
    
    
      0
      100000.00000
      balanced
      l1
      1.000000
      0.971014
    
    
      0
      0.10000
      None
      l2
      0.968553
      0.971014
    
    
      0
      0.01000
      balanced
      l2
      0.968553
      0.971014
    
    
      0
      0.01000
      None
      l2
      0.962264
      0.956522
    
    
      0
      0.00001
      balanced
      l2
      0.968553
      0.956522
    
    
      0
      0.00010
      balanced
      l2
      0.968553
      0.956522
    
    
      0
      0.00100
      balanced
      l2
      0.968553
      0.956522
    
    
      0
      1.00000
      balanced
      l1
      1.000000
      0.942029
    
    
      0
      0.00010
      None
      l2
      0.962264
      0.942029
    
    
      0
      0.00100
      None
      l2
      0.962264
      0.942029
    
    
      0
      0.00001
      None
      l2
      0.962264
      0.942029
    
    
      0
      1.00000
      None
      l1
      1.000000
      0.942029
    
    
      0
      0.10000
      None
      l1
      0.981132
      0.927536
    
    
      0
      0.10000
      balanced
      l1
      0.981132
      0.927536
    
    
      0
      0.00001
      None
      l1
      0.547170
      0.579710
    
    
      0
      0.01000
      None
      l1
      0.547170
      0.579710
    
    
      0
      0.00100
      None
      l1
      0.547170
      0.579710
    
    
      0
      0.01000
      balanced
      l1
      0.547170
      0.579710
    
    
      0
      0.00010
      balanced
      l1
      0.547170
      0.579710
    
    
      0
      0.00010
      None
      l1
      0.547170
      0.579710
    
    
      0
      0.00001
      balanced
      l1
      0.547170
      0.579710
    
    
      0
      0.00100
      balanced
      l1
      0.547170
      0.579710

In general l2 norms work better than l1, we have a few other optimal settings but they are tied to the first one in test set accuracy.

Comparing Classifiers

As you may have suspected while working through the previous exercises, different algorithms are more suited to different kinds of problems which is what we will explore here.

1 - Create a set of 100 datapoints which are linearly separable, classifying the data as being above or below the curve. Fit a Naive Bayes, KNN, and Logistic Regression model to your data, reporting the accuracies and plotting the decision boundaries. Comment on your results.



In [47]:

    
# generate data
np.random.seed(459)
a = -3
b = 3
x1_linsep = (b - a)*np.random.rand(100) + a
a = -0.5
b = 0.5
noise = (b - a)*np.random.rand(100) + a
x2_linsep = x1_linsep + noise

X_linsep = np.vstack((x1_linsep, x2_linsep)).T
y_linsep = np.array((x2_linsep >= x1_linsep), dtype=int)



In [48]:

    
names = ['Gaussian Naive Bayes', 'K-Neighbours Classifier', 'Logistic Regression']
models = [GaussianNB(), KNeighborsClassifier(), LogisticRegression()]
for name, model in zip(names, models):
    # fit model
    model.fit(X_linsep, y_linsep)
    print(name)
    # get accuracy
    print('Accuracy: {:.5f}'.format(model.score(X_linsep, y_linsep)))
    # plot boundaries
    fig, ax = plt.subplots(1, 1)
    ax.set_title(name)
    plot_decision_boundary(X_linsep, y_linsep, model, 0.002)









    



Gaussian Naive Bayes
Accuracy: 0.60000
K-Neighbours Classifier
Accuracy: 0.90000
Logistic Regression
Accuracy: 0.93000

Logistic regression works best, obviously. Maybe by tweaking the parameters we can get KNN to work as well (it seems to follow to much the noise, some regularization may help), while Naive Bayes seems way out in its predictions.

2 - Create a set of 200 random data points that lie on a unit circle displaced by some noise, categorizing each as being inside or outside the circle. Stated analytically, select $x$ and $y$ such that $x^2 + y^2 = 1 + \epsilon$ where $\epsilon$ is a noise term. Fit a Naive Bayes, KNN, and Logistic Regression model to your data, reporting the accuracies and plotting the decision boundaries. Comment on your results.



In [49]:

    
# generate data
np.random.seed(459)
a = 0
b = 2*np.pi
angles = (b - a)*np.random.rand(200) + a
# generate noise
a = -0.5
b = 0.5
x_noise = (b - a)*np.random.rand(200) + a
y_noise = (b - a)*np.random.rand(200) + a
# fit angles to unit circle using sine and cosine
x1_circle = np.cos(angles) + x_noise
x2_circle = np.sin(angles) + y_noise

X_circle = np.vstack((x1_circle, x2_circle)).T
y_circle = np.array((x1_circle**2 + x2_circle**2 <= 1), dtype=int)



In [50]:

    
names = ['Gaussian Naive Bayes', 'K-Neighbours Classifier', 'Logistic Regression']
models = [GaussianNB(), KNeighborsClassifier(), LogisticRegression()]
for name, model in zip(names, models):
    # fit model
    model.fit(X_circle, y_circle)
    print(name)
    # get accuracy
    print('Accuracy: {:.5f}'.format(model.score(X_circle, y_circle)))
    # plot boundaries
    fig, ax = plt.subplots(1, 1)
    ax.set_title(name)
    plot_decision_boundary(X_circle, y_circle, model, 0.002)









    



Gaussian Naive Bayes
Accuracy: 0.78500
K-Neighbours Classifier
Accuracy: 0.96500
Logistic Regression
Accuracy: 0.60500

I expected better from Naive Bayes, instead the boundary seems a circle but it is too small. The KNN is doing really well while, obviously again, logistic regression is the worst model for this data (and I really don't know what's going on with the plot...).

3 - Using the SMS Spam data set above, fit a KNN and Logistic Regression model to the data and report your training and testing accuracies. Comment on your results.



In [51]:

    
names = ['Multinomial Naive Bayes', 'K-Neighbours Classifier', 'Logistic Regression']
models = [MultinomialNB(), KNeighborsClassifier(), LogisticRegression()]
for name, model in zip(names, models):
    # fit model
    model.fit(X_spam_train, y_spam_train)
    print(name)
    # get accuracy
    print('Train accuracy: {:.5f}'.format(model.score(X_spam_train, y_spam_train)))
    print('Test accuracy: {:.5f}'.format(model.score(X_spam_test, y_spam_test)))









    



Multinomial Naive Bayes
Train accuracy: 0.99282
Test accuracy: 0.98505
K-Neighbours Classifier
Train accuracy: 0.92564
Test accuracy: 0.91627
Logistic Regression
Train accuracy: 0.99821
Test accuracy: 0.98505

As expected multinomial is the best model, but logistic regression does equally good, while KNN, despite doing a pretty good too, takes a real long time to calculate predictions.

4 - Using the Sonar data set above, fit a Naive Bayes and Logistic Regression model to the data and report your training and testing accuracies. Comment on your results.



In [52]:

    
names = ['K-Neighbours Classifier', 'Gaussian Naive Bayes', 'Logistic Regression']
models = [KNeighborsClassifier(), MultinomialNB(), LogisticRegression()]
for name, model in zip(names, models):
    # fit model
    model.fit(X_sonar_train, y_sonar_train)
    print(name)
    # get accuracy
    print('Train accuracy: {:.5f}'.format(model.score(X_sonar_train, y_sonar_train)))
    print('Test accuracy: {:.5f}'.format(model.score(X_sonar_test, y_sonar_test)))









    



K-Neighbours Classifier
Train accuracy: 0.80690
Test accuracy: 0.84127
Gaussian Naive Bayes
Train accuracy: 0.73103
Test accuracy: 0.77778
Logistic Regression
Train accuracy: 0.78621
Test accuracy: 0.82540

Again the best model is the one used before (KNN), but logistic regression again performs quite well

5 - Using the Kidney data set above, fit a Naive Bayes and KNN model to the data and report your training and testing accuracies. Comment on your results. What can you say about your results overall in parts (3), (4), and (5)?



In [53]:

    
names = ['Logistic Regression', 'K-Neighbours Classifier', 'Gaussian Naive Bayes']
models = [LogisticRegression(), KNeighborsClassifier(), MultinomialNB()]
for name, model in zip(names, models):
    # fit model
    model.fit(X_kidney_train, y_kidney_train)
    print(name)
    # get accuracy
    print('Train accuracy: {:.5f}'.format(model.score(X_kidney_train, y_kidney_train)))
    print('Test accuracy: {:.5f}'.format(model.score(X_kidney_test, y_kidney_test)))









    



Logistic Regression
Train accuracy: 0.99371
Test accuracy: 0.97101
K-Neighbours Classifier
Train accuracy: 0.89937
Test accuracy: 0.92754
Gaussian Naive Bayes
Train accuracy: 0.88679
Test accuracy: 0.91304

All models perform quite well (and both the new models have higher test than training accuracy...), but none of them is near to the precision of logistic regression.

Overall we can say that the choice of a right model matters, especially with small datasets such as this; logistic regression seems to be the model that performs best across all datasets in this case.

	count	mean	std	min	25%	50%	75%	max
freq0	208.0	0.029164	0.022991	0.0015	0.013350	0.02280	0.035550	0.1371
freq1	208.0	0.038437	0.032960	0.0006	0.016450	0.03080	0.047950	0.2339
freq2	208.0	0.043832	0.038428	0.0015	0.018950	0.03430	0.057950	0.3059
freq3	208.0	0.053892	0.046528	0.0058	0.024375	0.04405	0.064500	0.4264
freq4	208.0	0.075202	0.055552	0.0067	0.038050	0.06250	0.100275	0.4010
freq5	208.0	0.104570	0.059105	0.0102	0.067025	0.09215	0.134125	0.3823
freq6	208.0	0.121747	0.061788	0.0033	0.080900	0.10695	0.154000	0.3729
freq7	208.0	0.134799	0.085152	0.0055	0.080425	0.11210	0.169600	0.4590
freq8	208.0	0.178003	0.118387	0.0075	0.097025	0.15225	0.233425	0.6828
freq9	208.0	0.208259	0.134416	0.0113	0.111275	0.18240	0.268700	0.7106
freq10	208.0	0.236013	0.132705	0.0289	0.129250	0.22480	0.301650	0.7342
freq11	208.0	0.250221	0.140072	0.0236	0.133475	0.24905	0.331250	0.7060
freq12	208.0	0.273305	0.140962	0.0184	0.166125	0.26395	0.351250	0.7131
freq13	208.0	0.296568	0.164474	0.0273	0.175175	0.28110	0.386175	0.9970
freq14	208.0	0.320201	0.205427	0.0031	0.164625	0.28170	0.452925	1.0000
freq15	208.0	0.378487	0.232650	0.0162	0.196300	0.30470	0.535725	0.9988
freq16	208.0	0.415983	0.263677	0.0349	0.205850	0.30840	0.659425	1.0000
freq17	208.0	0.452318	0.261529	0.0375	0.242075	0.36830	0.679050	1.0000
freq18	208.0	0.504812	0.257988	0.0494	0.299075	0.43495	0.731400	1.0000
freq19	208.0	0.563047	0.262653	0.0656	0.350625	0.54250	0.809325	1.0000
freq20	208.0	0.609060	0.257818	0.0512	0.399725	0.61770	0.816975	1.0000
freq21	208.0	0.624275	0.255883	0.0219	0.406925	0.66490	0.831975	1.0000
freq22	208.0	0.646975	0.250175	0.0563	0.450225	0.69970	0.848575	1.0000
freq23	208.0	0.672654	0.239116	0.0239	0.540725	0.69850	0.872175	1.0000
freq24	208.0	0.675424	0.244926	0.0240	0.525800	0.72110	0.873725	1.0000
freq25	208.0	0.699866	0.237228	0.0921	0.544175	0.75450	0.893800	1.0000
freq26	208.0	0.702155	0.245657	0.0481	0.531900	0.74560	0.917100	1.0000
freq27	208.0	0.694024	0.237189	0.0284	0.534775	0.73190	0.900275	1.0000
freq28	208.0	0.642074	0.240250	0.0144	0.463700	0.68080	0.852125	1.0000
freq29	208.0	0.580928	0.220749	0.0613	0.411400	0.60715	0.735175	1.0000
freq30	208.0	0.504475	0.213992	0.0482	0.345550	0.49035	0.641950	0.9657
freq31	208.0	0.439040	0.213237	0.0404	0.281400	0.42960	0.580300	0.9306
freq32	208.0	0.417220	0.206513	0.0477	0.257875	0.39120	0.556125	1.0000
freq33	208.0	0.403233	0.231242	0.0212	0.217575	0.35105	0.596125	0.9647
freq34	208.0	0.392571	0.259132	0.0223	0.179375	0.31275	0.593350	1.0000
freq35	208.0	0.384848	0.264121	0.0080	0.154350	0.32115	0.556525	1.0000
freq36	208.0	0.363807	0.239912	0.0351	0.160100	0.30630	0.518900	0.9497
freq37	208.0	0.339657	0.212973	0.0383	0.174275	0.31270	0.440550	1.0000
freq38	208.0	0.325800	0.199075	0.0371	0.173975	0.28350	0.434900	0.9857
freq39	208.0	0.311207	0.178662	0.0117	0.186450	0.27805	0.424350	0.9297
freq40	208.0	0.289252	0.171111	0.0360	0.163100	0.25950	0.387525	0.8995
freq41	208.0	0.278293	0.168728	0.0056	0.158900	0.24510	0.384250	0.8246
freq42	208.0	0.246542	0.138993	0.0000	0.155200	0.22255	0.324525	0.7733
freq43	208.0	0.214075	0.133291	0.0000	0.126875	0.17770	0.271750	0.7762
freq44	208.0	0.197232	0.151628	0.0000	0.094475	0.14800	0.231550	0.7034
freq45	208.0	0.160631	0.133938	0.0000	0.068550	0.12135	0.200375	0.7292
freq46	208.0	0.122453	0.086953	0.0000	0.064250	0.10165	0.154425	0.5522
freq47	208.0	0.091424	0.062417	0.0000	0.045125	0.07810	0.120100	0.3339
freq48	208.0	0.051929	0.035954	0.0000	0.026350	0.04470	0.068525	0.1981
freq49	208.0	0.020424	0.013665	0.0000	0.011550	0.01790	0.025275	0.0825
freq50	208.0	0.016069	0.012008	0.0000	0.008425	0.01390	0.020825	0.1004
freq51	208.0	0.013420	0.009634	0.0008	0.007275	0.01140	0.016725	0.0709
freq52	208.0	0.010709	0.007060	0.0005	0.005075	0.00955	0.014900	0.0390
freq53	208.0	0.010941	0.007301	0.0010	0.005375	0.00930	0.014500	0.0352
freq54	208.0	0.009290	0.007088	0.0006	0.004150	0.00750	0.012100	0.0447
freq55	208.0	0.008222	0.005736	0.0004	0.004400	0.00685	0.010575	0.0394
freq56	208.0	0.007820	0.005785	0.0003	0.003700	0.00595	0.010425	0.0355
freq57	208.0	0.007949	0.006470	0.0003	0.003600	0.00580	0.010350	0.0440
freq58	208.0	0.007941	0.006181	0.0001	0.003675	0.00640	0.010325	0.0364
freq59	208.0	0.006507	0.005031	0.0006	0.003100	0.00530	0.008525	0.0439

	count	mean	std	min	25%	50%	75%	max
freq0	97.0	0.022498	0.014687	0.0025	0.0123	0.0201	0.0291	0.0856
freq1	97.0	0.030303	0.024011	0.0006	0.0132	0.0242	0.0433	0.1632
freq2	97.0	0.035951	0.029101	0.0024	0.0145	0.0288	0.0501	0.1636
freq3	97.0	0.041447	0.031172	0.0058	0.0211	0.0350	0.0494	0.1732
freq4	97.0	0.062028	0.047217	0.0067	0.0305	0.0476	0.0794	0.2565
freq5	97.0	0.096224	0.065025	0.0102	0.0591	0.0792	0.1164	0.3823
freq6	97.0	0.114180	0.065155	0.0033	0.0731	0.1015	0.1422	0.3729
freq7	97.0	0.117596	0.079772	0.0057	0.0663	0.0973	0.1451	0.4110
freq8	97.0	0.137392	0.099887	0.0075	0.0743	0.1054	0.1598	0.5598
freq9	97.0	0.159325	0.113151	0.0242	0.0860	0.1264	0.1908	0.6194
freq10	97.0	0.174713	0.113363	0.0289	0.0937	0.1475	0.2366	0.6333
freq11	97.0	0.191589	0.134671	0.0259	0.0909	0.1518	0.2565	0.7060
freq12	97.0	0.226249	0.138082	0.0184	0.1254	0.1975	0.3002	0.6919
freq13	97.0	0.268963	0.166283	0.0420	0.1272	0.2573	0.3598	0.8198
freq14	97.0	0.307636	0.219055	0.0031	0.1424	0.2333	0.4767	1.0000
freq15	97.0	0.375611	0.255401	0.0162	0.1732	0.2652	0.5440	0.9988
freq16	97.0	0.417100	0.289979	0.0349	0.1929	0.2933	0.6950	1.0000
freq17	97.0	0.448240	0.269781	0.0375	0.2203	0.3655	0.6790	1.0000
freq18	97.0	0.466762	0.258432	0.0494	0.2915	0.3880	0.7060	1.0000
freq19	97.0	0.500229	0.259433	0.0656	0.3226	0.4529	0.7728	1.0000
freq20	97.0	0.542270	0.248753	0.1268	0.3542	0.5212	0.7561	1.0000
freq21	97.0	0.569290	0.260639	0.0219	0.3715	0.5991	0.7813	1.0000
freq22	97.0	0.612959	0.243941	0.0667	0.4290	0.6414	0.8225	1.0000
freq23	97.0	0.653761	0.233179	0.1242	0.5386	0.6741	0.8553	1.0000
freq24	97.0	0.668809	0.250009	0.0240	0.5331	0.7272	0.8658	1.0000
freq25	97.0	0.692762	0.238802	0.1653	0.5647	0.7511	0.8901	1.0000
freq26	97.0	0.687737	0.220013	0.0874	0.5593	0.6995	0.8684	1.0000
freq27	97.0	0.673145	0.205935	0.0832	0.5429	0.6897	0.8533	1.0000
freq28	97.0	0.632680	0.234491	0.0716	0.4739	0.6301	0.8339	1.0000
freq29	97.0	0.579934	0.233361	0.0613	0.4084	0.6153	0.7679	1.0000
freq30	97.0	0.529762	0.201619	0.1307	0.3940	0.5509	0.6499	0.9546
freq31	97.0	0.451619	0.213180	0.0404	0.2628	0.4611	0.6177	0.8940
freq32	97.0	0.440841	0.217085	0.0507	0.2797	0.4031	0.6096	0.9708
freq33	97.0	0.445680	0.251312	0.0306	0.2298	0.4006	0.6234	0.9647
freq34	97.0	0.455530	0.261250	0.0244	0.2487	0.3953	0.6275	1.0000
freq35	97.0	0.460710	0.262311	0.0352	0.2276	0.4676	0.6680	1.0000
freq36	97.0	0.417330	0.242986	0.0379	0.2149	0.4260	0.5789	0.9497
freq37	97.0	0.348868	0.223805	0.0383	0.1866	0.3104	0.4576	1.0000
freq38	97.0	0.313709	0.216223	0.0371	0.1583	0.2685	0.4427	0.9857
freq39	97.0	0.318057	0.196823	0.0117	0.1880	0.2744	0.4057	0.9297
freq40	97.0	0.285428	0.177206	0.0488	0.1659	0.2509	0.3548	0.8995
freq41	97.0	0.252338	0.165874	0.0056	0.1449	0.2200	0.3041	0.7911
freq42	97.0	0.211822	0.130301	0.0000	0.1272	0.1882	0.2728	0.7733
freq43	97.0	0.175132	0.107422	0.0000	0.0976	0.1606	0.2222	0.7762
freq44	97.0	0.142312	0.095663	0.0000	0.0831	0.1304	0.1683	0.6009
freq45	97.0	0.116949	0.093840	0.0000	0.0528	0.0861	0.1643	0.5507
freq46	97.0	0.094458	0.067765	0.0000	0.0479	0.0804	0.1190	0.4331
freq47	97.0	0.069488	0.048244	0.0000	0.0387	0.0561	0.0919	0.2905
freq48	97.0	0.038449	0.030441	0.0000	0.0176	0.0325	0.0497	0.1981
freq49	97.0	0.017796	0.012557	0.0000	0.0093	0.0155	0.0224	0.0779
freq50	97.0	0.012311	0.008632	0.0000	0.0061	0.0107	0.0156	0.0426
freq51	97.0	0.010453	0.007109	0.0008	0.0052	0.0088	0.0141	0.0386
freq52	97.0	0.009640	0.006174	0.0010	0.0045	0.0081	0.0129	0.0265
freq53	97.0	0.009518	0.005389	0.0013	0.0052	0.0088	0.0130	0.0271
freq54	97.0	0.008567	0.005137	0.0006	0.0050	0.0077	0.0114	0.0233
freq55	97.0	0.007430	0.004773	0.0004	0.0044	0.0065	0.0097	0.0244
freq56	97.0	0.007814	0.005702	0.0003	0.0037	0.0061	0.0101	0.0316
freq57	97.0	0.006677	0.004841	0.0003	0.0031	0.0052	0.0092	0.0199
freq58	97.0	0.007078	0.005114	0.0006	0.0039	0.0058	0.0089	0.0294
freq59	97.0	0.006024	0.003669	0.0011	0.0030	0.0054	0.0078	0.0180

	count	mean	std	min	25%	50%	75%	max
freq0	111.0	0.034989	0.027074	0.0015	0.01795	0.0269	0.04185	0.1371
freq1	111.0	0.045544	0.037840	0.0017	0.01945	0.0353	0.05925	0.2339
freq2	111.0	0.050720	0.044014	0.0015	0.02525	0.0386	0.06250	0.3059
freq3	111.0	0.064768	0.054497	0.0061	0.03110	0.0547	0.08155	0.4264
freq4	111.0	0.086715	0.059790	0.0076	0.04590	0.0748	0.11170	0.4010
freq5	111.0	0.111864	0.052620	0.0116	0.07695	0.1091	0.14630	0.2770
freq6	111.0	0.128359	0.058179	0.0130	0.08800	0.1232	0.16160	0.3322
freq7	111.0	0.149832	0.087190	0.0055	0.09805	0.1298	0.18100	0.4590
freq8	111.0	0.213492	0.122237	0.0368	0.13695	0.1864	0.26105	0.6828
freq9	111.0	0.251022	0.137366	0.0113	0.16850	0.2245	0.30185	0.7106
freq10	111.0	0.289581	0.125359	0.0523	0.21900	0.2652	0.34400	0.7342
freq11	111.0	0.301459	0.124149	0.0236	0.23815	0.2880	0.36465	0.5771
freq12	111.0	0.314426	0.130742	0.0616	0.22925	0.2959	0.38260	0.7131
freq13	111.0	0.320692	0.159738	0.0273	0.20135	0.3087	0.39540	0.9970
freq14	111.0	0.331182	0.193066	0.0092	0.18025	0.3052	0.44540	0.9137
freq15	111.0	0.380999	0.211929	0.0422	0.20250	0.3323	0.53300	0.9718
freq16	111.0	0.415007	0.239680	0.0886	0.21685	0.3286	0.60955	1.0000
freq17	111.0	0.455882	0.255275	0.0689	0.24290	0.3715	0.67660	1.0000
freq18	111.0	0.538062	0.254108	0.1151	0.32300	0.5233	0.77540	0.9975
freq19	111.0	0.617941	0.254087	0.0740	0.41200	0.6833	0.83510	1.0000
freq20	111.0	0.667426	0.252410	0.0512	0.50935	0.7420	0.86985	1.0000
freq21	111.0	0.672325	0.242758	0.1127	0.47420	0.7452	0.85150	1.0000
freq22	111.0	0.676701	0.252850	0.0563	0.51510	0.7716	0.87030	1.0000
freq23	111.0	0.689165	0.244037	0.0239	0.55615	0.7298	0.89060	1.0000
freq24	111.0	0.681204	0.241384	0.0885	0.51655	0.7201	0.88985	1.0000
freq25	111.0	0.706075	0.236752	0.0921	0.53595	0.7595	0.89750	1.0000
freq26	111.0	0.714754	0.266410	0.0481	0.49745	0.8300	0.93850	1.0000
freq27	111.0	0.712269	0.261030	0.0284	0.52850	0.7689	0.94820	1.0000
freq28	111.0	0.650283	0.245937	0.0144	0.45540	0.7171	0.86490	1.0000
freq29	111.0	0.581796	0.210175	0.1671	0.41510	0.6048	0.70235	1.0000
freq30	111.0	0.482378	0.222801	0.0482	0.32335	0.4326	0.60955	0.9657
freq31	111.0	0.428049	0.213642	0.0877	0.28410	0.3939	0.54070	0.9306
freq32	111.0	0.396577	0.195461	0.0477	0.24790	0.3828	0.51985	1.0000
freq33	111.0	0.366140	0.206225	0.0212	0.19930	0.3263	0.50310	0.9536
freq34	111.0	0.337553	0.245456	0.0223	0.14255	0.2667	0.51240	1.0000
freq35	111.0	0.318553	0.248427	0.0080	0.13420	0.2331	0.43345	0.9870
freq36	111.0	0.317034	0.228149	0.0351	0.15580	0.2373	0.42825	0.9386
freq37	111.0	0.331608	0.203718	0.0618	0.17425	0.3156	0.43045	0.9480
freq38	111.0	0.336365	0.183123	0.0436	0.18185	0.3129	0.43000	0.8827
freq39	111.0	0.305221	0.161806	0.0227	0.18150	0.2900	0.42470	0.8116
freq40	111.0	0.292594	0.166338	0.0360	0.16005	0.2674	0.39915	0.7664
freq41	111.0	0.300975	0.168666	0.0328	0.17030	0.2778	0.41990	0.8246
freq42	111.0	0.276883	0.139806	0.0308	0.16770	0.2563	0.37395	0.7517
freq43	111.0	0.248106	0.144418	0.0255	0.14170	0.2037	0.32945	0.5772
freq44	111.0	0.245225	0.174053	0.0298	0.11640	0.1878	0.40425	0.7034
freq45	111.0	0.198804	0.151395	0.0080	0.09485	0.1651	0.24210	0.7292
freq46	111.0	0.146917	0.094474	0.0237	0.08010	0.1173	0.18290	0.5522
freq47	111.0	0.110594	0.067118	0.0041	0.06175	0.1005	0.13715	0.3339
freq48	111.0	0.063708	0.036382	0.0098	0.03665	0.0558	0.08710	0.1794
freq49	111.0	0.022721	0.014226	0.0044	0.01340	0.0194	0.02735	0.0825
freq50	111.0	0.019352	0.013528	0.0009	0.01250	0.0171	0.02385	0.1004
freq51	111.0	0.016014	0.010770	0.0013	0.00925	0.0132	0.01900	0.0709
freq52	111.0	0.011643	0.007658	0.0005	0.00595	0.0101	0.01645	0.0390
freq53	111.0	0.012185	0.008463	0.0010	0.00560	0.0096	0.01605	0.0352
freq54	111.0	0.009923	0.008404	0.0012	0.00380	0.0072	0.01355	0.0447
freq55	111.0	0.008914	0.006403	0.0007	0.00445	0.0074	0.01200	0.0394
freq56	111.0	0.007825	0.005884	0.0009	0.00370	0.0057	0.01065	0.0355
freq57	111.0	0.009060	0.007460	0.0006	0.00420	0.0070	0.01145	0.0440
freq58	111.0	0.008695	0.006917	0.0001	0.00355	0.0070	0.01135	0.0364
freq59	111.0	0.006930	0.005958	0.0006	0.00310	0.0053	0.00920	0.0439

	age	bp	sg	al	bgr	bu	sc	sod	pot	hemo	class
3	48.0	70.0	1.005	4.0	117.0	56.0	3.8	111.0	2.5	11.2	ckd
5	60.0	90.0	1.015	3.0	74.0	25.0	1.1	142.0	3.2	12.2	ckd
6	68.0	70.0	1.010	0.0	100.0	54.0	24.0	104.0	4.0	12.4	ckd
9	53.0	90.0	1.020	2.0	70.0	107.0	7.2	114.0	3.7	9.5	ckd
11	63.0	70.0	1.010	3.0	380.0	60.0	2.7	131.0	4.2	10.8	ckd

	count	mean	std	min	25%	50%	75%	max
age	228.0	51.688596	15.866582	6.000	42.000	54.50	63.000	90.000
bp	228.0	75.131579	11.476091	50.000	70.000	80.00	80.000	110.000
sg	228.0	1.018684	0.005712	1.005	1.015	1.02	1.025	1.025
al	228.0	0.890351	1.367107	0.000	0.000	0.00	2.000	4.000
su	228.0	0.377193	0.974479	0.000	0.000	0.00	0.000	5.000
bgr	228.0	140.583333	76.804730	22.000	96.750	117.00	140.000	490.000
bu	228.0	53.978509	44.977444	1.500	26.000	41.00	54.250	309.000
sc	228.0	2.500877	3.402137	0.400	0.800	1.20	2.625	24.000
sod	228.0	138.412281	7.053367	104.000	135.000	139.00	142.000	150.000
pot	228.0	4.538596	2.920488	2.500	3.800	4.40	4.900	47.000
hemo	228.0	13.110088	2.851746	3.100	10.975	13.60	15.200	17.800

alpha	fit_prior	train_accuracy	test_accuracy
0.1	True	0.996923	0.980861
0.2	True	0.996667	0.982057
0.3	True	0.995385	0.982656
0.4	True	0.995385	0.983254
0.5	True	0.994615	0.983852
1.0	True	0.994615	0.982656
0.6	True	0.994359	0.983254
0.7	True	0.994103	0.982656
0.9	True	0.993846	0.981459
0.8	True	0.993846	0.982057
0.1	False	0.993333	0.971890
0.2	False	0.991282	0.972488
0.3	False	0.990769	0.970096
0.4	False	0.990513	0.968301
0.5	False	0.990000	0.967703
0.6	False	0.987692	0.968301
0.8	False	0.987692	0.967703
0.7	False	0.987179	0.968301
0.9	False	0.986667	0.968301
1.0	False	0.986410	0.968301
0.0	False	0.865128	0.867823
0.0	True	0.865128	0.867823

	freq0	freq1	freq2	freq3	freq4	freq5	freq6	freq7	freq8	freq9	...	freq51	freq52	freq53	freq54	freq55	freq56	freq57	freq58	freq59	class
0	0.0200	0.0371	0.0428	0.0207	0.0954	0.0986	0.1539	0.1601	0.3109	0.2111	...	0.0027	0.0065	0.0159	0.0072	0.0167	0.0180	0.0084	0.0090	0.0032	R
1	0.0453	0.0523	0.0843	0.0689	0.1183	0.2583	0.2156	0.3481	0.3337	0.2872	...	0.0084	0.0089	0.0048	0.0094	0.0191	0.0140	0.0049	0.0052	0.0044	R
2	0.0262	0.0582	0.1099	0.1083	0.0974	0.2280	0.2431	0.3771	0.5598	0.6194	...	0.0232	0.0166	0.0095	0.0180	0.0244	0.0316	0.0164	0.0095	0.0078	R
3	0.0100	0.0171	0.0623	0.0205	0.0205	0.0368	0.1098	0.1276	0.0598	0.1264	...	0.0121	0.0036	0.0150	0.0085	0.0073	0.0050	0.0044	0.0040	0.0117	R
4	0.0762	0.0666	0.0481	0.0394	0.0590	0.0649	0.1209	0.2467	0.3564	0.4459	...	0.0031	0.0054	0.0105	0.0110	0.0015	0.0072	0.0048	0.0107	0.0094	R

	count	mean	std	min	25%	50%	75%	max
age	101.0	57.673267	14.424361	6.000	49.00	60.000	67.000	90.000
bp	101.0	79.801980	12.882562	50.000	70.00	80.000	90.000	110.000
sg	101.0	1.013861	0.004893	1.005	1.01	1.015	1.015	1.025
al	101.0	2.009901	1.403532	0.000	1.00	2.000	3.000	4.000
su	101.0	0.851485	1.322015	0.000	0.00	0.000	2.000	5.000
bgr	101.0	182.386139	98.828029	22.000	107.00	157.000	242.000	490.000
bu	101.0	80.387129	56.179362	1.500	40.00	64.000	107.000	309.000
sc	101.0	4.549505	4.310490	0.600	1.70	2.800	5.600	24.000
sod	101.0	134.207921	7.276423	104.000	132.00	136.000	139.000	145.000
pot	101.0	4.805941	4.333909	2.500	3.90	4.200	4.900	47.000
hemo	101.0	10.532673	2.071092	3.100	9.40	10.600	12.000	16.100

	count	mean	std	min	25%	50%	75%	max
age	127.0	46.929134	15.390773	12.00	35.00	47.000	58.000	80.000
bp	127.0	71.417323	8.611953	60.00	60.00	70.000	80.000	80.000
sg	127.0	1.022520	0.002510	1.02	1.02	1.025	1.025	1.025
al	127.0	0.000000	0.000000	0.00	0.00	0.000	0.000	0.000
su	127.0	0.000000	0.000000	0.00	0.00	0.000	0.000	0.000
bgr	127.0	107.338583	19.005104	70.00	93.00	108.000	123.500	140.000
bu	127.0	32.976378	11.664942	10.00	23.00	33.000	44.000	50.000
sc	127.0	0.871654	0.258167	0.40	0.60	0.900	1.100	1.200
sod	127.0	141.755906	4.708514	135.00	138.00	141.000	146.000	150.000
pot	127.0	4.325984	0.596978	3.30	3.70	4.500	4.900	5.000
hemo	127.0	15.159843	1.298917	13.00	14.05	15.000	16.200	17.800

C	class_weight	penalty	train_score	test_score
1.00000	balanced	l2	1.000000	0.985507
10.00000	None	l2	1.000000	0.985507
1.00000	None	l2	0.993711	0.985507
10.00000	balanced	l2	1.000000	0.985507
0.10000	balanced	l2	0.968553	0.985507
100.00000	None	l2	1.000000	0.985507
100.00000	balanced	l2	1.000000	0.985507
10.00000	balanced	l1	1.000000	0.971014
100.00000	None	l1	1.000000	0.971014
100.00000	balanced	l1	1.000000	0.971014
1000.00000	None	l2	1.000000	0.971014
1000.00000	None	l1	1.000000	0.971014
1000.00000	balanced	l2	1.000000	0.971014
1000.00000	balanced	l1	1.000000	0.971014
10000.00000	None	l2	1.000000	0.971014
10000.00000	None	l1	1.000000	0.971014
10000.00000	balanced	l2	1.000000	0.971014
10000.00000	balanced	l1	1.000000	0.971014
100000.00000	None	l2	1.000000	0.971014
100000.00000	None	l1	1.000000	0.971014
100000.00000	balanced	l2	1.000000	0.971014
10.00000	None	l1	1.000000	0.971014
100000.00000	balanced	l1	1.000000	0.971014
0.10000	None	l2	0.968553	0.971014
0.01000	balanced	l2	0.968553	0.971014
0.01000	None	l2	0.962264	0.956522
0.00001	balanced	l2	0.968553	0.956522
0.00010	balanced	l2	0.968553	0.956522
0.00100	balanced	l2	0.968553	0.956522
1.00000	balanced	l1	1.000000	0.942029
0.00010	None	l2	0.962264	0.942029
0.00100	None	l2	0.962264	0.942029
0.00001	None	l2	0.962264	0.942029
1.00000	None	l1	1.000000	0.942029
0.10000	None	l1	0.981132	0.927536
0.10000	balanced	l1	0.981132	0.927536
0.00001	None	l1	0.547170	0.579710
0.01000	None	l1	0.547170	0.579710
0.00100	None	l1	0.547170	0.579710
0.01000	balanced	l1	0.547170	0.579710
0.00010	balanced	l1	0.547170	0.579710
0.00010	None	l1	0.547170	0.579710
0.00001	balanced	l1	0.547170	0.579710
0.00100	balanced	l1	0.547170	0.579710