Classification problems have long been at the heart of applied statistics, but it wasn't until the advent of the computer that their power really began to manifest. Some of the simplest algorithms are Naive Bayes, K-nearest Neighbors, and Logistic Regression. Calling these algorithms "simple" by no means implies that they are ineffective, in fact, due to their simplicity and speed, they are often employed in workflows. In this exercise we'll explore these classifiers and compare the performance against one another.
Naive Bayes classifiers have been studied extensively since the 1950s and 1960s, notably in the domain of text classification, being especially useful for email spam detection. They work by assuming a prior probability distribution in each of the classes, and then update each distribution given the training data using Bayes's Theorem given by $${\displaystyle p(C_{k}\mid \mathbf {x} )={\frac {p(C_{k})\ p(\mathbf {x} \mid C_{k})}{p(\mathbf {x} )}}\,}$$ where $C_{k}$ is the kth class, and $x$ is the training data. The "naive" part of the algorithm comes into play with the assumption that each data point is independent of the previous, leading to the nice property that the joint pobability can be computed as the simple product. New data is then classified by implementing a decision rule, typically the probabiliy of the new data point belonging to each class is computed and the largest is chosen.
1 - Head over to the Machine Learning Repository, download the SMS Spam Collection Data Set, put it into a dataframe, process it as a bag-of-words using word counts, and split into training and test sets. Be sure to familiarize yourself with the data before proceeding.
In [1]:
import pandas as pd
import numpy as np
In [2]:
# read data
spam = pd.read_csv('SMSSpamCollection', sep='\t', header=None, encoding='latin-1')
spam.columns = ['class', 'text']
In [3]:
# create bag-of-words counts
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
wordcount = vec.fit_transform(spam.text)
# create X and y matrices
X_spam = wordcount.toarray()
y_spam = np.array(spam['class'] == 'spam', dtype=int)
In the solution: first split then fitting transform to train and applying to test:
(Also, I don't need to transform y in 1s and 0s array.
In [4]:
# cv = CountVectorizer()
# X_train = cv.fit_transform(X_train)
# X_test = cv.fit_transform(X_test)
In [5]:
# train and test split
from sklearn.model_selection import train_test_split
X_spam_train, X_spam_test, y_spam_train, y_spam_test = train_test_split(X_spam, y_spam, test_size=0.3, random_state = 0)
2 - Which of the available Naive Bayes Classifiers is most appropriate to the data? Choose one and fit it to the data, using the default hyperparameter settings, and report the training and testing accuracies. Comment on your results.
I would use a multinomial classifier, since we are dealing with counts:
In [6]:
from sklearn.naive_bayes import MultinomialNB
NB = MultinomialNB()
NB.fit(X_spam_train, y_spam_train)
Out[6]:
In [7]:
print('Training accuracy: {:.5f}'.format(NB.score(X_spam_train, y_spam_train)))
print('Testing accuracy: {:.5f}'.format(NB.score(X_spam_test, y_spam_test)))
Seems a pretty good accuracy, in and out of sample.
3 - Try a few different settings of the alpha
parameter, printing the training and testing accuracies. Also try switching the fit_prior
parameter on and off. Comment on your results.
In [8]:
# alpha can be greater than 1!
NBparams = pd.DataFrame(columns=['alpha', 'fit_prior', 'train_accuracy', 'test_accuracy'])
for alpha in np.arange(0, 1.1, 0.1):
for fit_prior in [True, False]:
NB.set_params(alpha=alpha, fit_prior=fit_prior)
NB.fit(X_spam_train, y_spam_train)
NBparams = NBparams.append(pd.DataFrame([[alpha,
fit_prior,
NB.score(X_spam_train, y_spam_train),
NB.score(X_spam_test, y_spam_test)]],
columns=['alpha', 'fit_prior', 'train_accuracy', 'test_accuracy']))
In [9]:
NBparams.sort_values(by='train_accuracy', ascending=False)
Out[9]:
In [10]:
NBparams.sort_values(by='test_accuracy', ascending=False)
Out[10]:
The best models all have the fit prior parameter set to True. Also, we see that the best train accuracy is obtained with small $\alpha$, but that doesn't translate to the best test accuracies that are, in general, obtained with $\alpha$ near to 1. The documentation says that "the smoothing priors $\alpha \ge 0$ accounts for features not present in the learning samples and prevents zero probabilities in further computations"; so I would say that small values of $\alpha$ cause some overfitting of the train set while higher values works better because we have a sparse matrix of features.
That being said, the accuracy is high in general.
We can also observe that the models trained with $\alpha = 0$ are by far the worst ones.
4 - Generate a word cloud for the ham
and spam
words.
In [11]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [12]:
from wordcloud import WordCloud
# counting ham words
hamvec = CountVectorizer()
hamcount = hamvec.fit_transform(spam.loc[spam['class'] == 'ham', 'text'].values)
# creating a series containing word and relative count
hamcountdict = {}
for word, count in zip(hamvec.get_feature_names(), np.asarray(hamcount.sum(axis=0)).ravel()):
hamcountdict[word] = count
hamcountseries = pd.Series(hamcountdict)
hamcountseries / hamcountseries.sum()
# counting spam words
spamvec = CountVectorizer()
spamcount = spamvec.fit_transform(spam.loc[spam['class'] == 'spam', 'text'].values)
# creating a series containing word and relative count
spamcountdict = {}
for word, count in zip(spamvec.get_feature_names(), np.asarray(spamcount.sum(axis=0)).ravel()):
spamcountdict[word] = count
spamcountseries = pd.Series(spamcountdict)
spamcountseries / spamcountseries.sum()
# creating two wordclouds from the frequencies calculated on the series
wordcloudham = WordCloud(background_color='white', max_words=len(hamvec.vocabulary_)).generate_from_frequencies(hamcountseries / hamcountseries.sum())
wordcloudspam = WordCloud(background_color='white', max_words=len(spamvec.vocabulary_)).generate_from_frequencies(spamcountseries / spamcountseries.sum())
In [13]:
# # solution:
# import nltk
# from nltk.corpus import stopwordswords
# # Split ham and spam rows
# ham = data[data['class'] == 'ham'].text
# spam = data[data['class'] == 'spam'].text
# # Get counts, removing stopwords
# ham_words = ''
# for row in ham:
# test = row.lower()
# tokens = nltk.word_tokenize(text)
# for words in tokens:
# ham_words = ham_words + words + ' '
# spam_words = ''
# for row in spam:
# test = row.lower()
# tokens = nltk.word_tokenize(text)
# for words in tokens:
# spam_words = spam_words + words + ' '
# # Generate a word cloud image
# ham_wordcloud = WordCloud(width=600, height=400).generate(ham_words)
# spam_wordcloud = WordCloud(width=600, height=400).generate(spam_words)
In [14]:
size = 20
fig = plt.figure(figsize=(size, size))
ax = plt.axes()
ax.set_title('Ham Words', size=40)
ax.imshow(wordcloudham, interpolation='bilinear')
ax.axis('off');
In [15]:
size = 20
fig = plt.figure(figsize=(size, size))
ax = plt.axes()
ax.set_title('Spam Words', size=40)
ax.imshow(wordcloudspam, interpolation='bilinear')
ax.axis('off');
5 - Naive Bayes can also be used for classification of continous data. Generate a set of random data points, $x1$ and $x2$, classifiying each point as being above or below the line f(x) = x
, fit a Naive Bayes model, report the accuracy, and plot the decision boundary. Comment on your results.
In [16]:
# generate data
np.random.seed(687)
x1 = np.random.rand(100)
x2 = np.random.rand(100)
X_cont = np.vstack((x1, x2)).T
y_cont = np.array((x1 >= x2), dtype=int)
In [17]:
from sklearn.naive_bayes import GaussianNB
# fit model
NB_cont = GaussianNB()
NB_cont.fit(X_cont, y_cont)
# get accuracy
NB_cont.score(X_cont, y_cont)
Out[17]:
In [18]:
def plot_decision_boundary(X, y, classifier, resolution):
# plot decision boundaries
cmap = plt.cm.get_cmap('viridis')
markers = ('o', '^', 's', 'x', 'v')
x1_min, x1_max = X[:, 0].min() - 0.1, X[:, 0].max() + 0.1
x2_min, x2_max = X[:, 1].min() - 0.1, X[:, 1].max() + 0.1
xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
np.arange(x2_min, x2_max, resolution))
Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
Z = Z.reshape(xx1.shape)
plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
plt.xlim(xx1.min(), xx1.max())
plt.ylim(xx2.min(), xx2.max())
# scatter points
for idx, cl in enumerate(np.unique(y)):
plt.scatter(x=X[y == cl, 0],
y=X[y == cl, 1],
alpha=0.6,
c=np.array(cmap.colors[idx]),
edgecolor='black',
marker=markers[idx],
label='cl'
)
In [19]:
plot_decision_boundary(X_cont, y_cont, NB_cont, 0.002)
Another one of the simplest of classificaiton algorithms is the K-Neareast Neighbors. KNN works by simply classifying a datapoint by its proximity to other datapoints of known category. There are a few different evaluation metrics to choose for "distance" as well as different implementations of the algorithm, but the idea is always the same.
1 - Generate a set of 100 random data points according to the function ${\displaystyle f(x) = N(x\;|\;\mu ,\sigma ^{2}) + \epsilon}$ where $\mu = 0$, $\sigma = 1$, and $\epsilon$ is a noise term. Then classify each point as being above or below the curve, fit a KNN classifier, using the default values, and plot the decision boundary and report the accuracy. Comment on your results.
In [20]:
from scipy.stats import norm
# generate data
np.random.seed(459)
a = -3
b = 3
# in solution it uses a linspace between -3 and 3
x1_KNN = (b - a)*np.random.rand(100) + a
a = -0.5
b = 0.5
noise = (b - a)*np.random.rand(100) + a
x2_KNN = norm.pdf(x1_KNN) + noise
X_KNN = np.vstack((x1_KNN, x2_KNN)).T
y_KNN = np.array((x2_KNN >= norm.pdf(x1_KNN)), dtype=int)
In [21]:
from sklearn.neighbors import KNeighborsClassifier
# fit model
KNN = KNeighborsClassifier()
KNN.fit(X_KNN, y_KNN)
# get accuracy
KNN.score(X_KNN, y_KNN)
Out[21]:
In [22]:
plot_decision_boundary(X_KNN, y_KNN, KNN, 0.002)
2 - Using the data generated in part (1), fit a series of KNN models, adjusting the n_neighbors
parameter, and plot the decision boundary and accuracy for each. Comment on your results.
In [23]:
n_neighbors = np.arange(1, 11)
for n in n_neighbors:
# set params
KNN.set_params(n_neighbors=n)
# fit
KNN.fit(X_KNN, y_KNN)
# get accuracy
print('KNN with {} neighbors accuracy: {:.5f}'.format(n, KNN.score(X_KNN, y_KNN)))
# plot boundaries
fig, ax = plt.subplots(1, 1)
ax.set_title('KNN with {} neighbors'.format(n))
plot_decision_boundary(X_KNN, y_KNN, KNN, 0.002)
The best result is for 1 neighbors, but by looking at the decision boundaries we may get the best generalization for 2 neighbors. We also get good results for $n$ between 2 and 7, while performances decrease for 8 or more neighbors.
3 - Head over to the Machine Learning Repository, download the Connectionist Bench (Sonar, Mines vs. Rocks) Data Set, put it into a data frame, and split into training and testing sets. Be sure to familiarize yourself with the data before proceeding.
In [24]:
sonar = pd.read_csv('sonar.all-data', header=None)
sonar.columns = ['freq'+str(i) for i in range(60)] + ['class']
print(sonar.shape)
sonar.head()
Out[24]:
In [25]:
sonar.describe().T
Out[25]:
In [26]:
sonar.loc[sonar['class'] == 'R'].describe().T
Out[26]:
In [27]:
sonar.loc[sonar['class'] == 'M'].describe().T
Out[27]:
In [28]:
X_sonar = sonar.iloc[:, :-1].values
y_sonar = sonar.iloc[:, -1].values
X_sonar_train, X_sonar_test, y_sonar_train, y_sonar_test = train_test_split(X_sonar, y_sonar, test_size=0.3, random_state = 42)
4 - Fit a KNN classifier using the default settings and report the training and testing accuracies.
In [29]:
KNN_sonar = KNeighborsClassifier()
KNN_sonar.fit(X_sonar_train, y_sonar_train)
print('Train accuracy: {:.5f}'.format(KNN_sonar.score(X_sonar_train, y_sonar_train)))
print('Test accuracy: {:.5f}'.format(KNN_sonar.score(X_sonar_test, y_sonar_test)))
# for random_State=42 test greater than train? mmh, strange... but for random_state=0 it's smaller
5 - Fit a series of KNN classifiers to the data, adjusting the n_neighbors
parameter, reporting the training and testing accuracies for each. Comment on your results.
In [29]:
n_neighbors = np.arange(1, 11)
for n in n_neighbors:
# set params
KNN_sonar.set_params(n_neighbors=n)
# fit
KNN_sonar.fit(X_sonar_train, y_sonar_train)
# get accuracies
print('KNN with {} neighbors train accuracy: {:.5f}'.format(n, KNN_sonar.score(X_sonar_train, y_sonar_train)))
print('KNN with {} neighbors test accuracy: {:.5f}'.format(n, KNN_sonar.score(X_sonar_test, y_sonar_test)))
We have this strange phenomenon (to me at least) that for $n$ between 3 and 5 and $n = 10$ the test accuracy is greater than the train one.
As before, the best performance is for 1 neighbor and all the small $n$ give good accuracy scores.
6 - Repeat part (5), but this time adjusting the metric
parameter. Comment on your results.
In [30]:
n_neighbors = np.arange(1, 11)
for n in n_neighbors:
# set params
metric = 'chebyshev'
KNN_sonar.set_params(n_neighbors=n, metric=metric)
# fit
KNN_sonar.fit(X_sonar_train, y_sonar_train)
# get accuracies
print('KNN with {} neighbors and {} metric train accuracy: {:.5f}'.format(n,
metric,
KNN_sonar.score(X_sonar_train, y_sonar_train)))
print('KNN with {} neighbors and {} metric test accuracy: {:.5f}'.format(n,
metric,
KNN_sonar.score(X_sonar_test, y_sonar_test)))
for p in np.arange(1, 6):
# set params
metric = 'minkowski'
KNN_sonar.set_params(metric=metric, p=p)
# fit
KNN_sonar.fit(X_sonar_train, y_sonar_train)
# get accuracies
print('KNN with {} neighbors and {} {}-metric train accuracy: {:.5f}'.format(n,
metric,
p,
KNN_sonar.score(X_sonar_train, y_sonar_train)))
print('KNN with {} neighbors and {} {}-metric test accuracy: {:.5f}'.format(n,
metric,
p,
KNN_sonar.score(X_sonar_test, y_sonar_test)))
The best scores in general are for Minkovski's metrics with small $n$ and $p$ between 1 and 3. The best test accuracy although is for $n = 1$ and $p = 4$.
The final classification algorithm we'll work with here is Logistic Regression. As its name implies, this technique makes use of the Logistic Function given by $$\sigma (t)={\frac {e^{t}}{e^{t}+1}}={\frac {1}{1+e^{-t}}}$$ which forms a bit of an S-shaped curve. The intuition behind the this model is that, given a set of predictor variables, we want our response to collapse to a binary output, 0
or 1
. As it turns out, this technique, although one of the oldest, tends to provide good results in a variety of problems, and is lightweight.
1 - Create a set of 100 random datapoints separated by the line $f(x) = x + \epsilon$, where $\epsilon$ is a noise term, classifying points as lying above or below the curve. Fit a Logistic Regression model to your data, report the accuracy, and plot the decision boundary. Comment on your results.
In [31]:
# generate data
np.random.seed(459)
a = -3
b = 3
# as before in solution linspace is used
x1_lr = (b - a)*np.random.rand(100) + a
a = -0.5
b = 0.5
noise = (b - a)*np.random.rand(100) + a
x2_lr = x1_lr + noise
X_lr = np.vstack((x1_lr, x2_lr)).T
y_lr = np.array((x2_lr >= x1_lr), dtype=int)
In [32]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_lr, y_lr)
lr.score(X_lr, y_lr)
Out[32]:
In [33]:
plot_decision_boundary(X_lr, y_lr, lr, 0.002)
2 - Repeat part (1), but this time having $f(x) = sin(x) + \epsilon$.
In [34]:
# generate data
np.random.seed(459)
a = -3
b = 3
x1_lr = (b - a)*np.random.rand(100) + a
a = -0.5
b = 0.5
noise = (b - a)*np.random.rand(100) + a
x2_lr = np.sin(x1_lr) + noise
X_lr = np.vstack((x1_lr, x2_lr)).T
y_lr = np.array((x2_lr >= np.sin(x1_lr)), dtype=int)
In [35]:
lr = LogisticRegression()
lr.fit(X_lr, y_lr)
lr.score(X_lr, y_lr)
Out[35]:
In [36]:
plot_decision_boundary(X_lr, y_lr, lr, 0.002)
3 - Head over to the Machine Learning Repository, download the Chronic_Kidney_Disease Data Set, put it into a data frame, drop all categorical variables, and split into training and testing sets. Be sure to familiarize yourself with the data before proceeding.
In [37]:
kidney = pd.read_csv('chronic_kidney_disease_full.arff', header=None)
kidney.columns = ['age','bp','sg','al','su','rbc','pc','pcc','ba','bgr','bu','sc','sod',
'pot','hemo','pcv','wc','rc','htn','dm','cad','appet','pe','ane','class']
object_cols = [col for col in kidney.columns if kidney[col].dtype == object and col != 'class']
kidney.drop(object_cols, axis=1, inplace=True)
# solution: .drop(data.dtypes[data.dtypes == 'object'].index[:-1], axis=1)
kidney.dropna(inplace=True)
kidney.info()
In [38]:
kidney.head()
Out[38]:
In [39]:
kidney.describe().T
Out[39]:
In [40]:
kidney.loc[kidney['class'] == 'ckd'].describe().T
Out[40]:
In [41]:
kidney.loc[kidney['class'] == 'notckd'].describe().T
Out[41]:
In [42]:
from sklearn.preprocessing import StandardScaler
# create matrices and setting classes to 0, 1
X_kidney = kidney.iloc[:, :-1].values
y_kidney = np.array([1 if cl == 'ckd' else 0 for cl in kidney.iloc[:, -1].values])
# train-test split
X_kidney_train, X_kidney_test, y_kidney_train, y_kidney_test = train_test_split(X_kidney, y_kidney, test_size=0.3, random_state=75)
# scaling
sc = StandardScaler()
sc.fit_transform(X_kidney_train)
X_kidney_train_std = sc.transform(X_kidney_train)
X_kidney_test_std = sc.transform(X_kidney_test)
4 - Fit a Logistic Regression model to the data, report the training and testing accuracies, and comment on your results.
In [43]:
lr_kidney = LogisticRegression()
lr_kidney.fit(X_kidney_train_std, y_kidney_train)
print('Train accuracy: {:.5f}'.format(lr_kidney.score(X_kidney_train_std, y_kidney_train)))
print('Test accuracy: {:.5f}'.format(lr_kidney.score(X_kidney_test_std, y_kidney_test)))
The model does a pretty good job classifying the patients with accuracies over 98.5% both in training and test set.
5 - Fit a series of Logistic Regression models for different values of the C
hyperparameter of differing orders of magnitude, reporting the training and testing accuracies of each. Comment on your results.
In [44]:
magnitudes = [10**i for i in range(-5, 6)]
for C in magnitudes:
lr_kidney.set_params(C=C)
lr_kidney.fit(X_kidney_train_std, y_kidney_train)
print('Train accuracy for C = {}: {:.5f}'.format(C, lr_kidney.score(X_kidney_train_std, y_kidney_train)))
print('Test accuracy for C = {}: {:.5f}'.format(C, lr_kidney.score(X_kidney_test_std, y_kidney_test)))
We can see that lowering the regularization the model fits perfectly the training set but it doesn't perform that well on the test set. In general the performances are very good for all the parameters, the best choice is $C = 1$, which is also the default.
6 - Experiment with different parameter settings to see if you can improve classification. Be sure to report the training and testing accuracies and comment on your results.
In [45]:
cols = ['C', 'class_weight', 'penalty', 'train_score', 'test_score']
lrparams = pd.DataFrame(columns=cols)
magnitudes = [10**i for i in range(-5, 6)]
for C in magnitudes:
# set params
lr_kidney.set_params(C=C)
for class_weight in [None, 'balanced']:
# set params
lr_kidney.set_params(class_weight=class_weight)
for penalty in ['l2', 'l1']:
# set params
lr_kidney.set_params(penalty=penalty)
# fit model
lr_kidney.fit(X_kidney_train_std, y_kidney_train)
# get accuracy
lrparams = lrparams.append(pd.DataFrame([[C,
class_weight,
penalty,
lr_kidney.score(X_kidney_train_std, y_kidney_train),
lr_kidney.score(X_kidney_test_std, y_kidney_test)]],
columns=cols))
In [46]:
lrparams.sort_values(by='test_score', ascending=False)
Out[46]:
In general l2 norms work better than l1, we have a few other optimal settings but they are tied to the first one in test set accuracy.
1 - Create a set of 100 datapoints which are linearly separable, classifying the data as being above or below the curve. Fit a Naive Bayes, KNN, and Logistic Regression model to your data, reporting the accuracies and plotting the decision boundaries. Comment on your results.
In [47]:
# generate data
np.random.seed(459)
a = -3
b = 3
x1_linsep = (b - a)*np.random.rand(100) + a
a = -0.5
b = 0.5
noise = (b - a)*np.random.rand(100) + a
x2_linsep = x1_linsep + noise
X_linsep = np.vstack((x1_linsep, x2_linsep)).T
y_linsep = np.array((x2_linsep >= x1_linsep), dtype=int)
In [48]:
names = ['Gaussian Naive Bayes', 'K-Neighbours Classifier', 'Logistic Regression']
models = [GaussianNB(), KNeighborsClassifier(), LogisticRegression()]
for name, model in zip(names, models):
# fit model
model.fit(X_linsep, y_linsep)
print(name)
# get accuracy
print('Accuracy: {:.5f}'.format(model.score(X_linsep, y_linsep)))
# plot boundaries
fig, ax = plt.subplots(1, 1)
ax.set_title(name)
plot_decision_boundary(X_linsep, y_linsep, model, 0.002)
Logistic regression works best, obviously. Maybe by tweaking the parameters we can get KNN to work as well (it seems to follow to much the noise, some regularization may help), while Naive Bayes seems way out in its predictions.
2 - Create a set of 200 random data points that lie on a unit circle displaced by some noise, categorizing each as being inside or outside the circle. Stated analytically, select $x$ and $y$ such that $x^2 + y^2 = 1 + \epsilon$ where $\epsilon$ is a noise term. Fit a Naive Bayes, KNN, and Logistic Regression model to your data, reporting the accuracies and plotting the decision boundaries. Comment on your results.
In [49]:
# generate data
np.random.seed(459)
a = 0
b = 2*np.pi
angles = (b - a)*np.random.rand(200) + a
# generate noise
a = -0.5
b = 0.5
x_noise = (b - a)*np.random.rand(200) + a
y_noise = (b - a)*np.random.rand(200) + a
# fit angles to unit circle using sine and cosine
x1_circle = np.cos(angles) + x_noise
x2_circle = np.sin(angles) + y_noise
X_circle = np.vstack((x1_circle, x2_circle)).T
y_circle = np.array((x1_circle**2 + x2_circle**2 <= 1), dtype=int)
In [50]:
names = ['Gaussian Naive Bayes', 'K-Neighbours Classifier', 'Logistic Regression']
models = [GaussianNB(), KNeighborsClassifier(), LogisticRegression()]
for name, model in zip(names, models):
# fit model
model.fit(X_circle, y_circle)
print(name)
# get accuracy
print('Accuracy: {:.5f}'.format(model.score(X_circle, y_circle)))
# plot boundaries
fig, ax = plt.subplots(1, 1)
ax.set_title(name)
plot_decision_boundary(X_circle, y_circle, model, 0.002)
I expected better from Naive Bayes, instead the boundary seems a circle but it is too small. The KNN is doing really well while, obviously again, logistic regression is the worst model for this data (and I really don't know what's going on with the plot...).
3 - Using the SMS Spam
data set above, fit a KNN and Logistic Regression model to the data and report your training and testing accuracies. Comment on your results.
In [51]:
names = ['Multinomial Naive Bayes', 'K-Neighbours Classifier', 'Logistic Regression']
models = [MultinomialNB(), KNeighborsClassifier(), LogisticRegression()]
for name, model in zip(names, models):
# fit model
model.fit(X_spam_train, y_spam_train)
print(name)
# get accuracy
print('Train accuracy: {:.5f}'.format(model.score(X_spam_train, y_spam_train)))
print('Test accuracy: {:.5f}'.format(model.score(X_spam_test, y_spam_test)))
As expected multinomial is the best model, but logistic regression does equally good, while KNN, despite doing a pretty good too, takes a real long time to calculate predictions.
4 - Using the Sonar
data set above, fit a Naive Bayes and Logistic Regression model to the data and report your training and testing accuracies. Comment on your results.
In [52]:
names = ['K-Neighbours Classifier', 'Gaussian Naive Bayes', 'Logistic Regression']
models = [KNeighborsClassifier(), MultinomialNB(), LogisticRegression()]
for name, model in zip(names, models):
# fit model
model.fit(X_sonar_train, y_sonar_train)
print(name)
# get accuracy
print('Train accuracy: {:.5f}'.format(model.score(X_sonar_train, y_sonar_train)))
print('Test accuracy: {:.5f}'.format(model.score(X_sonar_test, y_sonar_test)))
Again the best model is the one used before (KNN), but logistic regression again performs quite well
5 - Using the Kidney
data set above, fit a Naive Bayes and KNN model to the data and report your training and testing accuracies. Comment on your results. What can you say about your results overall in parts (3), (4), and (5)?
In [53]:
names = ['Logistic Regression', 'K-Neighbours Classifier', 'Gaussian Naive Bayes']
models = [LogisticRegression(), KNeighborsClassifier(), MultinomialNB()]
for name, model in zip(names, models):
# fit model
model.fit(X_kidney_train, y_kidney_train)
print(name)
# get accuracy
print('Train accuracy: {:.5f}'.format(model.score(X_kidney_train, y_kidney_train)))
print('Test accuracy: {:.5f}'.format(model.score(X_kidney_test, y_kidney_test)))
All models perform quite well (and both the new models have higher test than training accuracy...), but none of them is near to the precision of logistic regression.
Overall we can say that the choice of a right model matters, especially with small datasets such as this; logistic regression seems to be the model that performs best across all datasets in this case.