Unlike linear regression, logistic regression is used for classification problems.
Binary classification with logistic regression
Ordinary linear regression assumes that the response variable is normally distributed.
The normal distribution, also known as the Gaussian distribution or bell curve, is a
function that describes the probability that an observation will have a value between
any two real numbers.
Normally distributed data is symmetrical. That is, half of the
values are greater than the mean and the other half of the values are less than the
mean. The mean, median, and mode of normally distributed data are also equal. Many natural phenomena approximately follow normal distributions.
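As a quick check of these properties, the following sketch (assuming only NumPy) draws samples from a normal distribution and confirms that the mean and median are approximately equal and that about half of the values exceed the mean:
In [ ]:
import numpy as np

# draw 100,000 samples from a normal distribution with mean 5 and standard deviation 2
rng = np.random.default_rng(0)
samples = rng.normal(loc=5, scale=2, size=100_000)
print(np.mean(samples))      # approximately 5
print(np.median(samples))    # approximately 5, matching the mean
print(np.mean(samples > 5))  # approximately 0.5: half of the values exceed the mean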
The Bernoulli distribution describes the probability distribution of a random variable that takes the positive case with probability P and the negative case with probability 1 - P. If the response variable represents a probability, it must be constrained to the range [0, 1].
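As a minimal illustration (assuming SciPy is available), the Bernoulli probability mass function returns P for the positive case and 1 - P for the negative case:
In [ ]:
from scipy.stats import bernoulli

P = 0.3
print(bernoulli.pmf(1, P))  # positive case: 0.3
print(bernoulli.pmf(0, P))  # negative case: 1 - 0.3 = 0.7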
Linear regression assumes that a constant change in the value of an explanatory variable results in a constant change in the value of the response variable, an assumption that does not hold if the value of the response variable represents a probability. Generalized linear models remove this assumption by relating a linear combination of the explanatory variables to the response variable using a link function.
In logistic regression, the response variable describes the probability that the outcome is the positive case. If the response variable is equal to or exceeds a discrimination threshold, the positive class is predicted; otherwise, the negative class is predicted. The response variable is modeled as a function of a linear combination of the explanatory variables using the logistic function. Given by the following equation, the logistic function always returns a value between zero and one:
$$ F(t) = \frac{1}{1+e^{-t}} $$

For logistic regression, t is equal to a linear combination of the explanatory variables, as follows:

$$ t = \beta_0 + \beta x $$

The logit function is the inverse of the logistic function. It links F(x) back to a linear combination of the explanatory variables:

$$ g(x) = \ln\frac{F(x)}{1-F(x)} = \beta_0 + \beta x $$
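To make the relationship concrete, the following sketch (assuming NumPy; the helper names logistic and logit are ours, not scikit-learn's) evaluates the logistic function and checks that the logit recovers the original linear combination:
In [ ]:
import numpy as np

def logistic(t):
    # maps any real number to a value strictly between zero and one
    return 1.0 / (1.0 + np.exp(-t))

def logit(p):
    # inverse of the logistic function: maps a probability back to the real line
    return np.log(p / (1.0 - p))

t = np.array([-2.0, 0.0, 2.0])
p = logistic(t)
print(p)         # values strictly between 0 and 1
print(logit(p))  # recovers the original t: [-2.  0.  2.]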
In [27]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, f1_score, roc_curve, precision_score, auc
%matplotlib inline
In [2]:
# read the SMS Spam Collection data set (tab-separated, no header row)
sms = pd.read_csv("data/SMSSpamCollection", delimiter="\t", header=None)
len(sms)
In [3]:
print(sms.head())
In [4]:
print("No of ham msg : ", sms[sms[0]=='ham'][0].count())
print("No of spam msg : ", sms[sms[0]=='spam'][0].count())
In [5]:
# split the data into training and test sets
X_train_raw, X_test_raw, y_train, y_test = train_test_split(sms[1], sms[0], test_size=0.25)
print(X_train_raw.shape, X_test_raw.shape)
# create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()
# fit the vectorizer on the training messages and transform them
X_train = vectorizer.fit_transform(X_train_raw)
# transform the test messages with the fitted vectorizer
X_test = vectorizer.transform(X_test_raw)
print(X_train.shape, X_test.shape)
# create the classifier
classifier = LogisticRegression()
# fit the classifier on the training data
classifier.fit(X_train, y_train)
# predict labels for the test data
y_pred = classifier.predict(X_test)
print(y_pred.shape)
for i, test in enumerate(X_test_raw[:10]):
    print(y_pred[i], " ", test)
Binary classification performance metrics
A variety of metrics exist to evaluate the performance of binary classifiers against trusted labels. The most common metrics are accuracy, precision, recall, F1 measure, and ROC AUC score. All of these measures depend on the concepts of true positives, true negatives, false positives, and false negatives.
A confusion matrix, or contingency table, can be used to visualize true and false positives and negatives. The rows of the matrix are the true classes of the instances, and the columns are the predicted classes of the instances:
In [6]:
# map the string labels to binary values: ham -> 0, spam -> 1
y_test = [0 if i == 'ham' else 1 for i in y_test]
y_pred = [0 if i == 'ham' else 1 for i in y_pred]
# store the matrix in cm so the confusion_matrix function is not shadowed
cm = confusion_matrix(y_test, y_pred)
print(cm)
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
Accuracy
Accuracy measures the fraction of the classifier's predictions that are correct.
In [10]:
print(accuracy_score(y_test, y_pred))
Note that your accuracy may differ as the training and test sets are assigned randomly. While accuracy measures the overall correctness of the classifier, it does not distinguish between false positive errors and false negative errors. Some applications may be more sensitive to false negatives than false positives, or vice versa. Furthermore, accuracy is not an informative metric if the proportions of the classes are skewed in the population.
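To see this concretely, here is a minimal sketch (with made-up labels, not our SMS data) in which a trivial classifier that always predicts the negative class scores 99 percent accuracy while never identifying a single positive instance:
In [ ]:
from sklearn.metrics import accuracy_score

# 1,000 instances, only 10 of which belong to the positive class
y_true = [1] * 10 + [0] * 990
# a trivial classifier that predicts the negative class every time
y_trivial = [0] * 1000
print(accuracy_score(y_true, y_trivial))  # 0.99, yet every positive instance is missed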
For example, a classifier that predicts whether or not credit card transactions are fraudulent may be more sensitive to false negatives than to false positives. To promote customer satisfaction, the credit card company may prefer to risk verifying legitimate transactions rather than risk ignoring a fraudulent transaction. Because most transactions are legitimate, accuracy is not an appropriate metric for this problem. A classifier that always predicts that transactions are legitimate could have a high accuracy score, but would not be useful. For these reasons, classifiers are often evaluated using two additional measures called precision and recall.
Precision and recall
Precision is the fraction of positive predictions that are correct, while recall is the fraction of truly positive instances that the classifier identifies:

$$ P = \frac{TP}{TP+FP} $$

$$ R = \frac{TP}{TP+FN} $$

Individually, precision and recall are seldom informative; they are both incomplete views of a classifier's performance. Both precision and recall can fail to distinguish classifiers that perform well from certain types of classifiers that perform poorly. A trivial classifier could easily achieve a perfect recall score by predicting positive for every instance.
In [15]:
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
Calculating the F1 measure
$$ F1 = 2\frac{PR}{P + R} $$

The F1 measure penalizes classifiers with imbalanced precision and recall scores, like the trivial classifier that always predicts the positive class. A model with perfect precision and recall scores will achieve an F1 score of one. A model with a perfect precision score and a recall score of zero will achieve an F1 score of zero. As with precision and recall, scikit-learn provides a function to calculate the F1 score for a set of predictions.
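A quick worked check of the formula (with made-up precision and recall values, not scores from our classifier):
In [ ]:
def f1(precision, recall):
    # F1 is the harmonic mean of precision and recall
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(1.0, 1.0))  # perfect precision and recall -> 1.0
print(f1(1.0, 0.0))  # perfect precision but zero recall -> 0.0
print(f1(0.9, 0.5))  # imbalanced scores are penalized -> about 0.64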
In [16]:
print(f1_score(y_test, y_pred))
ROC AUC
A Receiver Operating Characteristic, or ROC curve, visualizes a classifier's performance. Unlike accuracy, the ROC curve is insensitive to data sets with unbalanced class proportions; unlike precision and recall, the ROC curve illustrates the classifier's performance for all values of the discrimination threshold. ROC curves plot the classifier's recall against its fall-out. Fall-out, or the false positive rate, is the number of false positives divided by the total number of negatives. It is calculated using the following formula:
$$ F = \frac{FP}{TN+FP} $$

AUC is the area under the ROC curve; it reduces the ROC curve to a single value, which represents the expected performance of the classifier.
In [25]:
# use the predicted probability of the positive (spam) class so the curve
# covers all values of the discrimination threshold
y_score = classifier.predict_proba(X_test)[:, 1]
false_positive_rate, recall, thresholds = roc_curve(y_test, y_score)
print(false_positive_rate)
roc_auc = auc(false_positive_rate, recall)
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' %roc_auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()
From the ROC AUC plot, it is apparent that our classifier outperforms random guessing; most of the plot area lies under its curve.
Hyperparameters are parameters of the model that are not learned. For example, hyperparameters of our logistic regression SMS classifier include the value of the regularization term and thresholds used to remove words that appear too frequently or infrequently. In scikit-learn, hyperparameters are set through the model's constructor. In the previous examples, we did not set any arguments for LogisticRegression(); we used the default values for all of the hyperparameters. These default values are often a good start, but they may not produce the optimal model.
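For example, here is a minimal sketch (with illustrative, untuned values) of setting hyperparameters through the constructor:
In [ ]:
from sklearn.linear_model import LogisticRegression

# hyperparameters are passed to the constructor rather than learned from the data;
# these particular values are for illustration only, not tuned choices
example_classifier = LogisticRegression(C=0.1, penalty='l2')
print(example_classifier.get_params()['C'])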
Grid search is a common method to select the hyperparameter values that produce the best model. Grid search takes a set of possible values for each hyperparameter that should be tuned, and evaluates a model trained on each element of the Cartesian product of the sets. That is, grid search is an exhaustive search that trains and evaluates a model for each possible combination of the hyperparameter values supplied by the developer. A disadvantage of grid search is that it is computationally costly for even small sets of hyperparameter values. Fortunately, it is an embarrassingly parallel problem; many models can easily be trained and evaluated concurrently since no synchronization is required between the processes. Let's use scikit-learn's GridSearchCV to find better hyperparameter values.
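To get a sense of the computational cost, the following back-of-the-envelope sketch counts the fits required by the grid defined in the next cell:
In [ ]:
# number of values for each tuned hyperparameter in the grid below:
# max_df (3), stop_words (2), max_features (4), ngram_range (2),
# use_idf (2), norm (2), penalty (2), C (4)
n_candidates = 3 * 2 * 4 * 2 * 2 * 2 * 2 * 4
print(n_candidates)      # 1536 parameter combinations
print(n_candidates * 3)  # 4608 model fits with 3-fold cross-validation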
In [30]:
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    # the liblinear solver supports both the l1 and l2 penalties searched below
    ('clf', LogisticRegression(solver='liblinear'))
])
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, 10000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2'),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
grid_search.fit(X_train_raw, y_train)
In [38]:
print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))
predictions = grid_search.predict(X_test_raw)
predictions = [0 if i == 'ham' else 1 for i in predictions]
print('Accuracy:', accuracy_score(y_test, predictions))
print('Precision:', precision_score(y_test, predictions))
print('Recall:', recall_score(y_test, predictions))