For most classification problems, it’s nice to have a simple, fast method to provide a quick baseline classification. If the simple and fast method is sufficient, then we don’t have to waste CPU cycles on more complex models. If not, we can use the results of the simple method to give us clues about our data.
One good method to keep in mind is Gaussian Naive Bayes (sklearn.naive_bayes.GaussianNB).
Gaussian Naive Bayes fits a Gaussian distribution to each class independently on each feature, and uses this to quickly give a rough classification. It is generally not sufficiently accurate for real-world data, but can perform surprisingly well, for instance on text data.
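To make "a Gaussian per class and per feature" concrete, here is a minimal sketch of the classification rule (a simplified illustration only; scikit-learn's actual implementation additionally smooths the variances):
In [ ]:
import numpy as np

def gaussian_nb_predict(X_train, y_train, X_test):
    classes = np.unique(y_train)
    log_posteriors = []
    for c in classes:
        Xc = X_train[y_train == c]
        # One mean and variance per feature, estimated from this class only
        mean, var = Xc.mean(axis=0), Xc.var(axis=0) + 1e-9  # avoid division by zero
        log_prior = np.log(len(Xc) / len(X_train))
        # "Naive" independence: sum the per-feature Gaussian log-densities
        log_like = -0.5 * (np.log(2 * np.pi * var)
                           + (X_test - mean) ** 2 / var).sum(axis=1)
        log_posteriors.append(log_prior + log_like)
    # Predict the class with the largest posterior log-probability
    return classes[np.argmax(log_posteriors, axis=0)]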
In [1]:
from sklearn.datasets import load_digits
digits = load_digits()
In [2]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
In [3]:
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target)
In [6]:
print(len(X_train), len(X_test), len(y_train), len(y_test))
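By default, train_test_split holds out 25% of the samples, so the 1,797 digits are split into 1,347 training and 450 test samples. If a reproducible split is wanted, the random_state parameter can be fixed, for example:
In [ ]:
# Passing random_state makes the split (and thus the scores below) reproducible
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)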
In [7]:
clf = GaussianNB()
clf.fit(X_train, y_train)
Out[7]:
GaussianNB()
In [8]:
predicted = clf.predict(X_test)
expected = y_test
print(predicted)
print(expected)
In [9]:
matches = (predicted == expected)
print(matches)
In [18]:
print(matches.sum())
print(len(matches))
match_rate = matches.sum() / len(matches)
print(match_rate)
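The same fraction can be computed directly with the accuracy_score helper from sklearn.metrics:
In [ ]:
from sklearn.metrics import accuracy_score

# Fraction of test labels that were predicted correctly
print(accuracy_score(expected, predicted))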
We see that more than 80% of the 450 test predictions match the expected labels. But there are other, more sophisticated metrics for judging the performance of a classifier: several are available in the sklearn.metrics submodule.
One of the most useful metrics is the classification_report, which combines several measures and prints a table with the results:
In [20]:
from sklearn import metrics
print(metrics.classification_report(expected, predicted))
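For each digit class, the report lists the precision (the fraction of predictions for that class that were correct), the recall (the fraction of true instances of that class that were recovered), their harmonic mean (f1-score), and the number of test samples in the class (support).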
Another enlightening metric for this sort of multiclass classification is the confusion matrix: it helps us visualize which labels are being interchanged in the classification errors:
In [21]:
print(metrics.confusion_matrix(expected, predicted))
The rows and columns of the matrix above both correspond to the digit labels 0 through 9: rows are the true labels and columns are the predicted labels.
You can see that there is no confusion for the digit 0: all 41 of its instances were classified correctly. For the digit 1 there is some confusion: 31 instances were identified correctly, but it was confused with 6 (4 times), 8 (7 times), and 9 (1 time), and so on.
We see here that the digits 1, 2, 3, and 9 are often being mislabeled as 8.
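The raw counts are easier to read as a picture. Assuming matplotlib is installed (and a scikit-learn version recent enough, 1.0 or later, to provide ConfusionMatrixDisplay.from_predictions), the matrix can be plotted directly:
In [ ]:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Darker off-diagonal cells mark the label pairs that get confused
ConfusionMatrixDisplay.from_predictions(expected, predicted)
plt.show()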