Lesson

  • Practice using the Accuracy and LogLoss metrics on a classification problem.
  • Practice generating a confusion matrix and a classification report.
  • Practice using the RMSE and RSquared metrics on a regression problem.

In [1]:
# Cross Validation Classification LogLoss
import pandas as pd
from sklearn import cross_validation  # deprecated in 0.18 and removed in 0.20; current releases use sklearn.model_selection
from sklearn.linear_model import LogisticRegression

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv(url, names=names)

# first eight columns are the predictors; the ninth is the 0/1 class label
X = df.values[:, 0:8]
y = df.values[:, 8]

In [2]:
# model config
num_folds = 10
num_instances = len(X)
seed = 7

estimator = LogisticRegression()
# scoring = 'log_loss'  # alternative metric; named 'neg_log_loss' in modern scikit-learn
scoring = 'accuracy'
# note: random_state only takes effect when shuffle=True, so these folds are sequential
kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)

In [3]:
results = cross_validation.cross_val_score(estimator, X, y, cv=kfold, scoring=scoring)

In [4]:
accuracy = [results.mean(), results.std()]
print("{}:\n\tMean\t{:5.2}\n\tStd.\t{:5.2}".format(scoring, accuracy[0], accuracy[1]))


accuracy:
	Mean	 0.77
	Std.	0.048

In [5]:
# per-fold scores from the cross-validation run
results


Out[5]:
array([ 0.7012987 ,  0.81818182,  0.74025974,  0.71428571,  0.77922078,
        0.75324675,  0.85714286,  0.80519481,  0.72368421,  0.80263158])
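
The commented-out scoring = 'log_loss' line in the model config switches the metric from accuracy to log loss. As a minimal sketch, here is the same evaluation written against modern scikit-learn, where sklearn.cross_validation became sklearn.model_selection and the scorer is named 'neg_log_loss' (negated so that larger is always better); the solver='liblinear' argument matches the old LogisticRegression default:

# Sketch (modern scikit-learn >= 0.18): cross-validated log loss.
# Assumes X and y are the arrays loaded above.
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

ll_folds = KFold(n_splits=10)                      # sequential folds, as above
ll_model = LogisticRegression(solver='liblinear')  # the old default solver
ll_scores = cross_val_score(ll_model, X, y, cv=ll_folds, scoring='neg_log_loss')
print("log_loss:\n\tMean\t{:5.3}\n\tStd.\t{:5.3}".format(-ll_scores.mean(), ll_scores.std()))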

In [6]:
from sklearn.metrics import confusion_matrix

# fit on the full data set and score it against its own training data
# (an optimistic, in-sample estimate rather than a held-out one)
model = estimator.fit(X, y)
cm = confusion_matrix(y, model.predict(X))

In [7]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

cm_df = pd.DataFrame(cm)  # rows are true classes, columns are predicted classes
sns.heatmap(cm_df, annot=True, fmt="d")


Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0xb0707f0>
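
The objectives also call for a classification report. As a minimal sketch against the same in-sample predictions used for the confusion matrix (the target_names labels are illustrative, assuming class 0 means no diabetes and class 1 means diabetes):

# Sketch: per-class precision, recall, F1, and support for the fitted model.
# Reuses the model and predictions from the confusion-matrix cell above.
from sklearn.metrics import classification_report

print(classification_report(y, model.predict(X),
                            target_names=['no diabetes', 'diabetes']))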

In [8]:
help(cross_validation.cross_val_score)


Help on function cross_val_score in module sklearn.cross_validation:

cross_val_score(estimator, X, y=None, scoring=None, cv=None, n_jobs=1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs')
    Evaluate a score by cross-validation
    
    Read more in the :ref:`User Guide <cross_validation>`.
    
    Parameters
    ----------
    estimator : estimator object implementing 'fit'
        The object to use to fit the data.
    
    X : array-like
        The data to fit. Can be, for example a list, or an array at least 2d.
    
    y : array-like, optional, default: None
        The target variable to try to predict in the case of
        supervised learning.
    
    scoring : string, callable or None, optional, default: None
        A string (see model evaluation documentation) or
        a scorer callable object / function with signature
        ``scorer(estimator, X, y)``.
    
    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
    
        - None, to use the default 3-fold cross-validation,
        - integer, to specify the number of folds.
        - An object to be used as a cross-validation generator.
        - An iterable yielding train/test splits.
    
        For integer/None inputs, if the estimator is a classifier and ``y``
        is either binary or multiclass, :class:`StratifiedKFold` is used.
        In all other cases, :class:`KFold` is used.
    
        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validation strategies that can be used here.
    
    n_jobs : integer, optional
        The number of CPUs to use to do the computation. -1 means
        'all CPUs'.
    
    verbose : integer, optional
        The verbosity level.
    
    fit_params : dict, optional
        Parameters to pass to the fit method of the estimator.
    
    pre_dispatch : int, or string, optional
        Controls the number of jobs that get dispatched during parallel
        execution. Reducing this number can be useful to avoid an
        explosion of memory consumption when more jobs get dispatched
        than CPUs can process. This parameter can be:
    
            - None, in which case all the jobs are immediately
              created and spawned. Use this for lightweight and
              fast-running jobs, to avoid delays due to on-demand
              spawning of the jobs
    
            - An int, giving the exact number of total jobs that are
              spawned
    
            - A string, giving an expression as a function of n_jobs,
              as in '2*n_jobs'
    
    Returns
    -------
    scores : array of float, shape=(len(list(cv)),)
        Array of scores of the estimator for each run of the cross validation.
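
The third objective, RMSE and RSquared on a regression problem, is not exercised above. As a minimal sketch with modern scikit-learn, using the bundled diabetes regression data set as a stand-in (any X, y regression pair would do): scikit-learn exposes MSE as the negated score 'neg_mean_squared_error', so RMSE is recovered by negating and taking the square root, while RSquared is available directly as the 'r2' scorer.

# Sketch (modern scikit-learn): cross-validated RMSE and R^2 on a regression problem.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X_reg, y_reg = load_diabetes(return_X_y=True)  # stand-in regression data
reg_folds = KFold(n_splits=10)
reg_model = LinearRegression()

# 'neg_mean_squared_error' returns negated MSE; negate and take the
# square root to recover RMSE
mse_scores = cross_val_score(reg_model, X_reg, y_reg, cv=reg_folds,
                             scoring='neg_mean_squared_error')
rmse = np.sqrt(-mse_scores)
r2_scores = cross_val_score(reg_model, X_reg, y_reg, cv=reg_folds, scoring='r2')

print("RMSE:\n\tMean\t{:5.4}\n\tStd.\t{:5.4}".format(rmse.mean(), rmse.std()))
print("r2:\n\tMean\t{:5.2}\n\tStd.\t{:5.2}".format(r2_scores.mean(), r2_scores.std()))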