In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [2]:
from sklearn.datasets import load_breast_cancer
In [3]:
cancer = load_breast_cancer()
The data set is returned as a dictionary-like object:
In [4]:
cancer.keys()
Out[4]:
We can pull information and arrays out of this dictionary to set up our data frame and to understand the features:
In [5]:
print(cancer['DESCR'])
In [6]:
cancer['feature_names']
Out[6]:
In [7]:
df_feat = pd.DataFrame(cancer['data'],columns=cancer['feature_names'])
df_feat.info()
In [8]:
cancer['target']
Out[8]:
In [9]:
df_target = pd.DataFrame(cancer['target'],columns=['Cancer'])
In [29]:
df_target.head()
Out[29]:
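The column name 'Cancer' does not say which integer stands for which diagnosis. A small check, assuming the 'target_names' entry that ships with this data set, maps the integer codes to their labels and counts the samples per class. In scikit-learn's copy of the data, 0 is malignant and 1 is benign, so a 1 in the 'Cancer' column means benign. The name class_counts is only illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# 'target_names' holds the label that each integer code in 'target' stands for
label_names = cancer['target_names']

# Count how many samples carry each integer code
codes, counts = np.unique(cancer['target'], return_counts=True)
class_counts = {str(label_names[c]): int(n) for c, n in zip(codes, counts)}

print(class_counts)
```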
In [11]:
from sklearn.model_selection import train_test_split
In [12]:
X_train, X_test, y_train, y_test = train_test_split(df_feat, np.ravel(df_target), test_size=0.30, random_state=101)
In [13]:
from sklearn.svm import SVC
In [14]:
model = SVC()
In [15]:
model.fit(X_train,y_train)
Out[15]:
In [16]:
predictions = model.predict(X_test)
In [17]:
from sklearn.metrics import classification_report,confusion_matrix
In [18]:
print(confusion_matrix(y_test,predictions))
In [19]:
print(classification_report(y_test,predictions))
Notice that we are classifying everything into a single class! This means our model needs to have its parameters adjusted (it may also help to normalize the data).
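As a rough sketch of the normalization idea mentioned above, the features can be standardized before they reach the SVC. This is one possible approach, not part of the original notebook: StandardScaler and make_pipeline are swapped in here as an assumption, and the names scaled_model and scaled_predictions are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer['data'], cancer['target'], test_size=0.30, random_state=101)

# The pipeline learns the scaling from the training data only,
# then applies the same scaling to whatever is passed to predict
scaled_model = make_pipeline(StandardScaler(), SVC())
scaled_model.fit(X_train, y_train)

scaled_predictions = scaled_model.predict(X_test)
print(classification_report(y_test, scaled_predictions))
```

Putting the scaler inside a pipeline keeps the test set out of the scaling statistics, which matters again once cross-validation is involved.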
Finding the right parameters (such as which C or gamma values to use) is a tricky task! Creating a 'grid' of parameters and trying out every possible combination is called a grid search. This method is common enough that Scikit-learn has it built in as GridSearchCV; the CV stands for cross-validation. GridSearchCV takes a model to train and a dictionary that describes the parameters that should be tried. In that dictionary, the keys are the parameter names and the values are the lists of settings to be tested.
In [20]:
param_grid = {'C': [0.1,1, 10, 100, 1000], 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel': ['rbf']}
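To see how many models a grid like this asks for, the combinations can be listed explicitly. The sketch below uses scikit-learn's ParameterGrid helper, which the original notebook does not use; it is included only to make the size of the search concrete.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}

# Expand the dictionary into one dict per parameter combination
combinations = list(ParameterGrid(param_grid))

print(len(combinations))  # 5 values of C x 5 values of gamma x 1 kernel = 25
print(combinations[0])
```

Each combination is trained once per cross-validation fold, so the number of fits grows quickly as values are added to the grid.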
In [21]:
from sklearn.model_selection import GridSearchCV
In [22]:
grid = GridSearchCV(SVC(),param_grid,refit=True,verbose=3)
What fit does is a bit more involved than usual. First, it loops over every parameter combination and scores each one with cross-validation to find the best combination. Once it has the best combination, it runs fit again on all the data passed to fit (without cross-validation) to build a single new model using the best parameter settings; this second step is what refit=True requests.
In [23]:
# May take a while!
grid.fit(X_train,y_train)
Out[23]:
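The two stages that fit goes through can be approximated by hand. The following is a minimal sketch, not GridSearchCV's actual implementation: it assumes 5-fold cross-validation and plain accuracy as the score, and the names best_score, best_params and final_model are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import ParameterGrid, cross_val_score, train_test_split
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer['data'], cancer['target'], test_size=0.30, random_state=101)

param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}

# Stage 1: score every combination with cross-validation on the training data
best_score, best_params = -np.inf, None
for params in ParameterGrid(param_grid):
    scores = cross_val_score(SVC(**params), X_train, y_train, cv=5)
    if scores.mean() > best_score:
        best_score, best_params = scores.mean(), params

# Stage 2: refit a single model on all of the training data
# using the best combination found in stage 1
final_model = SVC(**best_params).fit(X_train, y_train)

print(best_params, round(best_score, 3))
```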
You can inspect the best parameters found by GridSearchCV in the best_params_ attribute, and the best estimator in the best_estimator_ attribute:
In [24]:
grid.best_params_
Out[24]:
In [25]:
grid.best_estimator_
Out[25]:
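Beyond the single best combination, it can be useful to see how the other combinations scored. One way to do this, sketched here under the assumption that the default scoring and cross-validation settings are acceptable, is to load the fitted object's cv_results_ attribute into a data frame. The names results and top are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer['data'], cancer['target'], test_size=0.30, random_state=101)

param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}

grid = GridSearchCV(SVC(), param_grid, refit=True)
grid.fit(X_train, y_train)

# cv_results_ has one row per parameter combination
results = pd.DataFrame(grid.cv_results_)
columns = ['param_C', 'param_gamma', 'mean_test_score', 'std_test_score']
top = results.sort_values('rank_test_score')[columns].head()

print(top)
print(grid.best_score_)  # mean cross-validated score of the best combination
```

Combinations whose mean scores sit within one standard deviation of each other are hard to tell apart, which is worth keeping in mind before reading too much into the exact winner.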
Then you can re-run predictions on this grid object just like you would with a normal model.
In [26]:
grid_predictions = grid.predict(X_test)
In [27]:
print(confusion_matrix(y_test,grid_predictions))
In [28]:
print(classification_report(y_test,grid_predictions))