In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
In [2]:
import warnings
warnings.filterwarnings('ignore')
In [3]:
plt.rcParams['figure.figsize'] = 9, 6
In [4]:
from sklearn.datasets import load_breast_cancer
In [5]:
data = load_breast_cancer()
label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']
In [6]:
label_names
Out[6]:
In [7]:
labels[-10:]
Out[7]:
In [8]:
feature_names
Out[8]:
In [9]:
features
Out[9]:
In [10]:
df = pd.DataFrame(features)
df.columns = feature_names
df['target'] = labels
df.head()
Out[10]:
In [12]:
# %%time
# seaborn.pairplot(df, hue="target")
# plt.savefig('img/breast_cancer_pairplot.png')
## CPU times: user 4min 46s, sys: 3min 59s, total: 8min 46s
## Wall time: 4min 13s
In [13]:
from sklearn.model_selection import train_test_split
train, test, train_labels, test_labels = train_test_split(features,
                                                          labels,
                                                          test_size=0.33,
                                                          random_state=42)
A model must never be trained and tested on the same data. We would risk concluding that the model is much better than it really is, because it may simply have memorized the training examples.
In [14]:
# let's train a classifier
from sklearn.ensemble import RandomForestClassifier
cls = RandomForestClassifier()
model = cls.fit(train, train_labels)
Now we have a model trained on the training data. What happens if we try to predict on those same training data?
In [15]:
cls.score(train, train_labels)
Out[15]:
In [16]:
from sklearn.metrics import accuracy_score
train_preds = model.predict(train)
accuracy_score(train_labels, train_preds)
Out[16]:
This classification accuracy is measured on the data the model used for learning. It tells us nothing about how the model will behave when shown data it has never seen. That is exactly why we created the test set: the test data were never used for training, so the accuracy on them should give us a more realistic picture of the model's quality.
In [17]:
cls.score(test, test_labels)
Out[17]:
In [18]:
preds = model.predict(test)
accuracy_score(test_labels, preds)
Out[18]:
Accuracy on the test set is usually lower than on the training set. After all, the model has never seen the test data, so it can rely only on the hidden relationships it actually learned.
A large gap between the training and test error may mean the model is overfitted: it learned the data by heart instead of the general hidden rules that also hold for data it has never seen.
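The gap can be shown directly by comparing a deliberately unconstrained model with a regularized one. A minimal sketch (not part of the original notebook), using decision trees on the same split:
In [ ]:
# Sketch: an unconstrained tree can memorize the training set (train score near 1.0),
# while a depth-limited tree usually shows a smaller train/test gap.
from sklearn.tree import DecisionTreeClassifier
deep = DecisionTreeClassifier(random_state=0).fit(train, train_labels)
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(train, train_labels)
print('deep:    train', deep.score(train, train_labels), 'test', deep.score(test, test_labels))
print('shallow: train', shallow.score(train, train_labels), 'test', shallow.score(test, test_labels))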
In [19]:
from sklearn.model_selection import cross_val_score
cls = RandomForestClassifier()
scores = cross_val_score(cls, features, labels, cv=5)
print(scores)
print(scores.mean())
In [20]:
plt.boxplot(scores)
Out[20]:
In [21]:
%%time
from sklearn.model_selection import LeaveOneOut
cls = RandomForestClassifier()
scores = cross_val_score(cls, features, labels, cv=LeaveOneOut())
print(scores)
print(scores.mean())
Leave-one-out uses the maximum possible amount of data for training while still testing on every sample. The individual scores are meaningless on their own, though; only their mean is informative.
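Each leave-one-out fold tests on a single sample, so every individual score is either 0.0 or 1.0. A quick check (sketch):
In [ ]:
# Sketch: with one test sample per fold, a score can only be 0 (wrong) or 1 (correct).
print(np.unique(scores))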
So which of the cross-validated models should go to production? None of them. Use cross-validation to find the best setup (choice of algorithm, features, and hyperparameters). Once that is done, train a brand-new model on all the data you have and put that into production.
Unless you need to estimate in advance how well the model will do on unseen data. In that case you will need a test sample that is never used for training.
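A minimal sketch of that "retrain on everything" step, assuming the cross-validated RandomForestClassifier above turned out to be the best setup:
In [ ]:
# Sketch: after model selection, retrain the chosen setup on ALL available data;
# this is the model that goes to production.
final_model = RandomForestClassifier().fit(features, labels)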
There are very many different metrics for evaluating classification/regression/clustering and so on. Many of them come ready-made in the scikit-learn library.
http://scikit-learn.org/stable/modules/model_evaluation.html
Today we will focus mainly on classification, but all of these concepts also apply to other data-analysis tasks, just with the corresponding evaluation metrics.
Here is a nice figure from Wikipedia that also shows the various metrics that can be computed from a confusion matrix.
Image source: https://en.wikipedia.org/wiki/Precision_and_recall
In [57]:
from sklearn.ensemble import RandomForestClassifier
cls = RandomForestClassifier()
model = cls.fit(train, train_labels)
cls.score(test, test_labels)  # this computes the accuracy metric
Out[57]:
In [58]:
preds = model.predict(test)
accuracy_score(test_labels, preds)
Out[58]:
Accuracy - how many observations you labeled correctly $$accuracy = \frac{TP + TN}{ALL}$$
In [59]:
from sklearn.metrics import confusion_matrix
confusion_matrix(test_labels, preds)
Out[59]:
In [61]:
(64 + 114) / (64 + 3 + 7 + 114)  # accuracy computed by hand from the confusion matrix above
Out[61]:
In [62]:
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
# imbalanced binary dataset (95 % / 5 %)
X, y = make_classification(n_classes=2, class_sep=1, weights=[0.95, 0.05],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=5000, random_state=10)
pca = PCA(n_components=2)
X_vis = pca.fit_transform(X)
palette = seaborn.color_palette()
plt.scatter(X_vis[y == 0, 0], X_vis[y == 0, 1], label="Class #0", alpha=0.5,
            facecolor=palette[0], linewidth=0.15)
plt.scatter(X_vis[y == 1, 0], X_vis[y == 1, 1], label="Class #1", alpha=0.5,
            facecolor=palette[2], linewidth=0.15)
Out[62]:
In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
In [28]:
from sklearn.neighbors import KNeighborsClassifier
cls = KNeighborsClassifier(3)
model = cls.fit(X_train, y_train)
y_pred = cls.predict(X_test)
accuracy_score(y_test, y_pred)
Out[28]:
In [29]:
accuracy_score(y_test, np.zeros(len(y_pred)))  # baseline: always predict the majority class (0)
Out[29]:
Precision - of the observations you labeled as the positive class, how many really were positive $$precision = \frac{TP}{TP + FP}$$
Recall - of all observations with the positive class, how many you managed to label as positive $$recall = \frac{TP}{TP + FN}$$
In [30]:
from sklearn.metrics import precision_score
print(precision_score(y_test, y_pred))
print(precision_score(y_test, np.zeros(len(y_pred))))
In [31]:
from sklearn.metrics import recall_score
print(recall_score(y_test, y_pred))
print(recall_score(y_test, np.zeros(len(y_pred))))
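The same numbers can be recomputed by hand from the confusion matrix (a quick sketch; class 1 is the positive class):
In [ ]:
# Sketch: read TN, FP, FN, TP off the binary confusion matrix
# and recompute precision and recall manually.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('precision:', tp / (tp + fp))
print('recall:   ', tp / (tp + fn))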
In [32]:
from sklearn.metrics import f1_score
print(f1_score(y_test, y_pred))
print(f1_score(y_test, np.zeros(len(y_pred))))
Recall is computed analogously; the F1 score then combines the two as their harmonic mean.
If you want to present a single number, an average is computed. The weights assigned to the individual classes determine the properties of the resulting number (which matters for imbalanced datasets).
In [33]:
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
X, y = make_classification(n_classes=3, class_sep=2, weights=[0.9, 0.06, 0.04],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=5000, random_state=10)
pca = PCA(n_components=2)
X_vis = pca.fit_transform(X)
palette = seaborn.color_palette()
plt.scatter(X_vis[y == 0, 0], X_vis[y == 0, 1], label="Class #0", alpha=0.5,
            facecolor=palette[0], linewidth=0.15)
plt.scatter(X_vis[y == 1, 0], X_vis[y == 1, 1], label="Class #1", alpha=0.5,
            facecolor=palette[2], linewidth=0.15)
plt.scatter(X_vis[y == 2, 0], X_vis[y == 2, 1], label="Class #2", alpha=0.5,
            facecolor=palette[3], linewidth=0.15)
Out[33]:
In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
In [35]:
from sklearn.neighbors import KNeighborsClassifier
cls = KNeighborsClassifier(3)
model = cls.fit(X_train, y_train)
y_pred = cls.predict(X_test)
accuracy_score(y_test, y_pred)
Out[35]:
In [36]:
confusion_matrix(y_test, y_pred)
Out[36]:
In [37]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, digits=5))
In [38]:
print(precision_score(y_test, y_pred, average='weighted'))  # weights each class by its number of observations
print(precision_score(y_test, y_pred, average='micro'))  # uses the global counts of TP, FN and FP
print(precision_score(y_test, y_pred, average='macro'))  # computes precision per class and takes their unweighted mean
If we have an imbalanced dataset and we also care about the rare classes, macro averaging is very useful.
The same averaging schemes exist for other metrics such as Recall or F1.
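As a quick sanity check (sketch), macro precision is just the unweighted mean of the per-class precisions:
In [ ]:
# Sketch: average=None returns one precision per class;
# their plain mean equals average='macro'.
per_class = precision_score(y_test, y_pred, average=None)
print(per_class, per_class.mean())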
There are very many metrics and different domains use different ones. We don't have room to go through them all now; the scikit-learn model-evaluation page linked above lists many of them.
Different classifiers have different parameters that we can tune to improve their performance on a specific dataset. These are called hyperparameters.
We can try setting them manually and intuitively, or we can go at it with brute force.
In [39]:
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
n_samples, n_features = X.shape
# Add some noise features to the data to make things more interesting
random_state = np.random.RandomState(0)
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]
In [40]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier
cv_params = {'max_depth': [1,2,3,4] + list(range(5,10,2)), 'criterion': ['gini', 'entropy'], 'min_samples_leaf': [1, 3] }
ind_params = {'random_state': 0}
optimization = GridSearchCV(clf(**ind_params),
                            cv_params,
                            scoring='f1_macro', cv=5, n_jobs=-1, verbose=True)
In [41]:
iris.data.shape  # only 4 original features, so a tree depth of 4 should be enough. Let's see what happens
Out[41]:
In [42]:
X.shape  # let's see what effect the noise features have on performance as a function of tree depth
Out[42]:
In [43]:
# Before we run it, let's first look at all the combinations it will try
from sklearn.model_selection import ParameterGrid
list(ParameterGrid(cv_params))
Out[43]:
In [44]:
%%time
optimization.fit(X, y)
Out[44]:
In [45]:
pd.DataFrame(optimization.cv_results_)
Out[45]:
In [46]:
pd.DataFrame(optimization.cv_results_).sort_values('mean_test_score', ascending=False)
Out[46]:
In [47]:
list(filter(lambda x: 'best' in x, dir(optimization)))
Out[47]:
In [48]:
optimization.best_estimator_
Out[48]:
By searching for the best accuracy on the test sample I am adding another level of training, and I risk optimizing for the test data (see leaderboard optimization). If I care about the expected accuracy on data the model has never seen, I should keep one more (validation) sample on which I evaluate only the very final model. The accuracy on that sample is the accuracy I can expect in production, on data the model has never seen.
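A minimal sketch of that setup (the split names are illustrative), reusing the parameter grid from above:
In [ ]:
# Sketch: hold out a final validation set, tune on the rest with cross-validation,
# then score the best model ONCE on data no part of the tuning ever saw.
X_dev, X_holdout, y_dev, y_holdout = train_test_split(X, y, test_size=0.2, random_state=0)
search = GridSearchCV(DecisionTreeClassifier(random_state=0), cv_params,
                      scoring='f1_macro', cv=5)
search.fit(X_dev, y_dev)
search.best_estimator_.score(X_holdout, y_holdout)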
Bias is the error caused by approximating a complex problem with a simpler model - something is missing from the model.
Variance tells us how much the model would change if we used a different training set.
We cannot separate these two errors from each other, but we try to find the point where their sum is at a minimum.
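One way to see the trade-off is a validation curve over model complexity. A hedged sketch using scikit-learn's validation_curve on the noisy iris data from above:
In [ ]:
# Sketch: training vs. cross-validation score as tree depth grows.
# Shallow trees tend toward high bias (both scores low); deep trees tend toward
# high variance (the train score rises while the cross-validation score falls off).
from sklearn.model_selection import validation_curve
depths = range(1, 11)
train_scores, valid_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name='max_depth', param_range=depths, cv=5)
plt.plot(depths, train_scores.mean(axis=1), label='train')
plt.plot(depths, valid_scores.mean(axis=1), label='cross-validation')
plt.xlabel('max_depth')
plt.legend()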
Image sources:
In [ ]: