Data originally from Kaggle datasets: Pokemon with stats.
From Kaggle's description:
This data set includes 721 Pokemon, including their number, name, first and second type, and basic stats.
These are the raw attributes that are used for calculating how much damage an attack will do in the games.
Apart from the usual imports (i.e. numpy, pandas and matplotlib), we will be importing:
- accuracy_score, precision_score and recall_score from sklearn's metrics module, for model evaluation
- train_test_split, a method from sklearn.model_selection that conveniently partitions the raw data into training and test sets
- DecisionTreeClassifier from sklearn.tree, a classifier that we will use to exemplify overfitting with and without a training and test split.
In [1]:
import numpy as np
import pandas as pd
%matplotlib inline
from matplotlib import pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
We will import the Pokemon data from the data folder, placed in the root directory for the unit, using the read_csv method from pandas.
In [2]:
data = pd.read_csv('../data/pokemon.csv')
For convenience, we will rename all columns to upper case, so we don't have to remember what is upper or lower case in the future.
In [3]:
data.columns = data.columns.str.upper()
We will also change the index of the dataframe to be the Pokemon's name, and use a regular expression to strip the redundant base name that the raw file prepends to alternate forms (Mega, Primal, and so on).
In [4]:
data = data.set_index('NAME')
# Strip the redundant base name from alternate forms. regex=True is required
# in recent pandas, where str.replace defaults to literal matching.
data.index = data.index.str.replace(".*(?=Mega|Primal|Origin|Therian|Land|Incarnate)", "", regex=True)
# Drop the numeric ID column, which we won't need.
data = data.drop(['#'], axis=1)
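As a quick, illustrative check of what the regular expression does (assuming the raw Kaggle names concatenate the base species with the form, e.g. 'VenusaurMega Venusaur'):
In [ ]:
import re
# Everything before the form keyword is matched by the lookahead pattern
# and removed, leaving only the alternate-form name.
re.sub(r".*(?=Mega|Primal|Origin|Therian|Land|Incarnate)", "", "VenusaurMega Venusaur")
# -> 'Mega Venusaur'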
We are ready to take a look at the data for the first time!
Instead of simply calling head on the dataframe, let's precede it with sort_values, to get the Top 3 most powerful Pokemon in the dataset.
In [5]:
most_powerful = data.sort_values('TOTAL', ascending=False)
most_powerful.head(n=3)
Out[5]:
If you are in doubt, the top entry is Mega Rayquaza. But what about the most powerful Pokemon by type (Type 1)?
In [6]:
most_powerful_by_type = most_powerful.drop_duplicates(subset=['TYPE 1'], keep='first')
most_powerful_by_type
Out[6]:
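As a side note, the same table can be built without relying on the sort order, using groupby and idxmax; a minimal sketch (ties are resolved by whichever row idxmax encounters first):
In [ ]:
# For each primary type, look up the index label (the Pokemon's name) of the
# row with the highest TOTAL, then select those rows.
data.loc[data.groupby('TYPE 1')['TOTAL'].idxmax()]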
We will start by selecting the features we want to use and the label.
We will try to predict the Pokemon type using three main features to represent the Pokemon entity:
In [7]:
columns = ['ATTACK', 'DEFENSE', 'SPEED', 'TYPE 1']
data_clf = data[columns]
Now, let's briefly describe the raw dataset.
In [8]:
print("The dataset contains %s rows and %s columns." % data.shape)
print("The dataset columns are: %s."% data.columns.values)
data_clf.describe()
Out[8]:
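Before modeling, it also helps to know how many classes we are predicting and how balanced they are, since that sets the baseline against which accuracy should be read. A quick, optional look:
In [ ]:
# Count the distinct primary types and show the most frequent ones.
# value_counts sorts by frequency in descending order.
print("Number of distinct types: %s." % data_clf['TYPE 1'].nunique())
print(data_clf['TYPE 1'].value_counts().head(n=5))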
Time to separate our features from the labels; then we're ready to train a simple model.
In [9]:
X = data_clf.drop(['TYPE 1'], axis=1)
y = data_clf['TYPE 1']
We want to use a Decision Tree classifier.
In [11]:
clf = DecisionTreeClassifier(random_state=0)
Using the same dataset for training and testing our model, we get a remarkable accuracy score!
In [12]:
model_using_all_data = clf.fit(X, y)
y_pred = model_using_all_data.predict(X)
accuracy_using_all_data = accuracy_score(y, y_pred)
print("The accuracy score of the model is: %s." % accuracy_using_all_data)
results_using_all_data = data.copy()  # copy, so we don't mutate the original dataframe
results_using_all_data['PREDICTED'] = y_pred
failures = results_using_all_data['TYPE 1'] != results_using_all_data['PREDICTED']
results_using_all_data[failures].head(n=10)
Out[12]:
We're going to set aside 20% of the data for testing, which we will not use to train our model.
In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print("The training dataset contains %s rows and %s columns." % X_train.shape)
print("The test dataset contains %s rows and %s columns." % X_test.shape)
Now, we will use one partition for training and the other for testing or evaluating model performance on previously unseen data.
In [14]:
model_using_test_data = clf.fit(X_train, y_train)
y_pred = model_using_test_data.predict(X_test)
accuracy_using_test_data = accuracy_score(y_test, y_pred)
print("The accuracy score of the model for previously unseen data is: %s." % accuracy_using_test_data)
Even when using a test dataset, there is still a risk of overfitting on the test set: if we tweak the model repeatedly to improve its test score, knowledge about the test set can "leak" into the model. To avoid this, we may want to hold out yet another partition, a validation set.
By partitioning our data into three sets, however, we drastically reduce the number of samples that can be used for training. Oftentimes, this is not affordable.
In these cases, the common approach is to use cross-validation, something you will learn about in the upcoming units.
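As a preview, a minimal cross-validation sketch using cross_val_score from sklearn.model_selection: each of the 5 folds serves once as the held-out set, so every sample is used for both training and evaluation.
In [ ]:
from sklearn.model_selection import cross_val_score

# Train and evaluate the tree 5 times, holding out a different fifth of the
# data each time, and report the mean and spread of the accuracy scores.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("Cross-validated accuracy: %s (+/- %s)." % (scores.mean(), scores.std()))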