Predicting Pokemon Type Using Damage Stats

Data originally from Kaggle datasets: Pokemon with stats.

From Kaggle's description:

This data set includes 721 Pokemon, including their number, name, first and second type, and basic stats.

These are the raw attributes that are used for calculating how much damage an attack will do in the games.

1 Imports

Apart from the usual imports (i.e. numpy, pandas and matplotlib), we will be importing:

  • accuracy_score, precision_score and recall_score from sklearn's metrics module, for model evaluation
  • train_test_split, a method from sklearn.model_selection that conveniently partitions the raw data into training and test sets
  • DecisionTreeClassifier from sklearn.tree, a classifier that we will use to illustrate overfitting with and without a train/test split.

In [1]:
import numpy as np
import pandas as pd

%matplotlib inline
from matplotlib import pyplot as plt

from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

2 About the data

We will import the Pokemon data from the data folder, placed in the root directory for the unit, using the read_csv method from pandas.


In [2]:
data = pd.read_csv('../data/pokemon.csv')

For convenience, we will rename all columns to upper case, so we don't have to remember which columns use which casing later on.


In [3]:
data.columns = data.columns.str.upper()

We will also change the index of the dataframe to be the Pokemon's name, and we will use a regular expression to strip the duplicated base name that precedes alternate forms such as Mega and Primal, so the names look nicer.


In [4]:
data = data.set_index('NAME')
# Strip the duplicated base name that precedes alternate forms such as Mega and Primal.
data.index = data.index.str.replace(".*(?=Mega|Primal|Origin|Therian|Land|Incarnate)", "", regex=True)
# Drop the Pokedex number column, which we won't need.
data = data.drop(['#'], axis=1)

We are ready to take a look at the data for the first time!

Instead of simply calling head on the dataframe, let's precede it with sort_values, to get the Top 3 most powerful Pokemon in the dataset.


In [5]:
most_powerful = data.sort_values('TOTAL', ascending=False)
most_powerful.head(n=3)


Out[5]:
TYPE 1 TYPE 2 TOTAL HP ATTACK DEFENSE SP. ATK SP. DEF SPEED GENERATION LEGENDARY
NAME
Mega Rayquaza Dragon Flying 780 105 180 100 180 100 115 3 True
Mega Mewtwo Y Psychic NaN 780 106 150 70 194 120 140 1 True
Mega Mewtwo X Psychic Fighting 780 106 190 100 154 100 130 1 True

In case you are wondering, that top entry is Mega Rayquaza. But what about the most powerful Pokemon of each type (Type 1)?


In [6]:
most_powerful_by_type = most_powerful.drop_duplicates(subset=['TYPE 1'], keep='first')
most_powerful_by_type


Out[6]:
TYPE 1 TYPE 2 TOTAL HP ATTACK DEFENSE SP. ATK SP. DEF SPEED GENERATION LEGENDARY
NAME
Mega Rayquaza Dragon Flying 780 105 180 100 180 100 115 3 True
Mega Mewtwo Y Psychic NaN 780 106 150 70 194 120 140 1 True
Primal Kyogre Water NaN 770 100 150 90 180 160 90 3 True
Primal Groudon Ground Fire 770 100 180 160 150 90 90 3 True
Arceus Normal NaN 720 120 120 120 120 120 120 4 True
Mega Metagross Steel Psychic 700 80 145 150 105 110 110 3 False
Mega Tyranitar Rock Dark 700 100 164 150 95 120 71 2 False
Origin Forme Ghost Dragon 680 150 120 100 120 100 90 4 True
Ho-oh Fire Flying 680 106 130 90 110 154 90 2 True
Xerneas Fairy NaN 680 126 131 95 131 98 99 6 True
Yveltal Dark Flying 680 126 131 95 131 98 99 6 True
Mega Sceptile Grass Dragon 630 70 110 75 145 85 145 3 False
Mega Lucario Fighting Steel 625 70 145 88 140 70 112 4 False
Mega Ampharos Electric Dragon 610 90 95 105 165 110 45 2 False
Mega Scizor Bug Steel 600 70 150 140 65 100 75 2 False
Mega Glalie Ice NaN 580 80 120 80 120 80 100 3 False
Therian Forme Flying NaN 580 79 100 80 110 90 121 5 True
Crobat Poison Flying 535 85 90 80 70 80 130 2 False

3 Pre-processing data

We will start by selecting the features we want to use and the label.

We will try to predict the Pokemon type (TYPE 1) using three features to represent each Pokemon:

  • Attack
  • Defense
  • Speed

In [7]:
columns = ['ATTACK', 'DEFENSE', 'SPEED', 'TYPE 1']
data_clf = data[columns]

Now, let's briefly describe the raw dataset and summarize the selected features.


In [8]:
print("The dataset contains %s rows and %s columns." % data.shape)
print("The dataset columns are: %s."% data.columns.values)
data_clf.describe()


The dataset contains 800 rows and 11 columns.
The dataset columns are: ['TYPE 1' 'TYPE 2' 'TOTAL' 'HP' 'ATTACK' 'DEFENSE' 'SP. ATK' 'SP. DEF'
 'SPEED' 'GENERATION' 'LEGENDARY'].
Out[8]:
ATTACK DEFENSE SPEED
count 800.000000 800.000000 800.000000
mean 79.001250 73.842500 68.277500
std 32.457366 31.183501 29.060474
min 5.000000 5.000000 5.000000
25% 55.000000 50.000000 45.000000
50% 75.000000 70.000000 65.000000
75% 100.000000 90.000000 90.000000
max 190.000000 230.000000 180.000000

Time to separate our features from the label, and then we're ready to train a simple model.


In [9]:
X = data_clf.drop(['TYPE 1'], axis=1)
y = data_clf['TYPE 1']

4 Training and testing a model on a single dataset

We want to use a Decision Tree classifier.


In [11]:
clf = DecisionTreeClassifier(random_state=0)

Using the same dataset for training and testing our model, we get a remarkable accuracy score!


In [12]:
model_using_all_data = clf.fit(X, y)

y_pred = model_using_all_data.predict(X)

accuracy_using_all_data = accuracy_score(y, y_pred)
print("The accuracy score of the model is: %s." % accuracy_using_all_data)

results_using_all_data = data.copy()  # copy so we don't modify the original dataframe
results_using_all_data['PREDICTED'] = y_pred

failures = results_using_all_data['TYPE 1'] != results_using_all_data['PREDICTED']
results_using_all_data[failures].head(n=10)


The accuracy score of the model is: 0.965.
Out[12]:
TYPE 1 TYPE 2 TOTAL HP ATTACK DEFENSE SP. ATK SP. DEF SPEED GENERATION LEGENDARY PREDICTED
NAME
Golbat Poison Flying 455 75 80 70 65 75 90 1 False Normal
Poliwrath Water Fighting 510 90 95 95 70 90 70 1 False Fighting
Espeon Psychic NaN 525 65 65 60 130 95 110 2 False Ghost
Grovyle Grass NaN 405 50 65 45 85 65 95 3 False Bug
Lotad Water Grass 220 40 30 30 40 50 30 3 False Grass
Lombre Water Grass 340 60 50 50 60 70 50 3 False Ice
Ludicolo Water Grass 480 80 70 70 90 100 70 3 False Normal
Nuzleaf Grass Dark 340 70 70 40 60 40 60 3 False Fighting
Lairon Steel Rock 430 60 90 140 50 50 40 3 False Bug
Kecleon Normal NaN 440 60 90 70 60 120 40 3 False Bug

5 Using training and test sets

We're going to set aside 20% of the data for testing; this portion will not be used to train our model.


In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print("The training dataset contains %s rows and %s columns." % X_train.shape)
print("The test dataset contains %s rows and %s columns." % X_test.shape)


The training dataset contains 640 rows and 3 columns.
The test dataset contains 160 rows and 3 columns.

Now, we will use one partition for training and the other for testing, i.e. evaluating model performance on previously unseen data.


In [14]:
model_using_test_data = clf.fit(X_train, y_train)

y_pred = model_using_test_data.predict(X_test)

accuracy_using_test_data = accuracy_score(y_test, y_pred)
print("The accuracy score of the model for previously unseen data is: %s." % accuracy_using_test_data)


The accuracy score of the model for previously unseen data is: 0.1125.
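
Accuracy alone gives only part of the picture for a multiclass problem like this one. As a small aside, here is a minimal sketch of how the precision_score and recall_score functions we imported earlier could be applied to the same test predictions, macro-averaging over the 18 types (types the model never predicts may trigger undefined-metric warnings):


In [ ]:
# Sketch only: macro-averaged precision and recall on the held-out test set.
precision_using_test_data = precision_score(y_test, y_pred, average='macro')
recall_using_test_data = recall_score(y_test, y_pred, average='macro')

print("Macro-averaged precision on unseen data: %s." % precision_using_test_data)
print("Macro-averaged recall on unseen data: %s." % recall_using_test_data)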

6 Using a train, validation and test split

Even when we use a separate test dataset, there is still a risk of overfitting on the test set: if we repeatedly tweak the model until its test score improves, we are indirectly fitting it to that data.

To avoid knowledge about the test set "leaking" into the model, we may want to hold out an additional validation set: we tune the model against the validation set and only use the test set for the final evaluation.
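
For illustration, such a three-way split can be obtained by calling train_test_split twice; the 60/20/20 proportions and variable names below are only an example, not a recommendation.


In [ ]:
# Sketch only: hold out 20% for testing, then 20% of the original data for validation.
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=0)  # 0.25 of 80% = 20% of the original data

print("Training, validation and test sizes: %s, %s and %s." % (len(X_train), len(X_val), len(X_test)))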

By partitioning our data into three sets, however, we drastically reduce the number of samples available for training. Oftentimes, we cannot afford this.

In these cases, the common approach is to use cross-validation, something you will learn about in the upcoming units.
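
As a small preview, and only as a sketch (the 5-fold setup is an arbitrary choice here), scikit-learn's cross_val_score evaluates a classifier on several train/test folds without holding out a fixed validation set:


In [ ]:
# Preview only: 5-fold cross-validation, covered in detail in the upcoming units.
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("Cross-validated accuracy: %0.3f (+/- %0.3f)." % (cv_scores.mean(), cv_scores.std()))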