Data originally from Kaggle datasets: Pokemon with stats.
From Kaggle's description:
This data set includes 721 Pokemon, including their number, name, first and second type, and basic stats.
These are the raw attributes that are used for calculating how much damage an attack will do in the games.
Apart from the usual imports (i.e. numpy, pandas and matplotlib), we will be importing:
- accuracy_score, precision_score and recall_score from sklearn's metrics module, for model evaluation
- train_test_split, a method from sklearn.model_selection that conveniently partitions the raw data into training and test sets
- DecisionTreeClassifier from sklearn.tree, a classifier that we will use to exemplify overfitting with and without a training and test split.
In [1]:
import numpy as np
import pandas as pd
%matplotlib inline
from matplotlib import pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
We will import the Pokemon data from the data folder, placed in the root directory for the unit, using the read_csv method from pandas.
In [2]:
data = pd.read_csv('../data/pokemon.csv')
For convenience, we will rename all columns to upper case, so we don't have to remember what is upper or lower case in the future.
In [3]:
data.columns = data.columns.str.upper()
We will also change the index of the dataframe to be the Pokemon's name, and use a regular expression to strip the redundant base name that the raw file prepends to alternate forms (Mega, Primal, and so on).
In [4]:
data = data.set_index('NAME')
# Strip the redundant base name from alternate forms. regex=True is required
# in recent pandas, where str.replace defaults to literal matching.
data.index = data.index.str.replace(".*(?=Mega|Primal|Origin|Therian|Land|Incarnate)", "", regex=True)
# Drop the numeric ID column, which we won't need.
data = data.drop(['#'], axis=1)
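As a quick, illustrative check of what the regular expression does (assuming the raw Kaggle names concatenate the base species with the form, e.g. 'VenusaurMega Venusaur'):
In [ ]:
import re
# Everything before the form keyword is matched by the lookahead pattern
# and removed, leaving only the alternate-form name.
re.sub(r".*(?=Mega|Primal|Origin|Therian|Land|Incarnate)", "", "VenusaurMega Venusaur")
# -> 'Mega Venusaur'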
We are ready to take a look at the data for the first time!
Instead of simply calling head on the dataframe, let's precede it with sort_values, to get the Top 3 most powerful Pokemon in the dataset.
In [5]:
most_powerful = data.sort_values('TOTAL', ascending=False)
most_powerful.head(n=3)
Out[5]:
If you are in doubt, the top entry is Mega Rayquaza. But what about the most powerful Pokemon by type (Type 1)?
In [6]:
most_powerful_by_type = most_powerful.drop_duplicates(subset=['TYPE 1'], keep='first')
most_powerful_by_type
Out[6]:
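As a side note, the same table can be built without relying on the sort order, using groupby and idxmax; a minimal sketch (ties are resolved by whichever row idxmax encounters first):
In [ ]:
# For each primary type, look up the index label (the Pokemon's name) of the
# row with the highest TOTAL, then select those rows.
data.loc[data.groupby('TYPE 1')['TOTAL'].idxmax()]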
We will start by selecting the features we want to use and the label.
We will try to predict the Pokemon type using three main features to represent the Pokemon entity:
In [7]:
columns = ['ATTACK', 'DEFENSE', 'SPEED', 'TYPE 1']
data_clf = data[columns]
Now, let's briefly describe the raw dataset.
In [8]:
print("The dataset contains %s rows and %s columns." % data.shape)
print("The dataset columns are: %s."% data.columns.values)
data_clf.describe()
Out[8]:
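Before modeling, it also helps to know how many classes we are predicting and how balanced they are, since that sets the baseline against which accuracy should be read. A quick, optional look:
In [ ]:
# Count the distinct primary types and show the most frequent ones.
# value_counts sorts by frequency in descending order.
print("Number of distinct types: %s." % data_clf['TYPE 1'].nunique())
print(data_clf['TYPE 1'].value_counts().head(n=5))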
Time to separate our features from the labels; then we're ready to train a simple model.
In [9]:
X = data_clf.drop(['TYPE 1'], axis=1)
y = data_clf['TYPE 1']
We want to use a Decision Tree classifier.
In [11]:
clf = DecisionTreeClassifier(random_state=0)
Using the same dataset for training and testing our model, we get a remarkable accuracy score!
In [12]:
model_using_all_data = clf.fit(X, y)
y_pred = model_using_all_data.predict(X)
accuracy_using_all_data = accuracy_score(y, y_pred)
print("The accuracy score of the model is: %s." % accuracy_using_all_data)
results_using_all_data = data.copy()  # copy, so we don't mutate the original dataframe
results_using_all_data['PREDICTED'] = y_pred
failures = results_using_all_data['TYPE 1'] != results_using_all_data['PREDICTED']
results_using_all_data[failures].head(n=10)
Out[12]:
We're going to set aside 20% of the data for testing, which we will not use to train our model.
In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print("The training dataset contains %s rows and %s columns." % X_train.shape)
print("The test dataset contains %s rows and %s columns." % X_test.shape)
Now, we will use one partition for training and the other for testing or evaluating model performance on previously unseen data.
In [14]:
model_using_test_data = clf.fit(X_train, y_train)
y_pred = model_using_test_data.predict(X_test)
accuracy_using_test_data = accuracy_score(y_test, y_pred)
print("The accuracy score of the model for previously unseen data is: %s." % accuracy_using_test_data)
Even when using a test dataset, there is still a risk of overfitting on the test set: if we tweak the model repeatedly to improve its test score, knowledge about the test set can "leak" into the model. To avoid this, we may want to hold out yet another partition, a validation set.
By partitioning our data into three sets, however, we drastically reduce the number of samples that can be used for training. Oftentimes, this is not affordable.
In these cases, the common approach is to use cross-validation, something you will learn about in the upcoming units.
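As a preview, a minimal cross-validation sketch using cross_val_score from sklearn.model_selection: each of the 5 folds serves once as the held-out set, so every sample is used for both training and evaluation.
In [ ]:
from sklearn.model_selection import cross_val_score

# Train and evaluate the tree 5 times, holding out a different fifth of the
# data each time, and report the mean and spread of the accuracy scores.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("Cross-validated accuracy: %s (+/- %s)." % (scores.mean(), scores.std()))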