Using Pokemon Dataset from Kaggle: Link
This dataset contains a full set of in-game statistics for all 802 pokemon in the Sun and Moon games. It also includes full information on which pokemon can learn which moves (movesets.csv), what moves can do (moves.csv), and how damage is modified by pokemon type (type-chart.csv). For this project, however, I am going to ignore those additional files.
In [1]:
# Import the pandas library
import pandas as pd
In [2]:
# Read csv file from the path and store it in df
df = pd.read_csv('./eneskemal_HW.csv', encoding="ISO-8859-1",
usecols=[3,4,5,9,10,11,12,13,14,15]) # Specific columns to use
# Show the first 5 rows of the data
df.head()
# Show the last 5 rows of the data
# df.tail()
Out[2]:
In [3]:
# Check for missing values: count the non-null entries in each column
df.count(0)
Out[3]:
In [4]:
# Show the general information about the data
df.info()
Generate descriptive statistics of all columns (input and output) of your dataset. Descriptive statistics for numerical columns include: count, mean, std, min, 25th percentile (Q1), 50th percentile (Q2, median), 75th percentile (Q3), and max values of the columns. For categorical columns, determine the distinct values and their frequency in each categorical column.
Hint: Pandas, data frame describe() function.
In [5]:
# Descriptive information of the numerical columns
df.describe()
Out[5]:
In [6]:
# Categorical descriptive info for Type1 column
df['type1'].describe()
Out[6]:
In [7]:
# Categorical descriptive info for Type2 column
df['type2'].describe()
Out[7]:
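Note that `describe()` on a categorical column reports only count, unique, top and freq, while the task also asks for each distinct value and its frequency. A minimal sketch of how `value_counts()` could cover that, using a small made-up frame (`toy` and its values are illustrative assumptions, not the Kaggle file):

```python
import pandas as pd

# Hypothetical stand-in for the real data; values are made up for illustration
toy = pd.DataFrame({
    "type1": ["water", "fire", "water", "grass", "water", "fire"],
    "type2": ["flying", None, None, "poison", None, "flying"],
})

# Absolute frequency of each distinct value (missing values are excluded by default)
type1_counts = toy["type1"].value_counts()
print(type1_counts)

# Relative frequency, keeping missing values as their own category
type2_share = toy["type2"].value_counts(normalize=True, dropna=False)
print(type2_share)
```

Comparing the largest and smallest relative frequencies is also a quick way to spot a class imbalance in a categorical column.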
If the output column is numerical, then calculate the IQR (interquartile range, Q3-Q1) and the Range (difference between the max and min values). If your output column is categorical, then determine whether the column is nominal or ordinal, and explain why. Is there a class imbalance problem? (Check whether there is a big difference between the frequencies of the distinct values in your categorical output column.)
In [8]:
df['total'].describe()
Out[8]:
In [9]:
# I only want to explore the data, but for this task
# let's treat the total column as the output:
tot_info = df['total'].describe()
print("(IQR)-Interquartile Range: ", tot_info['75%'] - tot_info['25%'])
print("Range:", tot_info['max'] - tot_info['min'])
Note: The output column is numerical.
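The same two quantities can also be read directly from `quantile()`, `max()` and `min()` instead of the `describe()` result. A small sketch on made-up numbers (the `totals` series is an illustrative assumption, not taken from the dataset):

```python
import pandas as pd

# Made-up totals for illustration only
totals = pd.Series([180, 250, 300, 320, 405, 480, 500, 600, 680], name="total")

# Q1 and Q3 use pandas' default linear interpolation, the same as describe()
q1 = totals.quantile(0.25)
q3 = totals.quantile(0.75)

iqr = q3 - q1
value_range = totals.max() - totals.min()

print("(IQR)-Interquartile Range:", iqr)
print("Range:", value_range)
```

The IQR covers only the middle 50% of the values, so it is much less sensitive to extreme values than the range.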
In [10]:
# Matplotlib for additional customization
from matplotlib import pyplot as plt
%matplotlib inline
# Seaborn for plotting and styling
import seaborn as sns
# Seaborn configurations
sns.set(style="whitegrid", color_codes=True)
sns.set_style("ticks")
In [11]:
# Plot all the numerical columns in one combined boxplot
# (the numerical columns start at position 3, so select them by position)
sns.boxplot(data=df.iloc[:, 3:])
plt.title("Boxplot of all numerical data", fontsize=20)
plt.ylabel("Numerical Values")
plt.xlabel("Data Columns")
sns.despine()
plt.show()
In [13]:
# Plot the categorical type1 and type2 columns as pie charts
fig = plt.figure(figsize=(15,20))
ax1 = fig.add_subplot(211)
df['type1'].value_counts().plot(kind='pie')
ax2 = fig.add_subplot(212)
df['type2'].value_counts().plot(kind='pie')
plt.show()
In [14]:
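# Define a helper function to reuse for every numerical column
def dist_hist_plot(name):
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.histplot(df[name], kde=True, stat="density")
    plt.title((name+" points distribution"), fontsize=20)
    plt.show()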
In [15]:
lst_names = ["hp","attack","defense","spattack","spdefense","speed","total"]
# Plot the distribution of every numerical column
for name in lst_names:
    dist_hist_plot(name)
In [16]:
# num_df = df.iloc[:, 3:] # Selecting numerical data only
# Pairwise plots, colored by the categorical type1 column
sns.pairplot(df, hue="type1", palette="husl")
plt.show()
In [17]:
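# Correlation is only defined for numerical columns, so drop type1/type2 first
df.select_dtypes(include='number').corr()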
Out[17]:
In [20]:
df.select_dtypes(include='number').corr() > 0.8
# As the output shows, this data does not produce
# particularly useful correlations
Out[20]:
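A boolean matrix is hard to scan, partly because every column correlates perfectly with itself. One possible way to list only the distinct column pairs above the threshold is sketched below on made-up numbers (`toy` and its values are illustrative assumptions, not the real dataset):

```python
import numpy as np
import pandas as pd

# Made-up stats for illustration; "total" is built as the sum of the others
toy = pd.DataFrame({
    "hp":     [45, 60, 80, 39, 58, 78],
    "attack": [49, 62, 82, 52, 64, 84],
    "speed":  [90, 30, 70, 20, 85, 40],
})
toy["total"] = toy["hp"] + toy["attack"] + toy["speed"]

corr = toy.corr()

# Keep only the upper triangle; k=1 also drops the diagonal of 1.0 values
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Long format: one entry per column pair, then filter by the threshold
pairs = upper.stack()
strong_pairs = pairs[pairs > 0.8].sort_values(ascending=False)
print(strong_pairs)
```

Each remaining entry is indexed by a (column, column) pair, so every strongly correlated pair appears exactly once.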
In [21]:
fig = plt.figure(figsize=(15,20))
cor = df.select_dtypes(include='number').corr()  # numerical columns only
sns.heatmap(cor)
plt.show()
Select one of the numerical input columns in your dataset, and generate a scatter plot of the output column versus the input column. If the output column is categorical, then generate a box plot of the input column for each distinct value of the output column. For example, if your output has three distinct categorical values, plot one box plot of the input column for each of the three values in the output column.
Hint: check examples in Pandas, Matplotlib, plot(), scatter(), groupby(), get_group() functions
In [22]:
sns.lmplot(x='attack', y='defense', data=df)
plt.title("Attack points vs Defense points", fontsize=20)
plt.show()
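For the categorical case described in the task, the `groupby()` and `get_group()` functions from the hint could be combined roughly as follows. This is only a sketch on made-up data (`toy` and its values are assumptions); it treats `type1` as if it were the output column and `attack` as the input column:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "type1":  ["water", "fire", "water", "grass", "fire", "grass"],
    "attack": [48, 52, 65, 49, 84, 62],
})

grouped = toy.groupby("type1")
labels = list(grouped.groups)  # distinct values of the categorical column

# One array of input values per distinct output value
samples = [grouped.get_group(label)["attack"].to_numpy() for label in labels]

# One box per distinct value; boxes are drawn at x positions 1..n
fig, ax = plt.subplots()
ax.boxplot(samples)
ax.set_xticks(range(1, len(labels) + 1))
ax.set_xticklabels(labels)
ax.set_xlabel("type1")
ax.set_ylabel("attack")
ax.set_title("Attack points per type1 value")
plt.show()
```

With seaborn, `sns.boxplot(x="type1", y="attack", data=toy)` should give an equivalent figure without the explicit grouping.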