This notebook performs exploratory analysis on the European Soccer dataset before new feature creation.
Additional exploration of the new features is located in the feature-creation notebook.
In [200]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
The first step is to read in the CSV files created by the extraction notebook.
In [201]:
matches = pd.read_csv('/Users/mtetkosk/Google Drive/Data Science Projects/data/processed/EPL_matches.csv')
print(len(matches))
print(matches.head())
In [202]:
matches.columns[:11] # Columns 1 - 10 identify the match and the number of goals scored by each team
Out[202]:
In [203]:
matches.columns[85:] # Columns 85 - 115 are betting odds from different websites
Out[203]:
In [204]:
matches.columns[11:55] # Columns 11-55 are (X,Y) coordinates for players on the pitch - Describing formation
Out[204]:
In [205]:
matches.columns[55:85] # Columns 55 - 77 give the player names. Columns 77-84 give some statistics based on the match.
Out[205]:
Let's remove any variables from the matches DataFrame that we won't need for this analysis.
In [206]:
matches_reduced = matches.copy()
removecols = matches.columns[11:85]
removecols_other = ['country_id','league_id']
In [207]:
for col in matches_reduced.columns:
    if col in removecols or col in removecols_other:
        del matches_reduced[col]
In [208]:
print(matches_reduced.shape)  # Column count after dropping the formation, player, and id columns
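The deletion loop above can also be written as a single `DataFrame.drop` call. A minimal sketch on synthetic column names, since the real CSV is not available here:

```python
import pandas as pd
import numpy as np

# Synthetic stand-in for the matches table: 6 columns instead of 115
df = pd.DataFrame(np.zeros((3, 6)),
                  columns=['match_api_id', 'country_id', 'league_id',
                           'home_player_X1', 'B365H', 'season'])

# Drop a positional slice plus named columns in one call
removecols = df.columns[3:4]          # analogous to matches.columns[11:85]
removecols_other = ['country_id', 'league_id']
reduced = df.drop(columns=list(removecols) + removecols_other)
print(reduced.columns.tolist())  # ['match_api_id', 'B365H', 'season']
```

`drop(columns=...)` returns a new frame, so the original stays intact without an explicit `copy()`.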
In [209]:
matches_reduced.season.value_counts()  # Equal number of matches per season
Out[209]:
In [210]:
# What does the 'stage' variable mean?
matches_reduced[matches_reduced.season=='2008/2009'].stage.value_counts()
Out[210]:
The 'stage' variable must mean the 'week' of the season: each stage consists of 10 matches, which gives a way to group matches by date.
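The stage-as-week reading can be checked by counting matches per (season, stage) pair. A sketch on synthetic rows, assuming the same 'season' and 'stage' column names as the dataset:

```python
import pandas as pd

# Synthetic matches: 2 stages with 3 matches each in one season
toy = pd.DataFrame({
    'season': ['2008/2009'] * 6,
    'stage':  [1, 1, 1, 2, 2, 2],
})

# If 'stage' really is a match week, every group should be the same size
per_stage = toy.groupby(['season', 'stage']).size()
print(per_stage.nunique() == 1)  # True: every stage has equal matches
```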
In [211]:
matches_reduced.head()
Out[211]:
Now let's check for missing values.
In [212]:
null_dict = {}
for col in matches_reduced.columns[4:]:
    nulls = matches_reduced[col].isnull().sum()
    if nulls > 0:
        null_dict[col] = nulls
null_dict
Out[212]:
Many of the betting-odds columns have null values. Let's remove the columns with excessive nulls.
In [213]:
for key in null_dict.keys():
    if null_dict[key] > 10:
        del matches_reduced[key]
matches_reduced.shape
Out[213]:
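The two loops above (count nulls per column, then delete the columns over the threshold) collapse into one boolean mask. A sketch assuming the same cutoff of 10 nulls:

```python
import pandas as pd

# Synthetic frame: one clean column, one with few nulls, one with many
df = pd.DataFrame({
    'clean':      list(range(20)),
    'few_nulls':  [None] * 5 + list(range(15)),
    'many_nulls': [None] * 15 + list(range(5)),
})

# Keep only the columns whose null count is within the threshold
kept = df.loc[:, df.isnull().sum() <= 10]
print(kept.columns.tolist())  # ['clean', 'few_nulls']
```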
In [214]:
matches_reduced.to_csv('/Users/mtetkosk/Google Drive/Data Science Projects/data/processed/EPL_Matches_Reduced.csv',index= False)
In [215]:
team_attributes = pd.read_csv('/Users/mtetkosk/Google Drive/Data Science Projects/data/processed/EPL_team_attributes.csv')
print(len(team_attributes))
print(team_attributes.head())
In [217]:
team_attributes['date']
Out[217]:
In [75]:
team_attributes.columns
Out[75]:
In [97]:
null_dict = {}
for col in team_attributes.columns[4:]:
    nulls = team_attributes[col].isnull().sum()
    if nulls > 0:
        null_dict[col] = nulls
    if team_attributes[col].dtype == 'int64' or team_attributes[col].dtype == 'float64':
        team_attributes[col].plot(kind='hist')
        plt.xlabel(col)
        plt.title(col + ' Histogram')
        plt.show()
    elif team_attributes[col].dtype == 'object':
        team_attributes[col].value_counts().plot(kind='bar')  # Build-up play passing class value counts total 204, no nulls
        plt.title(col + ' Bar Chart')
        plt.show()
In [98]:
null_dict
Out[98]:
From the 'null_dict' object, only the numeric attribute 'buildUpPlayDribbling' has null values.
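Since 'buildUpPlayDribbling' is the only attribute with nulls, a median fill is one simple option before modeling. A sketch on synthetic values (the column name matches the dataset; the numbers are made up):

```python
import pandas as pd
import numpy as np

# Synthetic team attributes with a few missing dribbling scores
ta = pd.DataFrame({'buildUpPlayDribbling': [48.0, np.nan, 52.0, np.nan, 50.0]})

# Fill nulls with the column median so no rows are lost
median = ta['buildUpPlayDribbling'].median()
ta['buildUpPlayDribbling'] = ta['buildUpPlayDribbling'].fillna(median)
print(ta['buildUpPlayDribbling'].isnull().sum())  # 0
```

Median imputation keeps the column's distribution roughly intact; dropping the affected rows would discard otherwise complete records.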
In [9]:
teams = pd.read_csv('/Users/mtetkosk/Google Drive/Data Science Projects/data/processed/EPL_teams.csv')
print(len(teams))
print(teams.head())
In [218]:
teams.head()
Out[218]:
In [12]:
player_attributes = pd.read_csv('/Users/mtetkosk/Google Drive/Data Science Projects/data/processed/Player_Attributes.csv')
print(len(player_attributes))
print(player_attributes.head())
In [13]:
players = pd.read_csv('/Users/mtetkosk/Google Drive/Data Science Projects/data/processed/Players.csv')
print(len(players))
print(players.head())
In [ ]: