Extracting features
Author: Carl Toews
File: extract_features.ipynb
In order to implement machine learning algorithms, we need to develop a set of informative features. Following Machine learning for predicting professional tennis matches [Sipko, 2015], for each match we assign one player to be "Player 0" and the other to be "Player 1", and call the outcome a 0 if Player 0 won and a 1 otherwise. For each match, we produce a set of features, each a measure of difference between some characteristic of the two players. The characteristics we consider include the following (many of these are from Sipko):
- rank: ATP rank
- rankpts: ATP rank points
- height: player height
- hand: playing hand (right/left)
- fsp: first serve percentage
- wfs: winning first serve percentage
- wss: winning second serve percentage
- wsp: overall winning on serve percentage
- wrp: winning on return percentage
- tpw: total points won percentage
- acpg: aces per game
- dfpg: double faults per game
- bps: break points saved percentage
- tmw: total matches won
- retired: True if 1st match back since retirement
- fatigue: recent match load
- complete: player completeness (wsp × wrp)
- serveadv: serve advantage (wsp minus opponent's wrp)
- direct: head-to-head record
Note that rank, rankpts, height, and hand can be read from player data directly with no calculation. On the other hand, fsp, wfs, wss, wsp, wrp, tpw, acpg, dfpg, and bps can be calculated for any given match, but for purposes of predicting a future match they need to be averaged over the historical record. Finally, tmw, retired, fatigue, complete, serveadv, and direct are not calculated "match-by-match", but rather derived from the historical record.
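As a toy illustration of this encoding (made-up numbers): each feature is a Player 0 minus Player 1 difference, so relabeling the two players flips both the feature signs and the outcome label.

```python
p0 = {"rank": 12, "ht": 188}   # hypothetical Player 0
p1 = {"rank": 45, "ht": 183}   # hypothetical Player 1

features = {k: p0[k] - p1[k] for k in p0}   # from Player 0's perspective
flipped = {k: p1[k] - p0[k] for k in p0}    # players relabeled

print(features)   # {'rank': -33, 'ht': 5}
```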
In order to derive these features, we'll first need to clean the data a bit. Specifically, we need to deal with missing or null values, as well as rectify incorrect values. Examples of issues include irregular score strings (retirements, walkovers), matches recorded with zero service games, and infinities arising from division by zero in the derived percentages.
Import statements
In [474]:
import sqlalchemy # pandas-mysql interface library
import sqlalchemy.exc # exception handling
from sqlalchemy import create_engine # needed to define db interface
import sys # for defining behavior under errors
import numpy as np # numerical libraries
import scipy as sp
import pandas as pd # for data analysis
import pandas.io.sql as sql # for interfacing with MySQL database
import matplotlib as mpl # a big library with plotting functionality
import matplotlib.pyplot as plt # a subset of matplotlib with most of the useful tools
import IPython as IP
%matplotlib inline
import pdb
#%qtconsole
Player and odds data from 2010-2016 have been matched and stored. Retrieve, merge, and rename.
In [973]:
pickle_dir = '../pickle_files/'
odds_file = 'odds.pkl'
matches_file = 'matches.pkl'
odds= pd.read_pickle(pickle_dir + odds_file)
matches= pd.read_pickle(pickle_dir + matches_file)
data = pd.merge(matches,odds[['PSW','PSL','key']],how='inner',on='key')
Get additional training data. We'll include data from 2005 onward, excluding Davis Cup matches and focusing exclusively on players who appear in our 2010-2016 set.
In [1710]:
# name of database
db_name = "tennis"
# name of db user
username = "testuser"
# db password for db user
password = "test623"
# location of atp data files
atpfile_directory = "../data/tennis_atp-master/"
# focus on most recent data; exclude Davis Cup stuff
startdate = '20050101'
enddate = '20161231'
engine = create_engine('mysql+mysqldb://' + username + ':' + password + '@localhost/' + db_name)
# get unique winners and losers in our set
players = tuple(pd.concat((data.winner_id,data.loser_id)).unique())
# load all data pertinent for any player
with engine.begin() as connection:
matches_hist = pd.read_sql_query("""SELECT * FROM matches \
WHERE tourney_date >= '""" + startdate + """' \
AND tourney_date <= '""" + enddate + """' \
AND (winner_id IN %(p)s \
OR loser_id IN %(p)s) \
AND tourney_name NOT LIKE 'Davis%%';""",connection,params={'p':players})
Many of the features we will develop will involve quantities derived from previous matches. Examples include weighted historical averages of winning first serves, double faults, etc. This section calculates some important quantities for each match.
Insert index on matches_hist
In [1711]:
matches_hist['key'] = np.arange(len(matches_hist))
Extract text string indicating unusual match outcomes
In [1712]:
# scores are just numbers, unless something weird happened. Extract comments about irregular outcomes.
t=matches_hist.score.str.extractall('(?P<comment>[a-zA-Z]+.+)').xs(0,level='match')
matches_hist = pd.merge(matches_hist,t,how='outer',left_index=True, right_index=True)
matches_hist.comment.unique()
Out[1712]:
Calculate games scores for each set, store in a separate dataframe
In [1713]:
# discard comments and trailing white space
scores = matches_hist.score.str.replace('(?P<comment>[a-zA-Z]+.+)','')
scores = scores.str.replace('(?P<comment>\([0-9]+\))','').str.strip()
# split the game scores into columns of a dataframe
scores = scores.str.split('-|\s',expand=True)
scores.columns=['W1','L1','W2','L2','W3','L3','W4','L4','W5','L5']
scores = scores.apply(lambda x: pd.to_numeric(x))
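The parsing steps above can be checked on a couple of hypothetical score strings: strip comments (e.g. 'RET'), drop tiebreak point counts like '(5)', then split set scores into numeric winner/loser game columns.

```python
import pandas as pd

raw = pd.Series(["6-4 3-6 7-6(5)", "6-2 3-0 RET"])
s = raw.str.replace(r"[a-zA-Z]+.+", "", regex=True)           # remove comments
s = s.str.replace(r"\([0-9]+\)", "", regex=True).str.strip()  # remove tiebreak detail
games = s.str.split(r"-|\s", expand=True).apply(pd.to_numeric)
print(games)
```

The shortened second match leaves trailing columns as NaN, which is why the per-set columns end up as floats.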
Store the number of games played in each match.
In [1714]:
ngames = np.sum(scores,axis=1)
matches_hist.insert(0,'ngames',ngames.astype('int'))
It seems a few matches were cut short for no recorded reason:
In [1715]:
# sanity check: are matches involving fewer than 12 games identical to those with comments?
idx1 = (ngames<12)
idx2 = matches_hist.comment.notnull()
z=(idx1*1)*(idx2*1-1)
zz = np.where(np.abs(z))[0]
print("matches with weird outcomes: ")
print(matches_hist.loc[zz,'score'])
Calculate first serve percentages (fsp) for both winners and losers.
In [1716]:
matches_hist.insert(0,'w_fsp',matches_hist.w_1stIn/matches_hist.w_svpt)
matches_hist.insert(0,'l_fsp',matches_hist.l_1stIn/matches_hist.l_svpt)
Calculate winning first serve percentages (wfs)
In [1717]:
matches_hist.insert(0,'w_wfs',matches_hist.w_1stWon/matches_hist.w_svpt)
matches_hist.insert(0,'l_wfs',matches_hist.l_1stWon/matches_hist.l_svpt)
Calculate second serves in (2ndIn) for both winners and losers
In [1718]:
matches_hist.insert(0,'w_2ndIn',matches_hist.w_svpt-matches_hist.w_df-matches_hist.w_1stIn)
matches_hist.insert(0,'l_2ndIn',matches_hist.l_svpt-matches_hist.l_df-matches_hist.l_1stIn)
Calculate second serve (ssp) percentages
In [1719]:
matches_hist.insert(0,'w_ssp',matches_hist.w_2ndIn/(matches_hist.w_2ndIn+matches_hist.w_df))
matches_hist.insert(0,'l_ssp',matches_hist.l_2ndIn/(matches_hist.l_2ndIn+matches_hist.l_df))
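The two serve-count identities above can be checked with made-up single-match totals: second serves in = serve points - double faults - first serves in, and the second-serve percentage is 2ndIn / (2ndIn + df).

```python
svpt, first_in, df = 80, 48, 4        # hypothetical totals for one player
second_in = svpt - df - first_in      # 28 second-serve points landed in
ssp = second_in / (second_in + df)    # 28 / 32
print(second_in, ssp)                 # 28 0.875
```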
Calculate winning second serve percentages (wss)
In [1720]:
matches_hist.insert(0,'w_wss',matches_hist.w_2ndWon/matches_hist.w_2ndIn)
matches_hist.insert(0,'l_wss',matches_hist.l_2ndWon/matches_hist.l_2ndIn)
Calculate overall win on serve percentages (wsp)
In [1860]:
#matches_hist.insert(0,'w_wsp',(matches_hist.w_1stWon + matches_hist.w_2ndWon)/matches_hist.w_svpt)
#matches_hist.insert(0,'l_wsp',(matches_hist.l_1stWon+matches_hist.l_2ndWon)/matches_hist.l_svpt)
matches_hist['w_wsp']=(matches_hist.w_1stWon + matches_hist.w_2ndWon)/matches_hist.w_svpt
matches_hist['l_wsp']=(matches_hist.l_1stWon+matches_hist.l_2ndWon)/matches_hist.l_svpt
Calculate winning on return percentages (wrp).
[(# of opponent serves) - (# of opponent service victories)]/(# of opponent serves)
In [1722]:
matches_hist.insert(0,'w_wrp',(matches_hist.l_svpt - matches_hist.l_1stWon \
- matches_hist.l_2ndWon)/(matches_hist.l_svpt))
matches_hist.insert(0,'l_wrp',(matches_hist.w_svpt - matches_hist.w_1stWon \
- matches_hist.w_2ndWon)/(matches_hist.w_svpt))
Calculate total points won percentage (tpw)
In [1723]:
matches_hist.insert(0,'w_tpw',(matches_hist.l_svpt\
-matches_hist.l_1stWon-matches_hist.l_2ndWon\
+matches_hist.w_1stWon +matches_hist.w_2ndWon)/\
(matches_hist.l_svpt + matches_hist.w_svpt))
matches_hist.insert(0,'l_tpw',(matches_hist.w_svpt\
-matches_hist.w_1stWon-matches_hist.w_2ndWon\
+matches_hist.l_1stWon +matches_hist.l_2ndWon)/\
(matches_hist.l_svpt + matches_hist.w_svpt))
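A quick sanity sketch with made-up totals: the winner's and loser's total-points-won fractions share the same denominator, so they must sum to 1.

```python
# Hypothetical serve-point totals for one match.
w_svpt, w_1stWon, w_2ndWon = 80, 35, 15
l_svpt, l_1stWon, l_2ndWon = 75, 30, 12
w_tpw = (l_svpt - l_1stWon - l_2ndWon + w_1stWon + w_2ndWon) / (l_svpt + w_svpt)
l_tpw = (w_svpt - w_1stWon - w_2ndWon + l_1stWon + l_2ndWon) / (l_svpt + w_svpt)
print(w_tpw, l_tpw)
```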
Calculate double faults per game (dfpg)
There are a couple of bad entries for SvGms. We'll first fix those. (Note: the fix evenly partitions total games between both players. This could result in a fractional number of service games. My hunch is the total effect on the stats is minimal.)
In [1828]:
idx = np.where(((matches_hist.w_SvGms == 0)|(matches_hist.l_SvGms==0)) & (matches_hist.ngames >1))
print(matches_hist.loc[idx[0],['w_df','l_df','w_SvGms','l_SvGms','score','ngames']])
matches_hist.loc[idx[0],'w_SvGms'] = matches_hist.ngames[idx[0]]/2
matches_hist.loc[idx[0],'l_SvGms'] = matches_hist.ngames[idx[0]]/2
print(matches_hist.loc[idx[0],['w_df','l_df','w_SvGms','l_SvGms','score','ngames']])
In [1851]:
matches_hist.insert(0,'w_dfpg',matches_hist.w_df/matches_hist.w_SvGms)
matches_hist.insert(0,'l_dfpg',matches_hist.l_df/matches_hist.l_SvGms)
#matches_hist['w_dfpg']=matches_hist.w_df/matches_hist.w_SvGms
#matches_hist['l_dfpg']=matches_hist.l_df/matches_hist.l_SvGms
Calculate aces per game (acpg)
In [1855]:
matches_hist.insert(0,'w_acpg',matches_hist.w_ace/matches_hist.w_SvGms)
matches_hist.insert(0,'l_acpg',matches_hist.l_ace/matches_hist.l_SvGms)
#matches_hist['w_acpg']=matches_hist.w_ace/matches_hist.w_SvGms
#matches_hist['l_acpg']=matches_hist.l_ace/matches_hist.l_SvGms
Calculate break points saved percentage (bps)
In [ ]:
matches_hist.insert(0,'w_bps',matches_hist.w_bpSaved/matches_hist.w_bpFaced)
matches_hist.insert(0,'l_bps',matches_hist.l_bpSaved/matches_hist.l_bpFaced)
Flag games with premature closure, probably due to injury (retired)
In [1727]:
matches_hist.insert(0,'retired',0)
matches_hist.loc[(matches_hist.comment=='RET'),'retired']=1
Flag games won as a walkover (wo)
In [1728]:
matches_hist.insert(0,'walkover',0)
matches_hist.loc[(matches_hist.comment=='W/O'),'walkover']=1
Calculate player completeness (complete), defined as $$ wsp \times wrp $$
Note: might be more useful to aggregate first, rather than compute on a per-match basis.
In [1738]:
matches_hist.insert(0,'w_complete',matches_hist.w_wsp*matches_hist.w_wrp)
matches_hist.insert(0,'l_complete',matches_hist.l_wsp*matches_hist.l_wrp)
Calculate player service advantage (serveadv), defined as
$$ wsp_1 - wrp_2 $$
Note: as with complete, it might be more useful to aggregate first.
In [1739]:
matches_hist.insert(0,'w_serveadv',matches_hist.w_wsp-matches_hist.l_wrp)
matches_hist.insert(0,'l_serveadv',matches_hist.l_wsp-matches_hist.w_wrp)
Sanity check: investigate calculated quantities
In [1861]:
idx = matches_hist.comment.isnull()
labels = ['dfpg', 'acpg', 'tpw', 'wrp', 'wsp', 'wss', 'wfs', 'fsp', 'ssp', 'bps','complete']
for label in labels:
printstr = label + ": max for winner/loser is {:5.2f}/{:5.2f}, min for winner/loser is {:5.2f}/{:5.2f}"
v1 = matches_hist['w_' + label][idx].max()
v2 = matches_hist['l_' + label][idx].max()
v3 = matches_hist['w_' + label][idx].min()
v4 = matches_hist['l_' + label][idx].min()
print(printstr.format(v1,v2,v3,v4))
Extract winner and loser data separately and concatenate into dataframe all_records with uniform column names.
In [1862]:
# extract winner stats
w_records = matches_hist[['winner_id',
'tourney_date',
'tourney_id',
'match_num',
'ngames',
'key',
'w_acpg', # avg. no of aces per game
'w_dfpg', # avg no. of double faults per game
'w_tpw', # total points won
'w_wrp', # winning return percent
'w_wsp', # winning service percent
'w_wss', # winning second serve percent
'w_wfs', # winning first serve percent
'w_fsp', # good first serves percent
'w_ssp', # good second serves percent
'w_bps', # breakpoints saved percent
'retired',# 1 if loser retired prematurely
'walkover', # 1 if loser didn't show up
'surface', # 'Hard', 'Clay', or 'Grass'
'winner_age', # age
'winner_ht', # height
'winner_rank', # rank
'winner_rank_points' # rank points
]]
# rename columns
newcols = {'winner_id':'pid',
'tourney_date':'date',
'tourney_id':'tid',
'match_num':'mid',
'ngames':'ngames',
'key':'key',
'w_acpg':'acpg',
'w_dfpg':'dfpg',
'w_tpw':'tpw',
'w_wrp':'wrp',
'w_wsp':'wsp',
'w_wss':'wss',
'w_wfs':'wfs',
'w_fsp':'fsp',
'w_ssp':'ssp',
'w_bps':'bps',
'retired':'retired',
'walkover':'walkover',
'surface':'surface',
'winner_age':'age',
'winner_ht':'ht',
'winner_rank':'rank',
'winner_rank_points':'rank_points'
}
w_records = w_records.rename(columns = newcols)
# record that the outcome was a victory for these players
w_records['outcome'] = np.ones(len(w_records))
# extract loser stats
l_records = matches_hist[['loser_id',
'tourney_date',
'tourney_id',
'match_num',
'ngames',
'key',
'l_acpg', # avg. no of aces per game
'l_dfpg', # avg no. of double faults per game
'l_tpw', # total points won
'l_wrp', # winning return percent
'l_wsp', # winning service percent
'l_wss', # winning second serve percent
'l_wfs', # winning first serve percent
'l_fsp', # percent of successful first serves
'l_ssp', # percent of successful second serves
'l_bps', # percent of breakpoints saved
'retired',# 1 if loser retired prematurely
'walkover',# 1 if loser didn't show up
'surface', # 'Hard', 'Clay', or 'Grass'
'loser_age', # age
'loser_ht', # height
'loser_rank', # rank
'loser_rank_points' # rank points
]]
# rename columns
newcols = {'loser_id':'pid',
'tourney_date':'date',
'tourney_id':'tid',
'match_num':'mid',
'ngames':'ngames',
'key':'key',
'l_acpg':'acpg',
'l_dfpg':'dfpg',
'l_tpw':'tpw',
'l_wrp':'wrp',
'l_wsp':'wsp',
'l_wss':'wss',
'l_wfs':'wfs',
'l_fsp':'fsp',
'l_ssp':'ssp',
'l_bps':'bps',
'retired':'retired',
'walkover':'walkover',
'surface':'surface',
'loser_age':'age',
'loser_ht':'ht',
'loser_rank':'rank',
'loser_rank_points':'rank_points'
}
l_records = l_records.rename(columns = newcols)
# record outcome as a loss
l_records['outcome'] = np.zeros(len(w_records))
# fuse all the data into one dataframe
all_records = pd.concat([w_records,l_records]).reset_index().sort_values(['key']).replace(np.inf,np.nan)
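The wide-to-long restructuring above can be sketched on a hypothetical two-match frame: winner and loser stats become one row per (match, player), with a shared `outcome` column.

```python
import pandas as pd

m = pd.DataFrame({"key": [0, 1],
                  "winner_id": [101, 102], "loser_id": [103, 101],
                  "w_fsp": [0.62, 0.58], "l_fsp": [0.55, 0.60]})
# winner rows
w = m[["key", "winner_id", "w_fsp"]].rename(columns={"winner_id": "pid", "w_fsp": "fsp"})
w["outcome"] = 1.0
# loser rows
l = m[["key", "loser_id", "l_fsp"]].rename(columns={"loser_id": "pid", "l_fsp": "fsp"})
l["outcome"] = 0.0
all_recs = pd.concat([w, l]).sort_values("key").reset_index(drop=True)
print(all_recs)
```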
Calculate surface weighting matrix. Note that the resulting values are quite different from Sipko's.
In [1863]:
grouped = all_records.groupby(['pid','surface'])
t=grouped['outcome'].mean()
surf_wt = t.unstack(level=-1).corr()
surf_wt
Out[1863]:
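The surface-weighting computation above can be illustrated on made-up data: per-player win rates on each surface, unstacked into one column per surface, whose cross-surface correlation matrix serves as the surface discount table.

```python
import pandas as pd

toy = pd.DataFrame({
    "pid":     [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "surface": ["Hard", "Clay", "Grass"] * 3,
    "outcome": [1, 0, 1, 0, 1, 0, 1, 1, 0],   # hypothetical results
})
rates = toy.groupby(["pid", "surface"])["outcome"].mean().unstack(level=-1)
surf_wt_demo = rates.corr()
print(surf_wt_demo)
```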
Function to calculate static features. (Specifically, calculate normalized rank, rankpts, age, height, hand features for every match in the dataset. Operates on the data as a whole, rather than by line.)
In [1914]:
def get_static_features(data):
"""
Description: returns differences of those features that don't depend on match histories.
(rank, rank points, height, age, hand)
Input: dataframe with all the match data for which features are to be calculated
Output: dataframe of the same length with one column per feature, plus outcome and key columns for merging
"""
# boolean: 1 means the winner is Player 1, 0 means the winner is Player 0
outcome = data['outcome']
# features dataframe should include merge identifiers
features = data[['outcome','key']].copy()
# rank (normalize)
rank=(data.loser_rank-data.winner_rank)*(-1)**outcome
features.insert(0,'rank',rank/rank.std())
# rank points (normalize)
rankpts = (data.loser_rank_points-data.winner_rank_points)*(-1)**outcome
features.insert(0,'rankpts',rankpts/rankpts.std())
# height (normalize)
height = (data.loser_ht-data.winner_ht)*(-1)**outcome
features.insert(0,'height',height/height.std())
# age (normalize)
age = (data.loser_age-data.winner_age)*(-1)**outcome
features.insert(0,'age',age/age.std())
# hand (1 for right, 0 for left)
hand = ((data.loser_hand=='R')*1-(data.winner_hand=='R')*1)*(-1)**outcome
hand.iloc[np.where((data['winner_hand']=='U')|\
(data['loser_hand']=='U'))[0]]=np.nan
features.insert(0,'hand',hand)
return features
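A minimal check (hypothetical ranks) of the `(-1)**outcome` trick used above: flipping the outcome label flips the sign of a difference feature.

```python
loser_rank, winner_rank = 45, 12
f0 = (loser_rank - winner_rank) * (-1) ** 0   # winner labeled Player 0
f1 = (loser_rank - winner_rank) * (-1) ** 1   # winner labeled Player 1
print(f0, f1)   # 33 -33
```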
Function to get dynamic features (i.e. all features that require some time averaging.)
In [1894]:
def get_dynamic_features(x):
"""
Input: a row of the dataframe. Needs to have the following fields:
pid (player id number)
tid (tournament id number)
mid (match id number)
date (match date)
surface (match surface)
Output: a series of time- and surface-discounted historical averages of the per-match stats, plus complete, serveadv, and retired
"""
# extract identifiers and date from input row
pid = x['pid'] # player id
tid = x['tid'] # tourney id
mid = x['mid'] # match id
date = x['date']
surface = x['surface']
# extract all historical records for this player, from before this match
records = all_records.loc[(all_records.pid==pid) & (all_records.date <= date) &\
((all_records.tid != tid) | (all_records.mid != mid)),:].copy()
# get time discount factor
p = 0.8
t = (date - records.date).apply(lambda x: x.days/365)
t_wt = p**t
t_wt.loc[t_wt>p]=p
# get surface discount factor
s_wt = records.surface.apply(lambda x: surf_wt.loc[x,surface])
# get time and court weighted averages of serve and performance stats
t = records[['dfpg','acpg','tpw','wrp','wsp',\
'wss','wfs','fsp','ssp','bps']].mul(t_wt*s_wt,axis=0).sum(axis=0)/\
records[['dfpg','acpg','tpw','wrp','wsp',\
'wss','wfs','fsp','ssp','bps']].notnull().mul(t_wt*s_wt,axis=0).sum(axis=0)
if len(records)==0:
t['complete']=np.nan
t['retired']=np.nan
return t
# get player completeness
t['complete'] = t['wsp']*t['wrp']
# get player serveadvantage
t['serveadv'] = t['wsp']+t['wrp']
# get player "return from retirement" status: did the most recent prior match end in retirement?
t['retired'] = records.loc[records.date==records.date.max(),'retired'].values[0]
# return a series
return t
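Two details of `get_dynamic_features` are worth checking on made-up numbers: the time-discount weight is capped at `p` so even very recent matches are discounted at least once, and `serveadv` uses a plus sign because features are later formed as P1 - P2 differences.

```python
# (1) Time discount: p**years, capped at p.
p = 0.8
years = [0.01, 1.0, 3.0]
wts = [min(p ** t, p) for t in years]

# (2) serveadv: (wsp1 + wrp1) - (wsp2 + wrp2) rearranges to
# (wsp1 - wrp2) - (wsp2 - wrp1), the difference of the two players'
# serve advantages as defined earlier. Hypothetical values:
wsp1, wrp1, wsp2, wrp2 = 0.70, 0.38, 0.64, 0.35
diff_of_sums = (wsp1 + wrp1) - (wsp2 + wrp2)
diff_of_serveadv = (wsp1 - wrp2) - (wsp2 - wrp1)
print(wts, diff_of_sums, diff_of_serveadv)
```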
The get_dynamic_features function needs to be called twice for each row of the dataset, once for each player. This wrapper function orchestrates the calls.
In [1907]:
def dynamic_feature_wrapper(x):
"""
calls "get_dynamic_features" to extract dynamic features for each player
"""
pids = x[['lid','wid']]
y = x.copy()
# get Player1 info
y['pid'] = pids.iloc[int(y['outcome'])]
P1_features = get_dynamic_features(y)
# get Player0 info
y['pid'] = pids.iloc[1-int(y['outcome'])]
P2_features = get_dynamic_features(y)
# features are differences
features = P1_features - P2_features
# record the key for subsequent merging
features['key'] = y['key']
return features
Insert a binary column outcome into data, randomly assigning the winner and loser "Player 1" and "Player 0" status.
In [ ]:
data['outcome']=np.random.choice([0,1],size=len(data))
Now get the features
In [ ]:
import timeit
start_time = timeit.default_timer()
# extract and rename some columns of the data, for compatibility with the get_dynamic_features function
x=data[['tourney_id','match_num','tourney_date','key','winner_id','loser_id','surface','outcome']].copy()
x.rename(columns={'tourney_id':'tid','match_num':'mid','tourney_date':'date',\
'winner_id':'wid','loser_id':'lid'},inplace=True)
# get both static and dynamic features
s_features=get_static_features(data)
d_features = x.apply(dynamic_feature_wrapper,axis=1)
elapsed = timeit.default_timer() - start_time
Merge static and dynamic features
In [1946]:
features = pd.merge(s_features,d_features,how='inner',on='key')
Write features to disk.
In [1951]:
pd.to_pickle(features,pickle_dir + 'features.pkl')
features.to_csv(pickle_dir+'features.csv')
Scope out the distribution of the features
In [1932]:
features.loc[features.outcome==0,'retired'].mean()
Out[1932]:
In [1945]:
# define figure and axes
fig = plt.figure(figsize=(15,5))
ax0 = fig.add_subplot(121)
ax1 = fig.add_subplot(122)
ax0.hist(features['rank'].dropna())
ax0.set_title('Diff. in rank')
ax1.hist(features.rankpts.dropna())
ax1.set_title('Diff in rank pts')
Out[1945]:
Proposed features:
- upsets (losing when higher ranked, winning when lower ranked)
Parameters to solve for:
Take-aways: