Extracting features
Author: Carl Toews
File: extract_features.ipynb
In order to implement machine learning algorithms, we need to develop a set of informative features. Following Machine learning for predicting professional tennis matches [Sipko, 2015], for each match we assign one player to be "Player 0" and the other to be "Player 1", and call the outcome a 0 if Player 0 won and a 1 otherwise. For each match, we produce a set of features, each a measure of the difference between some characteristic of the two players. The characteristics we consider include the following (many of these are from Sipko): rank, rankpts (rank points), height, hand, fsp (first serve percentage), wfs (winning first serve percentage), wss (winning second serve percentage), wsp (overall winning on serve percentage), wrp (winning on return percentage), tpw (total points won percentage), acpg (aces per game), dfpg (double faults per game), bps (break points saved percentage), tmw, retired (True if 1st match back since retirement), fatigue, complete, serveadv, and direct.
Note that rank, rankpts, height, and hand can be read from the player data directly with no calculation. On the other hand, fsp, wfs, wss, wsp, wrp, tpw, acpg, dfpg, and bps can be calculated for any given match, but for purposes of predicting a future match they need to be averaged over the historical record. Finally, tmw, retired, fatigue, complete, serveadv, and direct are not calculated match-by-match, but rather derived from the historical record.
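To make the encoding concrete, here is a minimal sketch of the sign convention used for every feature below (the helper name signed_diff is invented purely for illustration): each feature is always "Player 1 minus Player 0", regardless of which of the two happens to be the winner.
# illustrative only: signed difference between the two players' characteristics
# outcome = 1 if the winner was labeled Player 1, 0 if the winner was Player 0
def signed_diff(winner_stat, loser_stat, outcome):
    # (loser - winner) * (-1)**outcome always equals (Player 1 - Player 0)
    return (loser_stat - winner_stat) * (-1)**outcome
signed_diff(10, 25, outcome=1)   # winner (rank 10) is Player 1: returns -15
signed_diff(10, 25, outcome=0)   # winner (rank 10) is Player 0: returns  15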
In order to derive these features, we'll first need to clean the data a bit. Specifically, we need to deal with missing or null values and rectify incorrect values. Examples of issues handled below include irregular score strings (retirements, walkovers, and other unusual outcomes) and bad entries for the number of service games.
Import statements
In [474]:
import sqlalchemy # pandas-mysql interface library
import sqlalchemy.exc # exception handling
from sqlalchemy import create_engine # needed to define db interface
import sys # for defining behavior under errors
import numpy as np # numerical libraries
import scipy as sp
import pandas as pd # for data analysis
import pandas.io.sql as sql # for interfacing with MySQL database
import matplotlib as mpl # a big library with plotting functionality
import matplotlib.pyplot as plt # a subset of matplotlib with most of the useful tools
import IPython as IP
%matplotlib inline
import pdb
#%qtconsole
Player and odds data from 2010-2016 have been matched and stored. Retrieve, merge, and rename.
In [973]:
pickle_dir = '../pickle_files/'
odds_file = 'odds.pkl'
matches_file = 'matches.pkl'
odds= pd.read_pickle(pickle_dir + odds_file)
matches= pd.read_pickle(pickle_dir + matches_file)
data = pd.merge(matches,odds[['PSW','PSL','key']],how='inner',on='key')
Get additional training data. We'll include data from 2005 through 2016, excluding Davis Cup matches, and focusing exclusively on players who appear in our 2010-2016 set.
In [1710]:
# name of database
db_name = "tennis"
# name of db user
username = "testuser"
# db password for db user
password = "test623"
# location of atp data files
atpfile_directory = "../data/tennis_atp-master/"
# date range for the historical training data; Davis Cup matches are excluded in the query below
startdate = '20050101'
enddate = '20161231'
engine = create_engine('mysql+mysqldb://' + username + ':' + password + '@localhost/' + db_name)
# get unique winners and losers in our set
players = tuple(pd.concat((data.winner_id,data.loser_id)).unique())
# load all data pertinent to any player
with engine.begin() as connection:
matches_hist = pd.read_sql_query("""SELECT * FROM matches \
WHERE tourney_date >= '""" + startdate + """' \
AND tourney_date <= '""" + enddate + """' \
AND (winner_id IN %(p)s \
OR loser_id IN %(p)s) \
AND tourney_name NOT LIKE 'Davis%%';""",connection,params={'p':players})
Many of the features we will develop will involve quantities derived from previous matches. Examples include weighted historical averages of winning first serves, double faults, etc. This section calculates some important quantities for each match.
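As a preview of the averaging scheme (a sketch only; the actual weighting appears in get_dynamic_features below, with discount parameter p = 0.8), each historical match is down-weighted exponentially by its age in years, with the weight capped at p:
import numpy as np
def time_weight(years_ago, p=0.8):
    # a match played t years before the match being predicted gets weight p**t,
    # with weights for very recent matches (t < 1) capped at p
    w = p**np.asarray(years_ago, dtype=float)
    return np.minimum(w, p)
time_weight([0.1, 1.0, 5.0])   # -> array([0.8, 0.8, 0.32768])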
Insert index on matches_hist
In [1711]:
matches_hist['key'] = np.arange(len(matches_hist))
Extract text string indicating unusual match outcomes
In [1712]:
# scores are just numbers, unless something weird happened. Extract comments about irregular outcomes.
t=matches_hist.score.str.extractall('(?P<comment>[a-zA-Z]+.+)').xs(0,level='match')
matches_hist = pd.merge(matches_hist,t,how='outer',left_index=True, right_index=True)
matches_hist.comment.unique()
Out[1712]:
Calculate games scores for each set, store in a separate dataframe
In [1713]:
# discard comments and trailing white space
scores = matches_hist.score.str.replace('(?P<comment>[a-zA-Z]+.+)','')
scores = scores.str.replace('(?P<comment>\([0-9]+\))','').str.strip()
# split the game scores into columns of a dataframe
scores = scores.str.split('-|\s',expand=True)
scores.columns=['W1','L1','W2','L2','W3','L3','W4','L4','W5','L5']
scores = scores.apply(lambda x: pd.to_numeric(x))
Store the number of games played in each match.
In [1714]:
ngames = np.sum(scores,axis=1)
matches_hist.insert(0,'ngames',ngames.astype('int'))
It seems a few matches were cut short for no recorded reason:
In [1715]:
# sanity check: are matches involving fewer than 12 games exactly those with comments?
idx1 = (ngames<12)
idx2 = matches_hist.comment.notnull()
z=(idx1*1)*(idx2*1-1)
zz = np.where(np.abs(z))[0]
print("matches with weird outcomes: ")
print(matches_hist.loc[zz,'score'])
Calculate first serve percentages (fsp) for both winners and losers.
In [1716]:
matches_hist.insert(0,'w_fsp',matches_hist.w_1stIn/matches_hist.w_svpt)
matches_hist.insert(0,'l_fsp',matches_hist.l_1stIn/matches_hist.l_svpt)
Calculate winning first serve percentages (wfs)
In [1717]:
matches_hist.insert(0,'w_wfs',matches_hist.w_1stWon/matches_hist.w_svpt)
matches_hist.insert(0,'l_wfs',matches_hist.l_1stWon/matches_hist.l_svpt)
Calculate second serves in (2ndIn) for both winners and losers
In [1718]:
matches_hist.insert(0,'w_2ndIn',matches_hist.w_svpt-matches_hist.w_df-matches_hist.w_1stIn)
matches_hist.insert(0,'l_2ndIn',matches_hist.l_svpt-matches_hist.l_df-matches_hist.l_1stIn)
Calculate second serve (ssp) percentages
In [1719]:
matches_hist.insert(0,'w_ssp',matches_hist.w_2ndIn/(matches_hist.w_2ndIn+matches_hist.w_df))
matches_hist.insert(0,'l_ssp',matches_hist.l_2ndIn/(matches_hist.l_2ndIn+matches_hist.l_df))
Calculate winning second serve percentages (wss)
In [1720]:
matches_hist.insert(0,'w_wss',matches_hist.w_2ndWon/matches_hist.w_2ndIn)
matches_hist.insert(0,'l_wss',matches_hist.l_2ndWon/matches_hist.l_2ndIn)
Calculate overall win on serve percentages (wsp)
In [1860]:
#matches_hist.insert(0,'w_wsp',(matches_hist.w_1stWon + matches_hist.w_2ndWon)/matches_hist.w_svpt)
#matches_hist.insert(0,'l_wsp',(matches_hist.l_1stWon+matches_hist.l_2ndWon)/matches_hist.l_svpt)
matches_hist['w_wsp']=(matches_hist.w_1stWon + matches_hist.w_2ndWon)/matches_hist.w_svpt
matches_hist['l_wsp']=(matches_hist.l_1stWon+matches_hist.l_2ndWon)/matches_hist.l_svpt
Calculate winning on return percentages (wrp).
$$ wrp = \frac{(\text{\# of opponent serve points}) - (\text{\# of opponent serve points won})}{\text{\# of opponent serve points}} $$
In [1722]:
matches_hist.insert(0,'w_wrp',(matches_hist.l_svpt - matches_hist.l_1stWon \
- matches_hist.l_2ndWon)/(matches_hist.l_svpt))
matches_hist.insert(0,'l_wrp',(matches_hist.w_svpt - matches_hist.w_1stWon \
- matches_hist.w_2ndWon)/(matches_hist.w_svpt))
Calculate total points won percentage (tpw)
In [1723]:
matches_hist.insert(0,'w_tpw',(matches_hist.l_svpt\
-matches_hist.l_1stWon-matches_hist.l_2ndWon\
+matches_hist.w_1stWon +matches_hist.w_2ndWon)/\
(matches_hist.l_svpt + matches_hist.w_svpt))
matches_hist.insert(0,'l_tpw',(matches_hist.w_svpt\
-matches_hist.w_1stWon-matches_hist.w_2ndWon\
+matches_hist.l_1stWon +matches_hist.l_2ndWon)/\
(matches_hist.l_svpt + matches_hist.w_svpt))
Calculate double faults per game (dfpg)
There are a couple of bad entries for SvGms. We'll first fix those. (Note: the fix evenly partitions total games between both players. This could result in a fractional number of service games. My hunch is the total effect on the stats is minimal.)
In [1828]:
idx = np.where(((matches_hist.w_SvGms == 0)|(matches_hist.l_SvGms==0)) & (matches_hist.ngames >1))
print(matches_hist.loc[idx[0],['w_df','l_df','w_SvGms','l_SvGms','score','ngames']])
matches_hist.loc[idx[0],'w_SvGms'] = matches_hist.ngames[idx[0]]/2
matches_hist.loc[idx[0],'l_SvGms'] = matches_hist.ngames[idx[0]]/2
print(matches_hist.loc[idx[0],['w_df','l_df','w_SvGms','l_SvGms','score','ngames']])
In [1851]:
matches_hist.insert(0,'w_dfpg',matches_hist.w_df/matches_hist.w_SvGms)
matches_hist.insert(0,'l_dfpg',matches_hist.l_df/matches_hist.l_SvGms)
#matches_hist['w_dfpg']=matches_hist.w_df/matches_hist.w_SvGms
#matches_hist['l_dfpg']=matches_hist.l_df/matches_hist.l_SvGms
Calculate aces per game (acpg)
In [1855]:
matches_hist.insert(0,'w_acpg',matches_hist.w_ace/matches_hist.w_SvGms)
matches_hist.insert(0,'l_acpg',matches_hist.l_ace/matches_hist.l_SvGms)
#matches_hist['w_acpg']=matches_hist.w_ace/matches_hist.w_SvGms
#matches_hist['l_acpg']=matches_hist.l_ace/matches_hist.l_SvGms
Calculate break points saved percentage (bps)
In [ ]:
matches_hist.insert(0,'w_bps',matches_hist.w_bpSaved/matches_hist.w_bpFaced)
matches_hist.insert(0,'l_bps',matches_hist.l_bpSaved/matches_hist.l_bpFaced)
Flag matches with premature closure, probably due to injury (retired)
In [1727]:
matches_hist.insert(0,'retired',0)
matches_hist.loc[(matches_hist.comment=='RET'),'retired']=1
Flag matches decided by walkover (walkover)
In [1728]:
matches_hist.insert(0,'walkover',0)
matches_hist.loc[(matches_hist.comment=='W/O'),'walkover']=1
Calculate player completeness (complete), defined as $$ wsp \times wrp $$
Note: might be more useful to aggregate first, rather than compute on a per-match basis.
In [1738]:
matches_hist.insert(0,'w_complete',matches_hist.w_wsp*matches_hist.w_wrp)
matches_hist.insert(0,'l_complete',matches_hist.l_wsp*matches_hist.l_wrp)
Calculate player service advantage (serveadv), defined as
$$ wsp_1 - wrp_2 $$
Note: as with complete, it might be more useful to aggregate first.
In [1739]:
matches_hist.insert(0,'w_serveadv',matches_hist.w_wsp-matches_hist.l_wrp)
matches_hist.insert(0,'l_serveadv',matches_hist.l_wsp-matches_hist.w_wrp)
Sanity check: investigate calculated quantities
In [1861]:
idx = matches_hist.comment.isnull()
labels = ['dfpg', 'acpg', 'tpw', 'wrp', 'wsp', 'wss', 'wfs', 'fsp', 'ssp', 'bps','complete']
for label in labels:
    printstr = label + ": max for winner/loser is {:5.2f}/{:5.2f}, min for winner/loser is {:5.2f}/{:5.2f}"
    # index the winner/loser columns directly rather than via eval
    v1 = matches_hist.loc[idx, 'w_' + label].max()
    v2 = matches_hist.loc[idx, 'l_' + label].max()
    v3 = matches_hist.loc[idx, 'w_' + label].min()
    v4 = matches_hist.loc[idx, 'l_' + label].min()
    print(printstr.format(v1, v2, v3, v4))
Extract winner and loser data separately and concatenate into dataframe all_records with uniform column names.
In [1862]:
# extract winner stats
w_records = matches_hist[['winner_id',
'tourney_date',
'tourney_id',
'match_num',
'ngames',
'key',
'w_acpg', # avg. no of aces per game
'w_dfpg', # avg no. of double faults per game
'w_tpw', # total points won
'w_wrp', # winning return percent
'w_wsp', # winning service percent
'w_wss', # winning second serve percent
'w_wfs', # winning first serve percent
'w_fsp', # good first serves percent
'w_ssp', # good second serves percent
'w_bps', # breakpoints saved percent
'retired',# 1 if loser retired prematurely
'walkover', # 1 if loser didn't show up
'surface', # 'Hard', 'Clay', or 'Grass'
'winner_age', # age
'winner_ht', # height
'winner_rank', # rank
'winner_rank_points' # rank points
]]
# rename columns
newcols = {'winner_id':'pid',
'tourney_date':'date',
'tourney_id':'tid',
'match_num':'mid',
'ngames':'ngames',
'key':'key',
'w_acpg':'acpg',
'w_dfpg':'dfpg',
'w_tpw':'tpw',
'w_wrp':'wrp',
'w_wsp':'wsp',
'w_wss':'wss',
'w_wfs':'wfs',
'w_fsp':'fsp',
'w_ssp':'ssp',
'w_bps':'bps',
'retired':'retired',
'walkover':'walkover',
'surface':'surface',
'winner_age':'age',
'winner_ht':'ht',
'winner_rank':'rank',
'winner_rank_points':'rank_points'
}
w_records = w_records.rename(columns = newcols)
# record that the outcome was a victory for these players
w_records['outcome'] = np.ones(len(w_records))
# extract loser stats
l_records = matches_hist[['loser_id',
'tourney_date',
'tourney_id',
'match_num',
'ngames',
'key',
'l_acpg', # avg. no of aces per game
'l_dfpg', # avg no. of double faults per game
'l_tpw', # total points won
'l_wrp', # winning return percent
'l_wsp', # winning service percent
'l_wss', # winning second serve percent
'l_wfs', # winning first serve percent
'l_fsp', # percent of successful first serves
'l_ssp', # percent of successful second serves
'l_bps', # percent of breakpoints saved
'retired',# 1 if loser retired prematurely
'walkover',# 1 if loser didn't show up
'surface', # 'Hard', 'Clay', or 'Grass'
'loser_age', # age
'loser_ht', # height
'loser_rank', # rank
'loser_rank_points' # rank points
]]
# rename columns
newcols = {'loser_id':'pid',
'tourney_date':'date',
'tourney_id':'tid',
'match_num':'mid',
'ngames':'ngames',
'key':'key',
'l_acpg':'acpg',
'l_dfpg':'dfpg',
'l_tpw':'tpw',
'l_wrp':'wrp',
'l_wsp':'wsp',
'l_wss':'wss',
'l_wfs':'wfs',
'l_fsp':'fsp',
'l_ssp':'ssp',
'l_bps':'bps',
'retired':'retired',
'walkover':'walkover',
'surface':'surface',
'loser_age':'age',
'loser_ht':'ht',
'loser_rank':'rank',
'loser_rank_points':'rank_points'
}
l_records = l_records.rename(columns = newcols)
# record outcome as a loss
l_records['outcome'] = np.zeros(len(l_records))
# fuse all the data into one dataframe
all_records = pd.concat([w_records,l_records]).reset_index().sort_values(['key']).replace(np.inf,np.nan)
Calculate surface weighting matrix. Note that the resulting values are quite different from Sipko's.
In [1863]:
grouped = all_records.groupby(['pid','surface'])
t=grouped['outcome'].mean()
surf_wt = t.unstack(level=-1).corr()
surf_wt
Out[1863]:
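This matrix is used later as a surface discount when averaging a player's historical stats: a past match is weighted by the correlation between its surface and the surface of the match being predicted. A minimal sketch of the lookup (the surface names here are just hypothetical example values):
# illustrative lookup, mirroring the s_wt calculation in get_dynamic_features below
past_surface, current_surface = 'Clay', 'Hard'   # hypothetical example values
s_wt = surf_wt.loc[past_surface, current_surface]
s_wt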
Function to calculate static features. (Specifically, calculate normalized rank, rankpts, age, height, and hand features for every match in the dataset. Operates on the data as a whole, rather than row by row.)
In [1869]:
def get_static_features(data):
"""
Description: returns differences of those features that don't depend on match histories
(rank, rank points, height, age, and hand).
Input: dataframe with all the match data for which features are to be calculated
Output: another dataframe of the same length, with one column per feature plus the merge identifiers
"""
# boolean, 1 means (winner,loser) are (player1,player2), 0 means the reverse
outcome = data['outcome']
# features dataframe should include merge identifiers
features = data[['tourney_id', 'match_num','tourney_date','key']].copy()
# rank (normalize)
rank=(data.loser_rank-data.winner_rank)*(-1)**outcome
features.insert(0,'rank',rank/rank.std())
# rank points (normalize)
rankpts = (data.loser_rank_points-data.winner_rank_points)*(-1)**outcome
features.insert(0,'rankpts',rankpts/rankpts.std())
# height (normalize)
height = (data.loser_ht-data.winner_ht)*(-1)**outcome
features.insert(0,'height',height/height.std())
# age (normalize)
age = (data.loser_age-data.winner_age)*(-1)**outcome
features.insert(0,'age',age/age.std())
# hand (1 for right, 0 for left)
hand = ((data.loser_hand=='R')*1-(data.winner_hand=='R')*1)*(-1)**outcome
hand.iloc[np.where((data['winner_hand']=='U')|\
(data['loser_hand']=='U'))[0]]=np.nan
features.insert(0,'hand',hand)
return features
Get dynamic features (i.e. all features that require some time averaging.)
In [1879]:
def get_dynamic_features(x):
"""
Input: a row of the dataframe. Needs to have the following fields:
pid (player id number)
tid (tournament id number)
mid (match id number)
date (match date)
surface (match surface)
Output: a Series of time- and surface-weighted averages of the per-match stats, plus the derived complete, serveadv, and retired quantities
"""
# extract identifiers and date from input row
pid = x['pid'] # player id
tid = x['tid'] # tourney id
mid = x['mid'] # match id
date = x['date']
surface = x['surface']
# extract all historical records for this player, from before this match
records = all_records.loc[(all_records.pid==pid) & (all_records.date <= date) &\
((all_records.tid != tid) | (all_records.mid != mid)),:].copy()
# get time discount factor
p = 0.8
t = (date - records.date).apply(lambda x: x.days/365)
t_wt = p**t
t_wt.loc[t_wt>p]=p
# get surface discount factor
s_wt = records.surface.apply(lambda x: surf_wt.loc[x,surface])
# get time and court weighted averages of serve and performance stats
t = records[['dfpg','acpg','tpw','wrp','wsp',\
'wss','wfs','fsp','ssp','bps']].mul(t_wt*s_wt,axis=0).sum(axis=0)/\
records[['dfpg','acpg','tpw','wrp','wsp',\
'wss','wfs','fsp','ssp','bps']].notnull().mul(t_wt*s_wt,axis=0).sum(axis=0)
if len(records)==0:
t['complete']=np.nan
t['retired']=np.nan
return t
# get player completeness
t['complete'] = t['wsp']*t['wrp']
# get player serveadvantage
t['serveadv'] = t['wsp']+t['wrp']
# get player "return from retirement" status: flag from the most recent previous match
t['retired'] = records.loc[records.date==records.date.max(),'retired'].values[0]
# return a series
return t
In [1535]:
def dynamic_feature_wrapper(x):
    """
    Calls "get_dynamic_features" for each of the two players and returns the feature differences.
    """
    pids = x[['lid','wid']].values   # [loser_id, winner_id]
    y = x.copy()
    # get Player 1 info (the winner if outcome==1, the loser otherwise)
    y['pid'] = pids[int(x['outcome'])]
    P1_features = get_dynamic_features(y)
    # get Player 0 info
    y['pid'] = pids[1-int(x['outcome'])]
    P0_features = get_dynamic_features(y)
    # features are the Player 1 minus Player 0 differences;
    # serveadv and complete are already computed per player in get_dynamic_features
    features = P1_features - P0_features
    return features
In [1870]:
data['outcome']=np.random.choice([0,1],size=len(data))
s_features=get_static_features(data)
s_features
Out[1870]:
In [1887]:
x=data[['tourney_id','match_num','tourney_date','key','winner_id','loser_id','surface','outcome']].copy()
x.rename(columns={'tourney_id':'tid','match_num':'mid','tourney_date':'date',\
'winner_id':'wid','loser_id':'lid'},inplace=True)
In [1892]:
x.iloc[0:5,:].apply(dynamic_feature_wrapper,axis=1)
#x.iloc[0:5,:]
Out[1892]:
In [1886]:
y = x.iloc[15000]
dynamic_feature_wrapper(y)
In [1883]:
x = all_records.iloc[13002]
records = get_dynamic_features(x)
records
Out[1883]:
It will help with indexing if we isolate all relevant features for Players 1 and 0 into their own dataframes. This is what the following code does.
In [408]:
# initialize dataframes to hold features for players 1 and 0
P1 = pd.DataFrame(columns=['DATE','TID','MID','PID','HAND','HT',\
'AGE','RANKPTS','RANK','ACE','DF','SVPT',\
'FSTIN','FSTWON','SNDWON','BPSAVED','BPFACED'])
P0 = pd.DataFrame(columns=['DATE','TID','MID','PID','HAND','HT',
'AGE','RANKPTS','RANK','ACE','DF','SVPT',\
'FSTIN','FSTWON','SNDWON','BPSAVED','BPFACED'])
# define a function that returns winner info if RES=1, otherwise loser info
def assign_player_1(x):
winner = pd.Series({'DATE':x['tourney_date'],\
'TID':x['tourney_id'],\
'MID':x['match_num'],\
'PID':x['winner_id'],\
'HAND':x['winner_hand'],\
'HT':x['winner_ht'],\
'AGE':x['winner_age'],\
'RANKPTS':x['winner_rank_points'],\
'RANK':x['winner_rank'],\
'ACE':x['w_ace'],\
'DF':x['w_df'],\
'SVPT':x['w_svpt'],\
'FSTIN':x['w_1stIn'],\
'FSTWON':x['w_1stWon'],\
'SNDWON':x['w_2ndWon'],\
'BPSAVED':x['w_bpSaved'],\
'BPFACED':x['w_bpFaced']})
loser = pd.Series({'DATE':x['tourney_date'],\
'TID':x['tourney_id'],\
'MID':x['match_num'],\
'PID':x['loser_id'],\
'HAND':x['loser_hand'],\
'HT':x['loser_ht'],\
'AGE':x['loser_age'],\
'RANKPTS':x['loser_rank_points'],\
'RANK':x['loser_rank'],\
'ACE':x['l_ace'],\
'DF':x['l_df'],\
'SVPT':x['l_svpt'],\
'FSTIN':x['l_1stIn'],\
'FSTWON':x['l_1stWon'],\
'SNDWON':x['l_2ndWon'],\
'BPSAVED':x['l_bpSaved'],\
'BPFACED':x['l_bpFaced']})
if x['RES']==1:
return winner
else:
return loser
# mutatis mutandis for player 0. (Note: no need to rewrite this function if I can figure
# out how to assign two outputs within an "apply" call.)
def assign_player_0(x):
winner = pd.Series({'DATE':x['tourney_date'],\
'TID':x['tourney_id'],\
'MID':x['match_num'],\
'PID':x['winner_id'],\
'HAND':x['winner_hand'],\
'HT':x['winner_ht'],\
'AGE':x['winner_age'],\
'RANKPTS':x['winner_rank_points'],\
'RANK':x['winner_rank'],\
'ACE':x['w_ace'],\
'DF':x['w_df'],\
'SVPT':x['w_svpt'],\
'FSTIN':x['w_1stIn'],\
'FSTWON':x['w_1stWon'],\
'SNDWON':x['w_2ndWon'],\
'BPSAVED':x['w_bpSaved'],\
'BPFACED':x['w_bpFaced']})
loser = pd.Series({'DATE':x['tourney_date'],\
'TID':x['tourney_id'],\
'MID':x['match_num'],\
'PID':x['loser_id'],\
'HAND':x['loser_hand'],\
'HT':x['loser_ht'],\
'AGE':x['loser_age'],\
'RANKPTS':x['loser_rank_points'],\
'RANK':x['loser_rank'],\
'ACE':x['l_ace'],\
'DF':x['l_df'],\
'SVPT':x['l_svpt'],\
'FSTIN':x['l_1stIn'],\
'FSTWON':x['l_1stWon'],\
'SNDWON':x['l_2ndWon'],\
'BPSAVED':x['l_bpSaved'],\
'BPFACED':x['l_bpFaced']})
if x['RES']==1:
return loser
else:
return winner
In [409]:
matches_hist.insert(len(matches_hist.columns),'RES',features['RES'].values)
P1=matches_hist.apply(assign_player_1,axis=1)
P0=matches_hist.apply(assign_player_0,axis=1)
Features I and II: differences of ranks (RANK) and rank points (RANKPTS)
We'll scale both differences by their standard deviations.
In [ ]:
features.insert(len(features.columns), 'RANKPTS', P1['RANKPTS']-P0['RANKPTS'])
features.insert(len(features.columns), 'RANK', P1['RANK']-P0['RANK'])
features['RANKPTS'] = features['RANKPTS']/features['RANKPTS'].std()
features['RANK'] = features['RANK']/features['RANK'].std()
In [450]:
# define figure and axes
fig = plt.figure(figsize=(15,5))
ax0 = fig.add_subplot(121)
ax1 = fig.add_subplot(122)
ax0.hist(features.RANK.dropna())
ax0.set_title('Diff. in rank')
ax1.hist(features.RANKPTS.dropna())
ax1.set_title('Diff in rank pts')
Out[450]:
Feature III: differences of first serve winning percentage
In [426]:
P1.insert(len(P1.columns),'FSWPCT',P1['FSTWON']/P1['SVPT'])
P0.insert(len(P0.columns),'FSWPCT',P0['FSTWON']/P0['SVPT'])
P1_grouped = P1.groupby('PID')
P0_grouped = P0.groupby('PID')
def extract_features(group):
mean_fswpct = group['FSWPCT'].mean()
size = len(group)
return pd.Series({'mean_fswpct':mean_fswpct,'size':size})
t1=P1_grouped.apply(extract_features).reset_index()
t0=P0_grouped.apply(extract_features).reset_index()
t2 = pd.merge(t1,t0,how='outer',on='PID')
t2 = t2.fillna(0)
t2['FSWPCT_HIST'] = (t2['mean_fswpct_x']*t2['size_x'] +\
t2['mean_fswpct_y']*t2['size_y'])/(t2['size_x']+t2['size_y'])
In [428]:
P1=pd.merge(P1,t2[['PID','FSWPCT_HIST']],how='inner',on='PID')
P0=pd.merge(P0,t2[['PID','FSWPCT_HIST']],how='inner',on='PID')
In [ ]:
features['FSWPCT']=P1['FSWPCT']-P0['FSWPCT']
In [451]:
plt.hist(features.FSWPCT.dropna())
plt.title('Diff. in first serve winning percentages')
Out[451]:
Feature IV: Height differences
In [ ]:
features['HT'] = P1['HT']-P0['HT']
features['HT'] = features['HT']/features['HT'].std()
In [452]:
plt.hist(features.HT.dropna())
plt.title('Difference in height')
Out[452]:
Feature V: Age differences
In [ ]:
features['AGE'] = P1['AGE']-P0['AGE']
features['AGE'] = features['AGE']/features['AGE'].std()
In [456]:
plt.hist(features.AGE.dropna())
plt.title('Difference in age')
Out[456]:
Proposed features:
upsets (losing when higher ranked, winning when lower ranked)
Parameters to solve for:
Take-aways:
In [444]:
P1=pd.merge(P1,t2[['PID','SSWPCT_HIST']],how='inner',on='PID')
P0=pd.merge(P0,t2[['PID','SSWPCT_HIST']],how='inner',on='PID')