Extracting features

Author: Carl Toews
File: extract_features.ipynb

Description:

In order to implement machine learning algorithms, we need to develop a set of informative features. Following Machine learning for predicting professional tennis matches [Sipko, 2015], for each match we assign one player to be "Player 0" and the other to be "Player 1", and call the outcome a 0 if Player 0 won and a 1 otherwise. For each match, we produce a set of features, each a measure of difference between some characteristic of the two players. The characteristics we consider include the following (many of these are from Sipko):

  • rank: rank
  • rankpts: rank points
  • height: height
  • hand: handedness binary (0 for right, 1 for left)
  • fsp: "first serve is valid" (percentage)
  • wfs: winning first serve (percentage)
  • wss: winning second serve (percentage )
  • wsp: winning on any serve (percentage)
  • wrp: winning on returns (percentage)
  • tpw: total points won (percentage)
  • acpg: average number of aces (per game)
  • dfpg: average number of double faults (per game)
  • bps: break points saved (percentage)
  • tmw: total matches won (percentage)
  • retired: binary (True if 1st match back since retirement)
  • fatigue: fatigue score (based on number of matches in past 3 days)
  • complete: player completeness score (Sipko)
  • serveadv: score to measure the relative advantage when serving
  • direct: head to head balance with a particular player

Note that rank, rankpts, height, and hand can be read from player data directly with no calculation. On the other hand, fsp, wfs, wss, wsp, wrp, tpw, acpg, dfpg, bps can be calculated for any given match, but for purposes of predicting a future match, need to be averaged over the historical record. Finally, tmw, retired, fatigue, complete, serveadv, and direct are not calculated "match-by-match", but rather derived from the historical record.

In order to derive these features, we'll first need to clean the data a bit. Specifically, we need to deal with missing or null values, as well as rectify incorrect values. Examples of issues include:

  • The 'score' column contains inconsistent strings for indicating irregular outcomes
  • Many matches don't include statistics such as ace rates, double faults, etc.

I. Data extraction

Import statements


In [474]:
import sqlalchemy # pandas-mysql interface library
import sqlalchemy.exc # exception handling
from   sqlalchemy import create_engine  # needed to define db interface
import sys # for defining behavior under errors
import numpy as np # numerical libraries
import scipy as sp
import pandas as pd # for data analysis
import pandas.io.sql as sql # for interfacing with MySQL database
import matplotlib as mpl # a big library with plotting functionality
import matplotlib.pyplot as plt # a subset of matplotlib with most of the useful tools
import IPython as IP
%matplotlib inline 
import pdb
#%qtconsole

Player and odds data from 2010-2016 has beeen matched and stored. Retrieve, merge, and rename.


In [973]:
pickle_dir = '../pickle_files/'
odds_file = 'odds.pkl'
matches_file = 'matches.pkl'

odds= pd.read_pickle(pickle_dir + odds_file)
matches= pd.read_pickle(pickle_dir + matches_file)
data = pd.merge(matches,odds[['PSW','PSL','key']],how='inner',on='key')

Get additional training data. We'll include data from 2005, excluding Davis Cup matches, and focusing exclusively on players who played in our 2010-2016 set.


In [1710]:
# name of database
db_name = "tennis"
# name of db user
username = "testuser"
# db password for db user
password = "test623"
# location of atp data files
atpfile_directory = "../data/tennis_atp-master/"

# focus on most recent data; exclude Davis Cup stuff
startdate = '20050101'
enddate = '20161231'

engine = create_engine('mysql+mysqldb://' + username + ':' + password + '@localhost/' + db_name)
    
# get unique winners and losers in our set
players = tuple(pd.concat((data.winner_id,data.loser_id)).unique())

# load all data pertinant for any player
with engine.begin() as connection:
    matches_hist = pd.read_sql_query("""SELECT * FROM matches \
    WHERE tourney_date >= '""" + startdate + """' \
    AND tourney_date <= '""" + enddate + """' \
    AND (winner_id IN %(p)s \
    OR loser_id IN %(p)s) \
    AND tourney_name NOT LIKE 'Davis%%';""",connection,params={'p':players})

II. Calculate match statistics

Many of the features we will develop will involve quantities derived from previous matches. Examples include weighted historical averages of winning first serves, double faults, etc. This section calculates some important quantities for each match.

Insert index on matches_hist


In [1711]:
matches_hist['key'] = np.arange(len(matches_hist))

Extract text string indicating unusual match outcomes


In [1712]:
# scores are just numbers, unless something weird happened.  Extract comments about irregular outcomes.  
t=matches_hist.score.str.extractall('(?P<comment>[a-zA-Z]+.+)').xs(0,level='match')
matches_hist = pd.merge(matches_hist,t,how='outer',left_index=True, right_index=True)
matches_hist.comment.unique()


Out[1712]:
array([nan, 'W/O', 'RET', 'DEF', 'Jun'], dtype=object)

Calculate games scores for each set, store in a separate dataframe


In [1713]:
# discard comments and trailing white space
scores = matches_hist.score.str.replace('(?P<comment>[a-zA-Z]+.+)','')
scores = scores.str.replace('(?P<comment>\([0-9]+\))','').str.strip()

# split the game scores into columns of a dataframe
scores = scores.str.split('-|\s',expand=True)
scores.columns=['W1','L1','W2','L2','W3','L3','W4','L4','W5','L5']
scores = scores.apply(lambda x: pd.to_numeric(x))

Store the number of games played in each match.


In [1714]:
ngames = np.sum(scores,axis=1)
matches_hist.insert(0,'ngames',ngames.astype('int'))

It seems a few matches were cut short for no recorded reason:


In [1715]:
# sanity check:  are matches involving few than 12 games identical to those with commments?
idx1 = (ngames<12)
idx2 = matches_hist.comment.notnull()
z=(idx1*1)*(idx2*1-1)
zz = np.where(np.abs(z))[0]
print("matches with weird outcomes:  ")
print(matches_hist.loc[zz,'score'])


matches with weird outcomes:  
6018     6-3 1-1
7245     6-3 1-0
7286        None
7466         4-0
7937     6-3 1-0
19303        4-1
Name: score, dtype: object

Calcuate first serve percentages (fsp) for both winners and losers.


In [1716]:
matches_hist.insert(0,'w_fsp',matches_hist.w_1stIn/matches_hist.w_svpt)
matches_hist.insert(0,'l_fsp',matches_hist.l_1stIn/matches_hist.l_svpt)
matches_hist.loc[matches_hist.]

Calculate winning first serve percentages (wfs)


In [1717]:
matches_hist.insert(0,'w_wfs',matches_hist.w_1stWon/matches_hist.w_svpt)
matches_hist.insert(0,'l_wfs',matches_hist.l_1stWon/matches_hist.l_svpt)

Calculate second serves in (2ndIn) for both winners and losers


In [1718]:
matches_hist.insert(0,'w_2ndIn',matches_hist.w_svpt-matches_hist.w_df-matches_hist.w_1stIn)
matches_hist.insert(0,'l_2ndIn',matches_hist.l_svpt-matches_hist.l_df-matches_hist.l_1stIn)

Calculate second serve (ssp) percentages


In [1719]:
matches_hist.insert(0,'w_ssp',matches_hist.w_2ndIn/(matches_hist.w_2ndIn+matches_hist.w_df))
matches_hist.insert(0,'l_ssp',matches_hist.l_2ndIn/(matches_hist.l_2ndIn+matches_hist.l_df))

Calculate wining second serve percentages (wss)


In [1720]:
matches_hist.insert(0,'w_wss',matches_hist.w_2ndWon/matches_hist.w_2ndIn)
matches_hist.insert(0,'l_wss',matches_hist.l_2ndWon/matches_hist.l_2ndIn)

Calculate overall win on serve percentages (wsp)


In [1860]:
#matches_hist.insert(0,'w_wsp',(matches_hist.w_1stWon + matches_hist.w_2ndWon)/matches_hist.w_svpt)
#matches_hist.insert(0,'l_wsp',(matches_hist.l_1stWon+matches_hist.l_2ndWon)/matches_hist.l_svpt)
matches_hist['w_wsp']=(matches_hist.w_1stWon + matches_hist.w_2ndWon)/matches_hist.w_svpt
matches_hist['l_wsp']=(matches_hist.l_1stWon+matches_hist.l_2ndWon)/matches_hist.l_svpt

Calculate winning on return percentages (wrp).

[(# of opponent serves) - (# of opponent service victories)]/(# of opponent serves)


In [1722]:
matches_hist.insert(0,'w_wrp',(matches_hist.l_svpt - matches_hist.l_1stWon \
                               - matches_hist.l_2ndWon)/(matches_hist.l_svpt))
matches_hist.insert(0,'l_wrp',(matches_hist.w_svpt - matches_hist.w_1stWon \
                              - matches_hist.w_2ndWon)/(matches_hist.w_svpt))

Calculate total points won percentage (tpw)


In [1723]:
matches_hist.insert(0,'w_tpw',(matches_hist.l_svpt\
                          -matches_hist.l_1stWon-matches_hist.l_2ndWon\
                          +matches_hist.w_1stWon +matches_hist.w_2ndWon)/\
                        (matches_hist.l_svpt + matches_hist.w_svpt))
matches_hist.insert(0,'l_tpw',(matches_hist.w_svpt\
                        -matches_hist.w_1stWon-matches_hist.w_2ndWon\
                        +matches_hist.l_1stWon +matches_hist.l_2ndWon)/\
                        (matches_hist.l_svpt + matches_hist.w_svpt))

Calculate double faults per game (dfpg), per game

There are a couple of bad entries for SvGms. We'll first fix those. (Note: the fix evenly partions total games between both players. This could result in a fractional number of service games. My hunch is the total effect on the stats is minimal.)


In [1828]:
idx = np.where(((matches_hist.w_SvGms == 0)|(matches_hist.l_SvGms==0)) & (matches_hist.ngames >1))
print(matches_hist.loc[idx[0],['w_df','l_df','w_SvGms','l_SvGms','score','ngames']])
matches_hist.loc[idx[0],'w_SvGms'] = matches_hist.ngames[idx[0]]/2
matches_hist.loc[idx[0],'l_SvGms'] = matches_hist.ngames[idx[0]]/2     
print(matches_hist.loc[idx[0],['w_df','l_df','w_SvGms','l_SvGms','score','ngames']])


       w_df  l_df  w_SvGms  l_SvGms        score  ngames
2956    4.0   2.0      0.0      0.0      6-3 6-3      18
9664    2.0  12.0     16.0      0.0  6-3 6-4 7-5      31
12770   0.0   0.0      0.0      4.0  6-2 6-2 6-1      23
18690   3.0   0.0     12.0      0.0      7-5 7-5      24
18695   1.0   0.0      9.0      0.0      6-3 6-3      18
       w_df  l_df  w_SvGms  l_SvGms        score  ngames
2956    4.0   2.0      9.0      9.0      6-3 6-3      18
9664    2.0  12.0     15.5     15.5  6-3 6-4 7-5      31
12770   0.0   0.0     11.5     11.5  6-2 6-2 6-1      23
18690   3.0   0.0     12.0     12.0      7-5 7-5      24
18695   1.0   0.0      9.0      9.0      6-3 6-3      18

In [1851]:
matches_hist.insert(0,'w_dfpg',matches_hist.w_df/matches_hist.w_SvGms)
matches_hist.insert(0,'l_dfpg',matches_hist.l_df/matches_hist.l_SvGms)
#matches_hist['w_dfpg']=matches_hist.w_df/matches_hist.w_SvGms
#matches_hist['l_dfpg']=matches_hist.l_df/matches_hist.l_SvGms

Calculate aces per game (acpg), per game


In [1855]:
matches_hist.insert(0,'w_acpg',matches_hist.w_ace/matches_hist.w_SvGms)
matches_hist.insert(0,'l_acpg',matches_hist.l_ace/matches_hist.l_SvGms)
#matches_hist['w_acpg']=matches_hist.w_ace/matches_hist.w_SvGms
#matches_hist['l_acpg']=matches_hist.l_ace/matches_hist.l_SvGms

Calculate break points saved percentage (bpsp)


In [ ]:
matches_hist.insert(0,'w_bps',matches_hist.w_bpSaved/matches_hist.w_bpFaced)
matches_hist.insert(0,'l_bps',matches_hist.l_bpSaved/matches_hist.l_bpFaced)

Flag games with premature closure, probably due to injury (retired)


In [1727]:
matches_hist.insert(0,'retired',0)
matches_hist.loc[(matches_hist.comment=='RET'),'retired']=1

Flag games won as a walkover (wo)


In [1728]:
matches_hist.insert(0,'walkover',0)
matches_hist.loc[(matches_hist.comment=='W/O'),'walkover']=1

Calculate player completeness (complete), defined as $$ wsp \times wrp $$

Note: might be more useful to aggregate first, rather than compute on a per-match basis.


In [1738]:
matches_hist.insert(0,'w_complete',matches_hist.w_wsp*matches_hist.w_wrp)
matches_hist.insert(0,'l_complete',matches_hist.l_wsp*matches_hist.l_wrp)

Calculate player service advantage (serveadv), defined as

$$ wsp_1 -wrp_2 $$

Note: as with complete, it might be more useful to aggregated first


In [1739]:
matches_hist.insert(0,'w_serveadv',matches_hist.w_wsp-matches_hist.l_wrp)
matches_hist.insert(0,'l_serveadv',matches_hist.l_wsp-matches_hist.w_wrp)

Sanity check: investigate calculated quantities


In [1861]:
idx = matches_hist.comment.isnull()
labels = ['dfpg', 'acpg', 'tpw', 'wrp', 'wsp', 'wss', 'wfs', 'fsp', 'ssp', 'bps','complete']
for label in labels:
    printstr = label + ": max for winner/loser is {:5.2f}/{:5.2f}, min for winner/loser is {:5.2f}/{:5.2f}"
    v1 = eval('matches_hist.w_' + label + '[idx].max()')
    v2 = eval('matches_hist.l_' + label + '[idx].max()')
    v3 = eval('matches_hist.w_' + label + '[idx].min()')
    v4 = eval('matches_hist.l_' + label + '[idx].min()')
    print(printstr.format(v1,v2,v3,v4))


dfpg: max for winner/loser is  1.44/ 2.00, min for winner/loser is  0.00/ 0.00
acpg: max for winner/loser is  2.80/ 2.55, min for winner/loser is  0.00/ 0.00
tpw: max for winner/loser is  0.79/ 0.56, min for winner/loser is  0.44/ 0.21
wrp: max for winner/loser is  0.82/ 0.53, min for winner/loser is  0.15/ 0.00
wsp: max for winner/loser is  1.00/ 0.85, min for winner/loser is  0.47/ 0.18
wss: max for winner/loser is  1.00/ 1.00, min for winner/loser is  0.00/ 0.00
wfs: max for winner/loser is  0.84/ 0.74, min for winner/loser is  0.22/ 0.08
fsp: max for winner/loser is  1.00/ 1.00, min for winner/loser is  0.29/ 0.24
ssp: max for winner/loser is  1.00/ 1.00, min for winner/loser is  0.25/ 0.36
bps: max for winner/loser is  1.00/ 1.00, min for winner/loser is  0.00/ 0.00
complete: max for winner/loser is  0.63/ 0.25, min for winner/loser is  0.11/ 0.00

III. Calculate features

Each feature generally requires some calculation. We'll use the bigger data set matches_hist to drive the calculations, and store the results in a data frame called features, of the same size as data.

Extract winner and loser data separately and concatenate into dataframe all_records with uniform column names.


In [1862]:
# extract winner stats
w_records = matches_hist[['winner_id',
                          'tourney_date',
                          'tourney_id',
                          'match_num',
                          'ngames',
                          'key',
                          'w_acpg', # avg. no of aces per game
                          'w_dfpg', # avg no. of double faults per game
                          'w_tpw',  # total points won
                          'w_wrp',  # wining return percent
                          'w_wsp',  # winning service percent
                          'w_wss',  # winning second serve percent
                          'w_wfs',  # winning first serve percent
                          'w_fsp',  # good first serves percent
                          'w_ssp',  # good second serves percent
                          'w_bps',  # breakpoints saved percent 
                          'retired',# 1 if loser retired prematurely
                          'walkover', # 1 if loser didn't show up
                          'surface', # 'Hard', 'Clay', or 'Grass'
                          'winner_age',  # age
                          'winner_ht',   # height
                          'winner_rank', # rank
                          'winner_rank_points' # rank points
                         ]]

# rename columns
newcols = {'winner_id':'pid', 
           'tourney_date':'date',
           'tourney_id':'tid',
           'match_num':'mid',
           'ngames':'ngames',
           'key':'key',
           'w_acpg':'acpg',
           'w_dfpg':'dfpg',
           'w_tpw':'tpw',
           'w_wrp':'wrp',
           'w_wsp':'wsp',
           'w_wss':'wss',
           'w_wfs':'wfs',
           'w_fsp':'fsp',
           'w_ssp':'ssp',
           'w_bps':'bps',
           'retired':'retired',
           'walkover':'walkover',
           'surface':'surface',
           'winner_age':'age',  
           'winner_ht':'ht',   
           'winner_rank':'rank', 
           'winner_rank_points':'rank_points'
          }
w_records = w_records.rename(columns = newcols)

# record that the outcome was a victory for these players
w_records['outcome'] = np.ones(len(w_records))

# extract loser stats
l_records = matches_hist[['loser_id',
                          'tourney_date',
                          'tourney_id',
                          'match_num',
                          'ngames',
                          'key',
                          'l_acpg', # avg. no of aces per game
                          'l_dfpg',  # avg no. of double faults per game
                          'l_tpw',  # total points won
                          'l_wrp',  # wining return percent
                          'l_wsp',  # winning service percent
                          'l_wss',  # winning second serve percent
                          'l_wfs',  # winning first serve percent
                          'l_fsp',  # percent of successful first serves
                          'l_ssp',  # percent of successful second serves
                          'l_bps',  # percent of breakpoints saved  
                          'retired',# 1 if loser retired prematurely
                          'walkover',# 1 if loser didn't show up
                          'surface', # 'Hard', 'Clay', or 'Grass'
                          'loser_age',  # age
                          'loser_ht',   # height
                          'loser_rank', # rank
                          'loser_rank_points' # rank points
                         ]]

# rename columns
newcols = {'loser_id':'pid', 
           'tourney_date':'date',
           'tourney_id':'tid',
           'match_num':'mid',
           'ngames':'ngames',
           'key':'key',
           'l_acpg':'acpg',
           'l_dfpg':'dfpg',
           'l_tpw':'tpw',
           'l_wrp':'wrp',
           'l_wsp':'wsp',
           'l_wss':'wss',
           'l_wfs':'wfs',
           'l_fsp':'fsp',
           'l_ssp':'ssp',
           'l_bps':'bps', 
           'retired':'retired',
           'walkover':'walkover',
           'surface':'surface',
           'loser_age':'age',  
           'loser_ht':'ht',   
           'loser_rank':'rank', 
           'loser_rank_points':'rank_points'
          }
l_records = l_records.rename(columns = newcols)

# record outcome as a loss
l_records['outcome'] = np.zeros(len(w_records))

# fuse all the data into one dataframe
all_records = pd.concat([w_records,l_records]).reset_index().sort_values(['key']).replace(np.inf,np.nan)

Calculate surface weighting matrix. Note that the resulting values are quite different from Sipko's.


In [1863]:
grouped = all_records.groupby(['pid','surface'])
t=grouped['outcome'].mean()
surf_wt = t.unstack(level=-1).corr()
surf_wt


Out[1863]:
surface Carpet Clay Grass Hard
surface
Carpet 1.000000 0.342744 0.295605 0.506449
Clay 0.342744 1.000000 0.363446 0.466821
Grass 0.295605 0.363446 1.000000 0.548772
Hard 0.506449 0.466821 0.548772 1.000000

Function to calculate static features. (Specifically, calculate normalized rank, rankpts, age, height, hand features for every match in the dataset. Operates on the data as a whole, rather than by line.)


In [1869]:
def  get_static_features(data):
    """
    Description: returns differences of those features that don't depend on match histories. 
    (Rank, Rankpoints, Height, Hand)
    
    Input:  dataframe with all the match data for which features are to be calculated
    Output:  another dataframe of the same length but only four columns, each with one feature
    """
    
    # boolean, 1 means (winner,loser) are (player1,player2), 0 means the reverse
    outcome = data['outcome'] 
    
    # features dataframe should include merge identifiers
    features = data[['tourney_id', 'match_num','tourney_date','key']].copy()
    
    # rank (normalize)
    rank=(data.loser_rank-data.winner_rank)*(-1)**outcome
    features.insert(0,'rank',rank/rank.std())

    # rank points (normalize)
    rankpts = (data.loser_rank_points-data.winner_rank_points)*(-1)**outcome
    features.insert(0,'rankpts',rankpts/rankpts.std())

    # height (normalize)
    height = (data.loser_ht-data.winner_ht)*(-1)**outcome
    features.insert(0,'height',height/height.std())   
    
    # age (normalize)
    height = (data.loser_age-data.winner_age)*(-1)**outcome
    features.insert(0,'age',height/height.std())  
    
    # hand (1 for right, 0 for left)
    hand = ((data.loser_hand=='R')*1-(data.winner_hand=='R')*1)*(-1)**outcome
    hand.iloc[np.where((data['winner_hand']=='U')|\
                      (data['loser_hand']=='U'))[0]]=np.nan
    features.insert(0,'hand',hand)
    
    return features

Get dynamic features (i.e. all features that require some time averaging.)


In [1879]:
def get_dynamic_features(x):
    """
    Input:  a row of the dataframe.  Needs to have the following fields:
        pid (player id number)
        tid (tournament id number)
        mid (match id number)
        date (match date)
        surface (match surface)
    Output: a dataframe of two columns, one a time discount, the other a surface discount 
    
    """
        
    # extract identifiers and date from input row
    pid = x['pid'] # player id 
    tid = x['tid'] # tourney id
    mid = x['mid'] # match id
    date = x['date']
    surface = x['surface']
    
    # extract all historical records for this player, from before this match
    records = all_records.loc[(all_records.pid==pid) & (all_records.date <= date) &\
                         ((all_records.tid != tid) | (all_records.mid != mid)),:].copy()
    
    # get time discount factor
    p = 0.8 
    t = (date - records.date).apply(lambda x: x.days/365)
    t_wt = p**t
    t_wt.loc[t_wt>p]=p
    
    # get surface discount factor
    s_wt = records.surface.apply(lambda x: surf_wt.loc[x,surface])

    # get time and court weighted averages of serve and performance stats
    t = records[['dfpg','acpg','tpw','wrp','wsp',\
              'wss','wfs','fsp','ssp','bps']].mul(t_wt*s_wt,axis=0).sum(axis=0)/\
        records[['dfpg','acpg','tpw','wrp','wsp',\
              'wss','wfs','fsp','ssp','bps']].notnull().mul(t_wt*s_wt,axis=0).sum(axis=0)
 
    if len(records)==0:
        t['complete']=np.nan
        t['retired']=np.nan
        return t

    # get player completeness
    t['complete'] = t['wsp']*t['wrp']

    # get player serveadvantage
    t['serveadv'] = t['wsp']+t['wrp']    
    
    # get player "return from retirement" status
    t['retired'] = records.loc[records.date==records.date.min(),'retired'].values[0]
    
    # return a series
    return t

In [1535]:
def dynamic_feature_wrapper(x):
    """
    calls "get_dynamic_features" to extract dynamic features for each player
    """    
    
    pids = x[['lid','wid']]
    y = x.copy()
    
    # get Player1 info
    y['pid'] = pids[y['outcome']]
    P1_features = get_dynamic_features(y)
    
    # get Player0 info
    y['pid'] = pids[1-y['outcome']]
    P2_features = get_dynamic_features(y)
     
    # features are differences
    features = P1_features - P2_features
    
    # compute service advantage
    features['serveadv'] = 
        
    return

In [1870]:
data['outcome']=np.random.choice([0,1],size=len(features))
s_features=get_static_features(data)
s_features


Out[1870]:
hand age height rankpts rank tourney_id match_num tourney_date key
0 0.0 -0.941438 -1.184513 0.293678 -3.651544 2010-1536 11 2010-05-09 0
1 0.0 -1.058447 0.296128 -0.136147 0.187259 2010-1536 7 2010-05-09 1
2 0.0 0.203960 -0.690966 0.360682 -0.499356 2010-1536 20 2010-05-09 2
3 0.0 1.050396 -0.197419 0.619075 -0.460344 2010-1536 22 2010-05-09 3
4 0.0 -1.521115 1.480641 -0.000356 0.007802 2010-1536 17 2010-05-09 4
5 1.0 -0.127744 0.197419 -0.183905 0.163851 2010-1536 9 2010-05-09 5
6 0.0 1.243085 0.987094 -0.002851 0.007802 2010-1536 13 2010-05-09 6
7 0.0 0.388062 1.283223 0.078409 -0.054617 2010-1536 15 2010-05-09 7
8 1.0 -1.114805 NaN 0.561694 -2.286116 2010-1536 21 2010-05-09 8
9 0.0 -0.953783 1.283223 -0.636182 0.249678 2010-1536 3 2010-05-09 9
10 0.0 0.738551 0.197419 -0.247701 0.296493 2010-1536 18 2010-05-09 10
11 0.0 -0.838385 -0.197419 0.007128 -0.031210 2010-1536 12 2010-05-09 11
12 0.0 -1.756743 -2.270317 -0.219189 0.288690 2010-1536 1 2010-05-09 12
13 -1.0 0.977937 -1.974189 0.290826 -0.522764 2010-1536 4 2010-05-09 13
14 0.0 -0.707421 1.480641 -0.020671 0.101432 2010-1536 24 2010-05-09 14
15 1.0 -0.053137 -0.493547 -0.314705 0.522764 2010-1536 16 2010-05-09 15
16 0.0 -0.580214 -0.000000 0.240573 -0.257481 2010-1536 2 2010-05-09 16
17 0.0 0.092856 -0.987094 0.085537 -0.039012 2010-1536 10 2010-05-09 17
18 0.0 0.426707 0.296128 0.194597 -0.444739 2010-1536 19 2010-05-09 18
19 0.0 -0.620469 -0.690966 -0.130088 0.280888 2010-1536 14 2010-05-09 19
20 0.0 1.233424 -0.789675 -0.467959 0.358912 2010-1536 6 2010-05-09 20
21 0.0 -1.247379 3.553539 0.512866 -0.944096 2010-1536 23 2010-05-09 21
22 0.0 -0.806717 0.987094 -0.226673 0.234073 2010-1536 8 2010-05-09 22
23 0.0 -0.923726 -0.987094 -0.028869 0.054617 2010-1536 5 2010-05-09 23
24 0.0 -0.028447 0.690966 3.404022 -0.351110 2010-1536 25 2010-05-09 24
25 0.0 0.417582 -0.493547 0.304013 -0.296493 2010-1536 34 2010-05-09 25
26 0.0 1.211954 -0.690966 0.390620 -0.163851 2010-1536 30 2010-05-09 26
27 -1.0 -0.675753 0.789675 0.293678 -0.553974 2010-1536 32 2010-05-09 27
28 0.0 -1.510917 -0.000000 1.684369 -0.366715 2010-1536 38 2010-05-09 28
29 1.0 0.103054 NaN 0.346069 -2.161277 2010-1536 28 2010-05-09 29
... ... ... ... ... ... ... ... ... ...
18418 1.0 0.082121 -0.493547 -1.420273 0.312098 2016-M052 30 2016-02-15 18649
18419 -1.0 0.646769 NaN -0.429111 0.405727 2016-M052 29 2016-02-15 18650
18420 -1.0 0.476623 0.493547 1.313351 -0.218468 2016-M052 27 2016-02-15 18651
18421 -1.0 -0.932851 0.987094 0.000000 0.007802 2016-M052 26 2016-02-15 18652
18422 1.0 0.936071 NaN -0.083399 0.366715 2016-M052 25 2016-02-15 18653
18423 0.0 2.239271 NaN 0.853589 -0.101432 2016-M052 28 2016-02-15 18654
18424 1.0 0.153507 -0.197419 -1.442014 0.374517 2016-M052 23 2016-02-15 18655
18425 0.0 0.208254 NaN 0.170361 -0.312098 2016-M052 21 2016-02-15 18656
18426 1.0 1.648859 NaN 0.270511 -2.286116 2016-M052 24 2016-02-15 18657
18427 1.0 1.726149 -0.690966 -0.018533 0.031210 2016-M052 22 2016-02-15 18658
18428 0.0 0.351563 0.690966 -0.364602 0.733430 2016-M052 18 2016-02-15 18659
18429 -1.0 -0.484138 NaN 0.026018 -0.093629 2016-M052 19 2016-02-15 18660
18430 0.0 0.205571 NaN -0.466534 0.553974 2016-M052 17 2016-02-15 18661
18431 1.0 1.135737 -1.283223 1.228527 -0.319900 2016-M052 20 2016-02-15 18662
18432 -1.0 1.001016 NaN 1.468744 -0.475949 2016-M052 15 2016-02-15 18663
18433 -1.0 0.697759 NaN -0.040630 0.148246 2016-M052 14 2016-02-15 18664
18434 0.0 0.619396 NaN 0.037779 -0.163851 2016-M052 2 2016-02-15 18665
18435 -1.0 0.167999 0.789675 0.001782 -0.007802 2016-M052 11 2016-02-15 18666
18436 1.0 1.787874 NaN 1.002923 -2.567004 2016-M052 16 2016-02-15 18667
18437 -1.0 -0.814768 NaN -0.114050 0.436937 2016-M052 13 2016-02-15 18668
18438 0.0 -2.115821 NaN -0.081973 0.335505 2016-M052 7 2016-02-15 18669
18439 -1.0 0.386451 0.493547 -0.240573 0.179456 2016-M052 6 2016-02-15 18670
18440 0.0 -0.421876 0.296128 -0.239860 0.210666 2016-M052 8 2016-02-15 18671
18441 0.0 -0.266222 -0.000000 -0.116901 0.491554 2016-M052 12 2016-02-15 18672
18442 0.0 0.586655 0.987094 0.084112 -0.452542 2016-M052 3 2016-02-15 18673
18443 -1.0 -0.991355 NaN -0.662556 0.468147 2016-M052 1 2016-02-15 18674
18444 0.0 1.492131 NaN -0.413429 0.335505 2016-M052 9 2016-02-15 18675
18445 0.0 -0.827650 NaN 0.101575 -0.608591 2016-M052 5 2016-02-15 18676
18446 -1.0 0.480380 0.789675 0.067717 -0.234073 2016-M052 4 2016-02-15 18677
18447 0.0 -2.651485 NaN -1.515077 3.799791 2016-M052 10 2016-02-15 18678

18448 rows × 9 columns


In [1887]:
x=data[['tourney_id','match_num','tourney_date','key','winner_id','loser_id','surface','outcome']].copy()
x.rename(columns={'tourney_id':'tid','match_num':'mid','tourney_date':'date',\
                    'winner_id':'wid','loser_id':'lid'},inplace=True)

In [1892]:
x.iloc[0:5,:].apply(dynamic_feature_wrapper,axis=1)
#x.iloc[0:5,:]


Out[1892]:
dfpg acpg tpw wrp wsp wss wfs fsp ssp bps complete retired
0 0.006432 0.301983 -0.017329 -0.040883 0.006915 -0.008153 -0.021032 -0.058883 0.006722 -0.011569 -0.023750 0.0
1 -0.024968 0.122376 0.011806 -0.015830 0.042181 0.018513 0.005151 -0.051550 0.023893 0.021771 0.005523 0.0
2 -0.046246 0.185565 0.030653 0.015456 0.052124 0.045170 0.029246 -0.001637 0.011653 0.033122 0.029887 0.0
3 0.022230 -0.097171 0.042182 0.060434 0.031264 0.044102 0.026328 0.022129 -0.010285 0.042440 0.049001 0.0
4 0.003079 0.428348 0.005904 -0.015185 0.028642 0.013616 0.043240 0.031019 -0.017860 0.060902 0.001187 0.0

In [1886]:
y = matches_hist.iloc[15000]
dynamic_feature_wrapper(y)


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-1886-ef9aba4445b7> in <module>()
      1 y = matches_hist.iloc[15000]
----> 2 dynamic_feature_wrapper(y)

<ipython-input-1535-14e39c096a97> in dynamic_feature_wrapper(x)
      4     """    
      5 
----> 6     pids = x[['lid','wid']]
      7     y = x.copy()
      8 

/Users/ctoews/anaconda/envs/py3_6_2/lib/python3.6/site-packages/pandas/core/series.py in __getitem__(self, key)
    640             key = check_bool_indexer(self.index, key)
    641 
--> 642         return self._get_with(key)
    643 
    644     def _get_with(self, key):

/Users/ctoews/anaconda/envs/py3_6_2/lib/python3.6/site-packages/pandas/core/series.py in _get_with(self, key)
    681                     # handle the dup indexing case (GH 4246)
    682                     if isinstance(key, (list, tuple)):
--> 683                         return self.loc[key]
    684 
    685                     return self.reindex(key)

/Users/ctoews/anaconda/envs/py3_6_2/lib/python3.6/site-packages/pandas/core/indexing.py in __getitem__(self, key)
   1326         else:
   1327             key = com._apply_if_callable(key, self.obj)
-> 1328             return self._getitem_axis(key, axis=0)
   1329 
   1330     def _is_scalar_access(self, key):

/Users/ctoews/anaconda/envs/py3_6_2/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1539                     raise ValueError('Cannot index with multidimensional key')
   1540 
-> 1541                 return self._getitem_iterable(key, axis=axis)
   1542 
   1543             # nested tuple slicing

/Users/ctoews/anaconda/envs/py3_6_2/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
   1079     def _getitem_iterable(self, key, axis=0):
   1080         if self._should_validate_iterable(axis):
-> 1081             self._has_valid_type(key, axis)
   1082 
   1083         labels = self.obj._get_axis(axis)

/Users/ctoews/anaconda/envs/py3_6_2/lib/python3.6/site-packages/pandas/core/indexing.py in _has_valid_type(self, key, axis)
   1416 
   1417                 raise KeyError("None of [%s] are in the [%s]" %
-> 1418                                (key, self.obj._get_axis_name(axis)))
   1419 
   1420             return True

KeyError: "None of [['lid', 'wid']] are in the [index]"

In [1883]:
x = all_records.iloc[13002]
records = get_dynamic_features(x)
records


Out[1883]:
dfpg        0.207208
acpg        0.619216
tpw         0.521576
wrp         0.378473
wsp         0.669557
wss         0.593572
wfs         0.470833
fsp         0.632490
ssp         0.910408
bps         0.644857
complete    0.253410
retired     0.000000
dtype: float64

It will be help with indexing if we isolate all relevant features for players 1 and 0 into their own dataframes. This is what the following code does.


In [408]:
# initialize dataframes to hold features for players 1 and 0
P1 = pd.DataFrame(columns=['DATE','TID','MID','PID','HAND','HT',\
                           'AGE','RANKPTS','RANK','ACE','DF','SVPT',\
                           'FSTIN','FSTWON','SNDWON','BPSAVED','BPFACED'])
P0 = pd.DataFrame(columns=['DATE','TID','MID','PID','HAND','HT',
                           'AGE','RANKPTS','RANK','ACE','DF','SVPT',\
                           'FSTIN','FSTWON','SNDWON','BPSAVED','BPFACED'])

# define a function that returns winner info if RES=1, otherwise loser info
def assign_player_1(x):
  
    winner = pd.Series({'DATE':x['tourney_date'],\
                        'TID':x['tourney_id'],\
                         'MID':x['match_num'],\
                         'PID':x['winner_id'],\
                         'HAND':x['winner_hand'],\
                         'HT':x['winner_ht'],\
                         'AGE':x['winner_age'],\
                         'RANKPTS':x['winner_rank_points'],\
                         'RANK':x['winner_rank'],\
                         'ACE':x['w_ace'],\
                         'DF':x['w_df'],\
                         'SVPT':x['w_svpt'],\
                         'FSTIN':x['w_1stIn'],\
                         'FSTWON':x['w_1stWon'],\
                         'SNDWON':x['w_2ndWon'],\
                         'BPSAVED':x['w_bpSaved'],\
                         'BPFACED':x['w_bpFaced']})
    
    loser = pd.Series({'DATE':x['tourney_date'],\
                       'TID':x['tourney_id'],\
                         'MID':x['match_num'],\
                         'PID':x['loser_id'],\
                         'HAND':x['loser_hand'],\
                         'HT':x['loser_ht'],\
                         'AGE':x['loser_age'],\
                         'RANKPTS':x['loser_rank_points'],\
                         'RANK':x['loser_rank'],\
                         'ACE':x['l_ace'],\
                         'DF':x['l_df'],\
                         'SVPT':x['l_svpt'],\
                         'FSTIN':x['l_1stIn'],\
                         'FSTWON':x['l_1stWon'],\
                         'SNDWON':x['l_2ndWon'],\
                         'BPSAVED':x['l_bpSaved'],\
                         'BPFACED':x['l_bpFaced']})
    
    if x['RES']==1:
        return winner
    else:
        return loser
                         
# mutatis mutandis for player 0.  (Note:  no need to rewrite this function if I can figure 
# out how to assign two outputs within an "apply" call.)      
def assign_player_0(x):
  
    winner = pd.Series({'DATE':x['tourney_date'],\
                        'TID':x['tourney_id'],\
                         'MID':x['match_num'],\
                         'PID':x['winner_id'],\
                         'HAND':x['winner_hand'],\
                         'HT':x['winner_ht'],\
                         'AGE':x['winner_age'],\
                         'RANKPTS':x['winner_rank_points'],\
                         'RANK':x['winner_rank'],\
                         'ACE':x['w_ace'],\
                         'DF':x['w_df'],\
                         'SVPT':x['w_svpt'],\
                         'FSTIN':x['w_1stIn'],\
                         'FSTWON':x['w_1stWon'],\
                         'SNDWON':x['w_2ndWon'],\
                         'BPSAVED':x['w_bpSaved'],\
                         'BPFACED':x['w_bpFaced']})
    
    loser = pd.Series({'DATE':x['tourney_date'],\
                       'TID':x['tourney_id'],\
                         'MID':x['match_num'],\
                         'PID':x['loser_id'],\
                         'HAND':x['loser_hand'],\
                         'HT':x['loser_ht'],\
                         'AGE':x['loser_age'],\
                         'RANKPTS':x['loser_rank_points'],\
                         'RANK':x['loser_rank'],\
                         'ACE':x['l_ace'],\
                         'DF':x['l_df'],\
                         'SVPT':x['l_svpt'],\
                         'FSTIN':x['l_1stIn'],\
                         'FSTWON':x['l_1stWon'],\
                         'SNDWON':x['l_2ndWon'],\
                         'BPSAVED':x['l_bpSaved'],\
                         'BPFACED':x['l_bpFaced']})
    
    if x['RES']==1:
        return loser
    else:
        return winner

In [409]:
matches_hist.insert(len(matches_hist.columns),'RES',features['RES'].values)
P1=matches_hist.apply(assign_player_1,axis=1)
P0=matches_hist.apply(assign_player_0,axis=1)

Features I and II: differences of ranks and rank points (RPTS)

We'll scale the rank point differences by the standard deviation.


In [ ]:
features.insert(len(features.columns), 'RANKPTS', P1['RANKPTS']-P0['RANKPTS'])
features.insert(len(features.columns), 'RANK', P1['RANK']-P0['RANK'])

features['RANKPTS'] = features['RANKPTS']/features['RANKPTS'].std()
features['RANK'] = features['RANK']/features['RANK'].std()

In [450]:
# define figure and axes
fig = plt.figure(figsize=(15,5))
ax0 = fig.add_subplot(121)
ax1 = fig.add_subplot(122)

ax0.hist(features.RANK.dropna())
ax0.set_title('Diff. in rank')
ax1.hist(features.RANKPTS.dropna())
ax1.set_title('Diff in rank pts')


Out[450]:
<matplotlib.text.Text at 0x13fa47c18>

Feature III: differences of first serve winning percentage


In [426]:
P1.insert(len(P1.columns),'FSWPCT',P1['FSTWON']/P1['SVPT'])
P0.insert(len(P0.columns),'FSWPCT',P0['FSTWON']/P0['SVPT'])

P1_grouped = P1.groupby('PID')
P0_grouped = P0.groupby('PID')

def extract_features(group):
    mean_fswpct = group['FSWPCT'].mean()
    size = len(group)
    return pd.Series({'mean_fswpct':mean_fswpct,'size':size})

t1=P1_grouped.apply(extract_features).reset_index()
t0=P0_grouped.apply(extract_features).reset_index()

t2 = pd.merge(t1,t0,how='outer',on='PID')
t2 = t2.fillna(0)
t2['FSWPCT_HIST'] = (t2['mean_fswpct_x']*t2['size_x'] +\
                t2['mean_fswpct_y']*t2['size_y'])/(t2['size_x']+t2['size_y'])

In [428]:
P1=pd.merge(P1,t2[['PID','FSWPCT_HIST']],how='inner',on='PID')    
P0=pd.merge(P0,t2[['PID','FSWPCT_HIST']],how='inner',on='PID')

In [ ]:
features['FSWPCT']=P1['FSWPCT']-P0['FSWPCT']

In [451]:
plt.hist(features.FSWPCT.dropna())
plt.title('Diff. in first serve winning percentages')


Out[451]:
<matplotlib.text.Text at 0x13f2a9d68>

Feature 4: Height differences


In [ ]:
features['HT'] = P1['HT']-P2['HT']
features['HT'] = features['HT']/features['HT'].std()

In [452]:
plt.hist(features.HT.dropna())
plt.title('Difference in height')


Out[452]:
<matplotlib.text.Text at 0x13fdb0a58>

Feature 5: Age differences


In [ ]:
features['AGE'] = P1['AGE']-P2['AGE']
features['AGE'] = features['AGE']/features['AGE'].std()

In [456]:
plt.hist(features.AGE.dropna())
plt.title('Difference in age')


Out[456]:
<matplotlib.text.Text at 0x12177f438>

Future Work

Proposed features:

  • percent of winning service returns (derivable from existing data)
  • percent of winning tie-breakers
  • percent of upsets (losing when higher ranked, winning when lower ranked)
  • percent of double faults per game
  • percent aces per game
  • percent head-to-head victories (against the same player)
  • advantage when serving
  • time since injury

Parameters to solve for:

  • Time discount factor over historical averages
  • Match weighting factor. Issues include:
    -heads-to-heads
    -common opponents
    -court surface

Take-aways:

  1. rank is not the whole picture
    -Ex: David Nalbandian has gone 8-11 against Federer, in spite of being lower ranked
  2. the data is messy
    -incomplete matches
    -missing data
    -incorrect data
  3. informative features need to be constructed
    -infering injury
    -head-to-head history

In [444]:
P1=pd.merge(P1,t2[['PID','SSWPCT_HIST']],how='inner',on='PID')    
P0=pd.merge(P0,t2[['PID','SSWPCT_HIST']],how='inner',on='PID')