Extracting features

Author: Carl Toews
File: extract_features.ipynb

Description:

In order to implement machine learning algorithms, we need to develop a set of informative features. Following Machine learning for predicting professional tennis matches [Sipko, 2015], for each match we assign one player to be "Player 0" and the other to be "Player 1", and call the outcome a 0 if Player 0 won and a 1 otherwise. For each match, we produce a set of features, each a measure of difference between some characteristic of the two players. The characteristics we consider include the following (many of these are from Sipko):

rank: rank
rankpts: rank points
height: height
hand: handedness binary (0 for right, 1 for left)
fsp: "first serve is valid" (percentage)
wfs: winning first serve (percentage)
wss: winning second serve (percentage )
wsp: winning on any serve (percentage)
wrp: winning on returns (percentage)
tpw: total points won (percentage)
acpg: average number of aces (per game)
dfpg: average number of double faults (per game)
bps: break points saved (percentage)
tmw: total matches won (percentage)
retired: binary (True if 1st match back since retirement)
fatigue: fatigue score (based on number of matches in past 3 days)
complete: player completeness score (Sipko)
serveadv: score to measure the relative advantage when serving
direct: head to head balance with a particular player

Note that rank, rankpts, height, and hand can be read from player data directly with no calculation. On the other hand, fsp, wfs, wss, wsp, wrp, tpw, acpg, dfpg, bps can be calculated for any given match, but for purposes of predicting a future match, need to be averaged over the historical record. Finally, tmw, retired, fatigue, complete, serveadv, and direct are not calculated "match-by-match", but rather derived from the historical record.

In order to derive these features, we'll first need to clean the data a bit. Specifically, we need to deal with missing or null values, as well as rectify incorrect values. Examples of issues include:

The 'score' column contains inconsistent strings for indicating irregular outcomes
Many matches don't include statistics such as ace rates, double faults, etc.

I. Data extraction

Import statements



In [474]:

    
import sqlalchemy # pandas-mysql interface library
import sqlalchemy.exc # exception handling
from   sqlalchemy import create_engine  # needed to define db interface
import sys # for defining behavior under errors
import numpy as np # numerical libraries
import scipy as sp
import pandas as pd # for data analysis
import pandas.io.sql as sql # for interfacing with MySQL database
import matplotlib as mpl # a big library with plotting functionality
import matplotlib.pyplot as plt # a subset of matplotlib with most of the useful tools
import IPython as IP
%matplotlib inline 
import pdb
#%qtconsole

Player and odds data from 2010-2016 has beeen matched and stored. Retrieve, merge, and rename.



In [973]:

    
pickle_dir = '../pickle_files/'
odds_file = 'odds.pkl'
matches_file = 'matches.pkl'

odds= pd.read_pickle(pickle_dir + odds_file)
matches= pd.read_pickle(pickle_dir + matches_file)
data = pd.merge(matches,odds[['PSW','PSL','key']],how='inner',on='key')

Get additional training data. We'll include data from 2005, excluding Davis Cup matches, and focusing exclusively on players who played in our 2010-2016 set.



In [1710]:

    
# name of database
db_name = "tennis"
# name of db user
username = "testuser"
# db password for db user
password = "test623"
# location of atp data files
atpfile_directory = "../data/tennis_atp-master/"

# focus on most recent data; exclude Davis Cup stuff
startdate = '20050101'
enddate = '20161231'

engine = create_engine('mysql+mysqldb://' + username + ':' + password + '@localhost/' + db_name)
    
# get unique winners and losers in our set
players = tuple(pd.concat((data.winner_id,data.loser_id)).unique())

# load all data pertinant for any player
with engine.begin() as connection:
    matches_hist = pd.read_sql_query("""SELECT * FROM matches \
    WHERE tourney_date >= '""" + startdate + """' \
    AND tourney_date <= '""" + enddate + """' \
    AND (winner_id IN %(p)s \
    OR loser_id IN %(p)s) \
    AND tourney_name NOT LIKE 'Davis%%';""",connection,params={'p':players})

II. Calculate match statistics

Many of the features we will develop will involve quantities derived from previous matches. Examples include weighted historical averages of winning first serves, double faults, etc. This section calculates some important quantities for each match.

Insert index on matches_hist



In [1711]:

    
matches_hist['key'] = np.arange(len(matches_hist))

Extract text string indicating unusual match outcomes



In [1712]:

    
# scores are just numbers, unless something weird happened.  Extract comments about irregular outcomes.  
t=matches_hist.score.str.extractall('(?P<comment>[a-zA-Z]+.+)').xs(0,level='match')
matches_hist = pd.merge(matches_hist,t,how='outer',left_index=True, right_index=True)
matches_hist.comment.unique()









    Out[1712]:





array([nan, 'W/O', 'RET', 'DEF', 'Jun'], dtype=object)

Calculate games scores for each set, store in a separate dataframe



In [1713]:

    
# discard comments and trailing white space
scores = matches_hist.score.str.replace('(?P<comment>[a-zA-Z]+.+)','')
scores = scores.str.replace('(?P<comment>\([0-9]+\))','').str.strip()

# split the game scores into columns of a dataframe
scores = scores.str.split('-|\s',expand=True)
scores.columns=['W1','L1','W2','L2','W3','L3','W4','L4','W5','L5']
scores = scores.apply(lambda x: pd.to_numeric(x))

Store the number of games played in each match.



In [1714]:

    
ngames = np.sum(scores,axis=1)
matches_hist.insert(0,'ngames',ngames.astype('int'))

It seems a few matches were cut short for no recorded reason:



In [1715]:

    
# sanity check:  are matches involving few than 12 games identical to those with commments?
idx1 = (ngames<12)
idx2 = matches_hist.comment.notnull()
z=(idx1*1)*(idx2*1-1)
zz = np.where(np.abs(z))[0]
print("matches with weird outcomes:  ")
print(matches_hist.loc[zz,'score'])









    



matches with weird outcomes:  
6018     6-3 1-1
7245     6-3 1-0
7286        None
7466         4-0
7937     6-3 1-0
19303        4-1
Name: score, dtype: object

Calcuate first serve percentages (fsp) for both winners and losers.



In [1716]:

    
matches_hist.insert(0,'w_fsp',matches_hist.w_1stIn/matches_hist.w_svpt)
matches_hist.insert(0,'l_fsp',matches_hist.l_1stIn/matches_hist.l_svpt)
matches_hist.loc[matches_hist.]

Calculate winning first serve percentages (wfs)



In [1717]:

    
matches_hist.insert(0,'w_wfs',matches_hist.w_1stWon/matches_hist.w_svpt)
matches_hist.insert(0,'l_wfs',matches_hist.l_1stWon/matches_hist.l_svpt)

Calculate second serves in (2ndIn) for both winners and losers



In [1718]:

    
matches_hist.insert(0,'w_2ndIn',matches_hist.w_svpt-matches_hist.w_df-matches_hist.w_1stIn)
matches_hist.insert(0,'l_2ndIn',matches_hist.l_svpt-matches_hist.l_df-matches_hist.l_1stIn)

Calculate second serve (ssp) percentages



In [1719]:

    
matches_hist.insert(0,'w_ssp',matches_hist.w_2ndIn/(matches_hist.w_2ndIn+matches_hist.w_df))
matches_hist.insert(0,'l_ssp',matches_hist.l_2ndIn/(matches_hist.l_2ndIn+matches_hist.l_df))

Calculate wining second serve percentages (wss)



In [1720]:

    
matches_hist.insert(0,'w_wss',matches_hist.w_2ndWon/matches_hist.w_2ndIn)
matches_hist.insert(0,'l_wss',matches_hist.l_2ndWon/matches_hist.l_2ndIn)

Calculate overall win on serve percentages (wsp)



In [1860]:

    
#matches_hist.insert(0,'w_wsp',(matches_hist.w_1stWon + matches_hist.w_2ndWon)/matches_hist.w_svpt)
#matches_hist.insert(0,'l_wsp',(matches_hist.l_1stWon+matches_hist.l_2ndWon)/matches_hist.l_svpt)
matches_hist['w_wsp']=(matches_hist.w_1stWon + matches_hist.w_2ndWon)/matches_hist.w_svpt
matches_hist['l_wsp']=(matches_hist.l_1stWon+matches_hist.l_2ndWon)/matches_hist.l_svpt

Calculate winning on return percentages (wrp).

[(# of opponent serves) - (# of opponent service victories)]/(# of opponent serves)



In [1722]:

    
matches_hist.insert(0,'w_wrp',(matches_hist.l_svpt - matches_hist.l_1stWon \
                               - matches_hist.l_2ndWon)/(matches_hist.l_svpt))
matches_hist.insert(0,'l_wrp',(matches_hist.w_svpt - matches_hist.w_1stWon \
                              - matches_hist.w_2ndWon)/(matches_hist.w_svpt))

Calculate total points won percentage (tpw)



In [1723]:

    
matches_hist.insert(0,'w_tpw',(matches_hist.l_svpt\
                          -matches_hist.l_1stWon-matches_hist.l_2ndWon\
                          +matches_hist.w_1stWon +matches_hist.w_2ndWon)/\
                        (matches_hist.l_svpt + matches_hist.w_svpt))
matches_hist.insert(0,'l_tpw',(matches_hist.w_svpt\
                        -matches_hist.w_1stWon-matches_hist.w_2ndWon\
                        +matches_hist.l_1stWon +matches_hist.l_2ndWon)/\
                        (matches_hist.l_svpt + matches_hist.w_svpt))

Calculate double faults per game (dfpg), per game

There are a couple of bad entries for SvGms. We'll first fix those. (Note: the fix evenly partions total games between both players. This could result in a fractional number of service games. My hunch is the total effect on the stats is minimal.)



In [1828]:

    
idx = np.where(((matches_hist.w_SvGms == 0)|(matches_hist.l_SvGms==0)) & (matches_hist.ngames >1))
print(matches_hist.loc[idx[0],['w_df','l_df','w_SvGms','l_SvGms','score','ngames']])
matches_hist.loc[idx[0],'w_SvGms'] = matches_hist.ngames[idx[0]]/2
matches_hist.loc[idx[0],'l_SvGms'] = matches_hist.ngames[idx[0]]/2     
print(matches_hist.loc[idx[0],['w_df','l_df','w_SvGms','l_SvGms','score','ngames']])









    



       w_df  l_df  w_SvGms  l_SvGms        score  ngames
2956    4.0   2.0      0.0      0.0      6-3 6-3      18
9664    2.0  12.0     16.0      0.0  6-3 6-4 7-5      31
12770   0.0   0.0      0.0      4.0  6-2 6-2 6-1      23
18690   3.0   0.0     12.0      0.0      7-5 7-5      24
18695   1.0   0.0      9.0      0.0      6-3 6-3      18
       w_df  l_df  w_SvGms  l_SvGms        score  ngames
2956    4.0   2.0      9.0      9.0      6-3 6-3      18
9664    2.0  12.0     15.5     15.5  6-3 6-4 7-5      31
12770   0.0   0.0     11.5     11.5  6-2 6-2 6-1      23
18690   3.0   0.0     12.0     12.0      7-5 7-5      24
18695   1.0   0.0      9.0      9.0      6-3 6-3      18



In [1851]:

    
matches_hist.insert(0,'w_dfpg',matches_hist.w_df/matches_hist.w_SvGms)
matches_hist.insert(0,'l_dfpg',matches_hist.l_df/matches_hist.l_SvGms)
#matches_hist['w_dfpg']=matches_hist.w_df/matches_hist.w_SvGms
#matches_hist['l_dfpg']=matches_hist.l_df/matches_hist.l_SvGms

Calculate aces per game (acpg), per game



In [1855]:

    
matches_hist.insert(0,'w_acpg',matches_hist.w_ace/matches_hist.w_SvGms)
matches_hist.insert(0,'l_acpg',matches_hist.l_ace/matches_hist.l_SvGms)
#matches_hist['w_acpg']=matches_hist.w_ace/matches_hist.w_SvGms
#matches_hist['l_acpg']=matches_hist.l_ace/matches_hist.l_SvGms

Calculate break points saved percentage (bpsp)



In [ ]:

    
matches_hist.insert(0,'w_bps',matches_hist.w_bpSaved/matches_hist.w_bpFaced)
matches_hist.insert(0,'l_bps',matches_hist.l_bpSaved/matches_hist.l_bpFaced)

Flag games with premature closure, probably due to injury (retired)



In [1727]:

    
matches_hist.insert(0,'retired',0)
matches_hist.loc[(matches_hist.comment=='RET'),'retired']=1

Flag games won as a walkover (wo)



In [1728]:

    
matches_hist.insert(0,'walkover',0)
matches_hist.loc[(matches_hist.comment=='W/O'),'walkover']=1

Calculate player completeness (complete), defined as $$ wsp \times wrp $$

Note: might be more useful to aggregate first, rather than compute on a per-match basis.



In [1738]:

    
matches_hist.insert(0,'w_complete',matches_hist.w_wsp*matches_hist.w_wrp)
matches_hist.insert(0,'l_complete',matches_hist.l_wsp*matches_hist.l_wrp)

Calculate player service advantage (serveadv), defined as

$$ wsp_1 -wrp_2 $$

Note: as with complete, it might be more useful to aggregated first



In [1739]:

    
matches_hist.insert(0,'w_serveadv',matches_hist.w_wsp-matches_hist.l_wrp)
matches_hist.insert(0,'l_serveadv',matches_hist.l_wsp-matches_hist.w_wrp)

Sanity check: investigate calculated quantities



In [1861]:

    
idx = matches_hist.comment.isnull()
labels = ['dfpg', 'acpg', 'tpw', 'wrp', 'wsp', 'wss', 'wfs', 'fsp', 'ssp', 'bps','complete']
for label in labels:
    printstr = label + ": max for winner/loser is {:5.2f}/{:5.2f}, min for winner/loser is {:5.2f}/{:5.2f}"
    v1 = eval('matches_hist.w_' + label + '[idx].max()')
    v2 = eval('matches_hist.l_' + label + '[idx].max()')
    v3 = eval('matches_hist.w_' + label + '[idx].min()')
    v4 = eval('matches_hist.l_' + label + '[idx].min()')
    print(printstr.format(v1,v2,v3,v4))









    



dfpg: max for winner/loser is  1.44/ 2.00, min for winner/loser is  0.00/ 0.00
acpg: max for winner/loser is  2.80/ 2.55, min for winner/loser is  0.00/ 0.00
tpw: max for winner/loser is  0.79/ 0.56, min for winner/loser is  0.44/ 0.21
wrp: max for winner/loser is  0.82/ 0.53, min for winner/loser is  0.15/ 0.00
wsp: max for winner/loser is  1.00/ 0.85, min for winner/loser is  0.47/ 0.18
wss: max for winner/loser is  1.00/ 1.00, min for winner/loser is  0.00/ 0.00
wfs: max for winner/loser is  0.84/ 0.74, min for winner/loser is  0.22/ 0.08
fsp: max for winner/loser is  1.00/ 1.00, min for winner/loser is  0.29/ 0.24
ssp: max for winner/loser is  1.00/ 1.00, min for winner/loser is  0.25/ 0.36
bps: max for winner/loser is  1.00/ 1.00, min for winner/loser is  0.00/ 0.00
complete: max for winner/loser is  0.63/ 0.25, min for winner/loser is  0.11/ 0.00

III. Calculate features

Each feature generally requires some calculation. We'll use the bigger data set matches_hist to drive the calculations, and store the results in a data frame called features, of the same size as data.

Extract winner and loser data separately and concatenate into dataframe all_records with uniform column names.



In [1862]:

    
# extract winner stats
w_records = matches_hist[['winner_id',
                          'tourney_date',
                          'tourney_id',
                          'match_num',
                          'ngames',
                          'key',
                          'w_acpg', # avg. no of aces per game
                          'w_dfpg', # avg no. of double faults per game
                          'w_tpw',  # total points won
                          'w_wrp',  # wining return percent
                          'w_wsp',  # winning service percent
                          'w_wss',  # winning second serve percent
                          'w_wfs',  # winning first serve percent
                          'w_fsp',  # good first serves percent
                          'w_ssp',  # good second serves percent
                          'w_bps',  # breakpoints saved percent 
                          'retired',# 1 if loser retired prematurely
                          'walkover', # 1 if loser didn't show up
                          'surface', # 'Hard', 'Clay', or 'Grass'
                          'winner_age',  # age
                          'winner_ht',   # height
                          'winner_rank', # rank
                          'winner_rank_points' # rank points
                         ]]

# rename columns
newcols = {'winner_id':'pid', 
           'tourney_date':'date',
           'tourney_id':'tid',
           'match_num':'mid',
           'ngames':'ngames',
           'key':'key',
           'w_acpg':'acpg',
           'w_dfpg':'dfpg',
           'w_tpw':'tpw',
           'w_wrp':'wrp',
           'w_wsp':'wsp',
           'w_wss':'wss',
           'w_wfs':'wfs',
           'w_fsp':'fsp',
           'w_ssp':'ssp',
           'w_bps':'bps',
           'retired':'retired',
           'walkover':'walkover',
           'surface':'surface',
           'winner_age':'age',  
           'winner_ht':'ht',   
           'winner_rank':'rank', 
           'winner_rank_points':'rank_points'
          }
w_records = w_records.rename(columns = newcols)

# record that the outcome was a victory for these players
w_records['outcome'] = np.ones(len(w_records))

# extract loser stats
l_records = matches_hist[['loser_id',
                          'tourney_date',
                          'tourney_id',
                          'match_num',
                          'ngames',
                          'key',
                          'l_acpg', # avg. no of aces per game
                          'l_dfpg',  # avg no. of double faults per game
                          'l_tpw',  # total points won
                          'l_wrp',  # wining return percent
                          'l_wsp',  # winning service percent
                          'l_wss',  # winning second serve percent
                          'l_wfs',  # winning first serve percent
                          'l_fsp',  # percent of successful first serves
                          'l_ssp',  # percent of successful second serves
                          'l_bps',  # percent of breakpoints saved  
                          'retired',# 1 if loser retired prematurely
                          'walkover',# 1 if loser didn't show up
                          'surface', # 'Hard', 'Clay', or 'Grass'
                          'loser_age',  # age
                          'loser_ht',   # height
                          'loser_rank', # rank
                          'loser_rank_points' # rank points
                         ]]

# rename columns
newcols = {'loser_id':'pid', 
           'tourney_date':'date',
           'tourney_id':'tid',
           'match_num':'mid',
           'ngames':'ngames',
           'key':'key',
           'l_acpg':'acpg',
           'l_dfpg':'dfpg',
           'l_tpw':'tpw',
           'l_wrp':'wrp',
           'l_wsp':'wsp',
           'l_wss':'wss',
           'l_wfs':'wfs',
           'l_fsp':'fsp',
           'l_ssp':'ssp',
           'l_bps':'bps', 
           'retired':'retired',
           'walkover':'walkover',
           'surface':'surface',
           'loser_age':'age',  
           'loser_ht':'ht',   
           'loser_rank':'rank', 
           'loser_rank_points':'rank_points'
          }
l_records = l_records.rename(columns = newcols)

# record outcome as a loss
l_records['outcome'] = np.zeros(len(w_records))

# fuse all the data into one dataframe
all_records = pd.concat([w_records,l_records]).reset_index().sort_values(['key']).replace(np.inf,np.nan)

Calculate surface weighting matrix. Note that the resulting values are quite different from Sipko's.



In [1863]:

    
grouped = all_records.groupby(['pid','surface'])
t=grouped['outcome'].mean()
surf_wt = t.unstack(level=-1).corr()
surf_wt

Function to calculate static features. (Specifically, calculate normalized rank, rankpts, age, height, hand features for every match in the dataset. Operates on the data as a whole, rather than by line.)



In [1869]:

    
def  get_static_features(data):
    """
    Description: returns differences of those features that don't depend on match histories. 
    (Rank, Rankpoints, Height, Hand)
    
    Input:  dataframe with all the match data for which features are to be calculated
    Output:  another dataframe of the same length but only four columns, each with one feature
    """
    
    # boolean, 1 means (winner,loser) are (player1,player2), 0 means the reverse
    outcome = data['outcome'] 
    
    # features dataframe should include merge identifiers
    features = data[['tourney_id', 'match_num','tourney_date','key']].copy()
    
    # rank (normalize)
    rank=(data.loser_rank-data.winner_rank)*(-1)**outcome
    features.insert(0,'rank',rank/rank.std())

    # rank points (normalize)
    rankpts = (data.loser_rank_points-data.winner_rank_points)*(-1)**outcome
    features.insert(0,'rankpts',rankpts/rankpts.std())

    # height (normalize)
    height = (data.loser_ht-data.winner_ht)*(-1)**outcome
    features.insert(0,'height',height/height.std())   
    
    # age (normalize)
    height = (data.loser_age-data.winner_age)*(-1)**outcome
    features.insert(0,'age',height/height.std())  
    
    # hand (1 for right, 0 for left)
    hand = ((data.loser_hand=='R')*1-(data.winner_hand=='R')*1)*(-1)**outcome
    hand.iloc[np.where((data['winner_hand']=='U')|\
                      (data['loser_hand']=='U'))[0]]=np.nan
    features.insert(0,'hand',hand)
    
    return features

Get dynamic features (i.e. all features that require some time averaging.)



In [1879]:

    
def get_dynamic_features(x):
    """
    Input:  a row of the dataframe.  Needs to have the following fields:
        pid (player id number)
        tid (tournament id number)
        mid (match id number)
        date (match date)
        surface (match surface)
    Output: a dataframe of two columns, one a time discount, the other a surface discount 
    
    """
        
    # extract identifiers and date from input row
    pid = x['pid'] # player id 
    tid = x['tid'] # tourney id
    mid = x['mid'] # match id
    date = x['date']
    surface = x['surface']
    
    # extract all historical records for this player, from before this match
    records = all_records.loc[(all_records.pid==pid) & (all_records.date <= date) &\
                         ((all_records.tid != tid) | (all_records.mid != mid)),:].copy()
    
    # get time discount factor
    p = 0.8 
    t = (date - records.date).apply(lambda x: x.days/365)
    t_wt = p**t
    t_wt.loc[t_wt>p]=p
    
    # get surface discount factor
    s_wt = records.surface.apply(lambda x: surf_wt.loc[x,surface])

    # get time and court weighted averages of serve and performance stats
    t = records[['dfpg','acpg','tpw','wrp','wsp',\
              'wss','wfs','fsp','ssp','bps']].mul(t_wt*s_wt,axis=0).sum(axis=0)/\
        records[['dfpg','acpg','tpw','wrp','wsp',\
              'wss','wfs','fsp','ssp','bps']].notnull().mul(t_wt*s_wt,axis=0).sum(axis=0)
 
    if len(records)==0:
        t['complete']=np.nan
        t['retired']=np.nan
        return t

    # get player completeness
    t['complete'] = t['wsp']*t['wrp']

    # get player serveadvantage
    t['serveadv'] = t['wsp']+t['wrp']    
    
    # get player "return from retirement" status
    t['retired'] = records.loc[records.date==records.date.min(),'retired'].values[0]
    
    # return a series
    return t



In [1535]:

    
def dynamic_feature_wrapper(x):
    """
    calls "get_dynamic_features" to extract dynamic features for each player
    """    
    
    pids = x[['lid','wid']]
    y = x.copy()
    
    # get Player1 info
    y['pid'] = pids[y['outcome']]
    P1_features = get_dynamic_features(y)
    
    # get Player0 info
    y['pid'] = pids[1-y['outcome']]
    P2_features = get_dynamic_features(y)
     
    # features are differences
    features = P1_features - P2_features
    
    # compute service advantage
    features['serveadv'] = 
        
    return



In [1870]:

    
data['outcome']=np.random.choice([0,1],size=len(features))
s_features=get_static_features(data)
s_features









    Out[1870]:







  
    
      
      hand
      age
      height
      rankpts
      rank
      tourney_id
      match_num
      tourney_date
      key
    
  
  
    
      0
      0.0
      -0.941438
      -1.184513
      0.293678
      -3.651544
      2010-1536
      11
      2010-05-09
      0
    
    
      1
      0.0
      -1.058447
      0.296128
      -0.136147
      0.187259
      2010-1536
      7
      2010-05-09
      1
    
    
      2
      0.0
      0.203960
      -0.690966
      0.360682
      -0.499356
      2010-1536
      20
      2010-05-09
      2
    
    
      3
      0.0
      1.050396
      -0.197419
      0.619075
      -0.460344
      2010-1536
      22
      2010-05-09
      3
    
    
      4
      0.0
      -1.521115
      1.480641
      -0.000356
      0.007802
      2010-1536
      17
      2010-05-09
      4
    
    
      5
      1.0
      -0.127744
      0.197419
      -0.183905
      0.163851
      2010-1536
      9
      2010-05-09
      5
    
    
      6
      0.0
      1.243085
      0.987094
      -0.002851
      0.007802
      2010-1536
      13
      2010-05-09
      6
    
    
      7
      0.0
      0.388062
      1.283223
      0.078409
      -0.054617
      2010-1536
      15
      2010-05-09
      7
    
    
      8
      1.0
      -1.114805
      NaN
      0.561694
      -2.286116
      2010-1536
      21
      2010-05-09
      8
    
    
      9
      0.0
      -0.953783
      1.283223
      -0.636182
      0.249678
      2010-1536
      3
      2010-05-09
      9
    
    
      10
      0.0
      0.738551
      0.197419
      -0.247701
      0.296493
      2010-1536
      18
      2010-05-09
      10
    
    
      11
      0.0
      -0.838385
      -0.197419
      0.007128
      -0.031210
      2010-1536
      12
      2010-05-09
      11
    
    
      12
      0.0
      -1.756743
      -2.270317
      -0.219189
      0.288690
      2010-1536
      1
      2010-05-09
      12
    
    
      13
      -1.0
      0.977937
      -1.974189
      0.290826
      -0.522764
      2010-1536
      4
      2010-05-09
      13
    
    
      14
      0.0
      -0.707421
      1.480641
      -0.020671
      0.101432
      2010-1536
      24
      2010-05-09
      14
    
    
      15
      1.0
      -0.053137
      -0.493547
      -0.314705
      0.522764
      2010-1536
      16
      2010-05-09
      15
    
    
      16
      0.0
      -0.580214
      -0.000000
      0.240573
      -0.257481
      2010-1536
      2
      2010-05-09
      16
    
    
      17
      0.0
      0.092856
      -0.987094
      0.085537
      -0.039012
      2010-1536
      10
      2010-05-09
      17
    
    
      18
      0.0
      0.426707
      0.296128
      0.194597
      -0.444739
      2010-1536
      19
      2010-05-09
      18
    
    
      19
      0.0
      -0.620469
      -0.690966
      -0.130088
      0.280888
      2010-1536
      14
      2010-05-09
      19
    
    
      20
      0.0
      1.233424
      -0.789675
      -0.467959
      0.358912
      2010-1536
      6
      2010-05-09
      20
    
    
      21
      0.0
      -1.247379
      3.553539
      0.512866
      -0.944096
      2010-1536
      23
      2010-05-09
      21
    
    
      22
      0.0
      -0.806717
      0.987094
      -0.226673
      0.234073
      2010-1536
      8
      2010-05-09
      22
    
    
      23
      0.0
      -0.923726
      -0.987094
      -0.028869
      0.054617
      2010-1536
      5
      2010-05-09
      23
    
    
      24
      0.0
      -0.028447
      0.690966
      3.404022
      -0.351110
      2010-1536
      25
      2010-05-09
      24
    
    
      25
      0.0
      0.417582
      -0.493547
      0.304013
      -0.296493
      2010-1536
      34
      2010-05-09
      25
    
    
      26
      0.0
      1.211954
      -0.690966
      0.390620
      -0.163851
      2010-1536
      30
      2010-05-09
      26
    
    
      27
      -1.0
      -0.675753
      0.789675
      0.293678
      -0.553974
      2010-1536
      32
      2010-05-09
      27
    
    
      28
      0.0
      -1.510917
      -0.000000
      1.684369
      -0.366715
      2010-1536
      38
      2010-05-09
      28
    
    
      29
      1.0
      0.103054
      NaN
      0.346069
      -2.161277
      2010-1536
      28
      2010-05-09
      29
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      18418
      1.0
      0.082121
      -0.493547
      -1.420273
      0.312098
      2016-M052
      30
      2016-02-15
      18649
    
    
      18419
      -1.0
      0.646769
      NaN
      -0.429111
      0.405727
      2016-M052
      29
      2016-02-15
      18650
    
    
      18420
      -1.0
      0.476623
      0.493547
      1.313351
      -0.218468
      2016-M052
      27
      2016-02-15
      18651
    
    
      18421
      -1.0
      -0.932851
      0.987094
      0.000000
      0.007802
      2016-M052
      26
      2016-02-15
      18652
    
    
      18422
      1.0
      0.936071
      NaN
      -0.083399
      0.366715
      2016-M052
      25
      2016-02-15
      18653
    
    
      18423
      0.0
      2.239271
      NaN
      0.853589
      -0.101432
      2016-M052
      28
      2016-02-15
      18654
    
    
      18424
      1.0
      0.153507
      -0.197419
      -1.442014
      0.374517
      2016-M052
      23
      2016-02-15
      18655
    
    
      18425
      0.0
      0.208254
      NaN
      0.170361
      -0.312098
      2016-M052
      21
      2016-02-15
      18656
    
    
      18426
      1.0
      1.648859
      NaN
      0.270511
      -2.286116
      2016-M052
      24
      2016-02-15
      18657
    
    
      18427
      1.0
      1.726149
      -0.690966
      -0.018533
      0.031210
      2016-M052
      22
      2016-02-15
      18658
    
    
      18428
      0.0
      0.351563
      0.690966
      -0.364602
      0.733430
      2016-M052
      18
      2016-02-15
      18659
    
    
      18429
      -1.0
      -0.484138
      NaN
      0.026018
      -0.093629
      2016-M052
      19
      2016-02-15
      18660
    
    
      18430
      0.0
      0.205571
      NaN
      -0.466534
      0.553974
      2016-M052
      17
      2016-02-15
      18661
    
    
      18431
      1.0
      1.135737
      -1.283223
      1.228527
      -0.319900
      2016-M052
      20
      2016-02-15
      18662
    
    
      18432
      -1.0
      1.001016
      NaN
      1.468744
      -0.475949
      2016-M052
      15
      2016-02-15
      18663
    
    
      18433
      -1.0
      0.697759
      NaN
      -0.040630
      0.148246
      2016-M052
      14
      2016-02-15
      18664
    
    
      18434
      0.0
      0.619396
      NaN
      0.037779
      -0.163851
      2016-M052
      2
      2016-02-15
      18665
    
    
      18435
      -1.0
      0.167999
      0.789675
      0.001782
      -0.007802
      2016-M052
      11
      2016-02-15
      18666
    
    
      18436
      1.0
      1.787874
      NaN
      1.002923
      -2.567004
      2016-M052
      16
      2016-02-15
      18667
    
    
      18437
      -1.0
      -0.814768
      NaN
      -0.114050
      0.436937
      2016-M052
      13
      2016-02-15
      18668
    
    
      18438
      0.0
      -2.115821
      NaN
      -0.081973
      0.335505
      2016-M052
      7
      2016-02-15
      18669
    
    
      18439
      -1.0
      0.386451
      0.493547
      -0.240573
      0.179456
      2016-M052
      6
      2016-02-15
      18670
    
    
      18440
      0.0
      -0.421876
      0.296128
      -0.239860
      0.210666
      2016-M052
      8
      2016-02-15
      18671
    
    
      18441
      0.0
      -0.266222
      -0.000000
      -0.116901
      0.491554
      2016-M052
      12
      2016-02-15
      18672
    
    
      18442
      0.0
      0.586655
      0.987094
      0.084112
      -0.452542
      2016-M052
      3
      2016-02-15
      18673
    
    
      18443
      -1.0
      -0.991355
      NaN
      -0.662556
      0.468147
      2016-M052
      1
      2016-02-15
      18674
    
    
      18444
      0.0
      1.492131
      NaN
      -0.413429
      0.335505
      2016-M052
      9
      2016-02-15
      18675
    
    
      18445
      0.0
      -0.827650
      NaN
      0.101575
      -0.608591
      2016-M052
      5
      2016-02-15
      18676
    
    
      18446
      -1.0
      0.480380
      0.789675
      0.067717
      -0.234073
      2016-M052
      4
      2016-02-15
      18677
    
    
      18447
      0.0
      -2.651485
      NaN
      -1.515077
      3.799791
      2016-M052
      10
      2016-02-15
      18678
    
  

18448 rows × 9 columns



In [1887]:

    
x=data[['tourney_id','match_num','tourney_date','key','winner_id','loser_id','surface','outcome']].copy()
x.rename(columns={'tourney_id':'tid','match_num':'mid','tourney_date':'date',\
                    'winner_id':'wid','loser_id':'lid'},inplace=True)



In [1892]:

    
x.iloc[0:5,:].apply(dynamic_feature_wrapper,axis=1)
#x.iloc[0:5,:]



In [1886]:

    
y = matches_hist.iloc[15000]
dynamic_feature_wrapper(y)









    



---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-1886-ef9aba4445b7> in <module>()
      1 y = matches_hist.iloc[15000]
----> 2 dynamic_feature_wrapper(y)

<ipython-input-1535-14e39c096a97> in dynamic_feature_wrapper(x)
      4     """    
      5 
----> 6     pids = x[['lid','wid']]
      7     y = x.copy()
      8 

/Users/ctoews/anaconda/envs/py3_6_2/lib/python3.6/site-packages/pandas/core/series.py in __getitem__(self, key)
    640             key = check_bool_indexer(self.index, key)
    641 
--> 642         return self._get_with(key)
    643 
    644     def _get_with(self, key):

/Users/ctoews/anaconda/envs/py3_6_2/lib/python3.6/site-packages/pandas/core/series.py in _get_with(self, key)
    681                     # handle the dup indexing case (GH 4246)
    682                     if isinstance(key, (list, tuple)):
--> 683                         return self.loc[key]
    684 
    685                     return self.reindex(key)

/Users/ctoews/anaconda/envs/py3_6_2/lib/python3.6/site-packages/pandas/core/indexing.py in __getitem__(self, key)
   1326         else:
   1327             key = com._apply_if_callable(key, self.obj)
-> 1328             return self._getitem_axis(key, axis=0)
   1329 
   1330     def _is_scalar_access(self, key):

/Users/ctoews/anaconda/envs/py3_6_2/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1539                     raise ValueError('Cannot index with multidimensional key')
   1540 
-> 1541                 return self._getitem_iterable(key, axis=axis)
   1542 
   1543             # nested tuple slicing

/Users/ctoews/anaconda/envs/py3_6_2/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
   1079     def _getitem_iterable(self, key, axis=0):
   1080         if self._should_validate_iterable(axis):
-> 1081             self._has_valid_type(key, axis)
   1082 
   1083         labels = self.obj._get_axis(axis)

/Users/ctoews/anaconda/envs/py3_6_2/lib/python3.6/site-packages/pandas/core/indexing.py in _has_valid_type(self, key, axis)
   1416 
   1417                 raise KeyError("None of [%s] are in the [%s]" %
-> 1418                                (key, self.obj._get_axis_name(axis)))
   1419 
   1420             return True

KeyError: "None of [['lid', 'wid']] are in the [index]"



In [1883]:

    
x = all_records.iloc[13002]
records = get_dynamic_features(x)
records









    Out[1883]:





dfpg        0.207208
acpg        0.619216
tpw         0.521576
wrp         0.378473
wsp         0.669557
wss         0.593572
wfs         0.470833
fsp         0.632490
ssp         0.910408
bps         0.644857
complete    0.253410
retired     0.000000
dtype: float64

It will be help with indexing if we isolate all relevant features for players 1 and 0 into their own dataframes. This is what the following code does.



In [408]:

    
# initialize dataframes to hold features for players 1 and 0
P1 = pd.DataFrame(columns=['DATE','TID','MID','PID','HAND','HT',\
                           'AGE','RANKPTS','RANK','ACE','DF','SVPT',\
                           'FSTIN','FSTWON','SNDWON','BPSAVED','BPFACED'])
P0 = pd.DataFrame(columns=['DATE','TID','MID','PID','HAND','HT',
                           'AGE','RANKPTS','RANK','ACE','DF','SVPT',\
                           'FSTIN','FSTWON','SNDWON','BPSAVED','BPFACED'])

# define a function that returns winner info if RES=1, otherwise loser info
def assign_player_1(x):
  
    winner = pd.Series({'DATE':x['tourney_date'],\
                        'TID':x['tourney_id'],\
                         'MID':x['match_num'],\
                         'PID':x['winner_id'],\
                         'HAND':x['winner_hand'],\
                         'HT':x['winner_ht'],\
                         'AGE':x['winner_age'],\
                         'RANKPTS':x['winner_rank_points'],\
                         'RANK':x['winner_rank'],\
                         'ACE':x['w_ace'],\
                         'DF':x['w_df'],\
                         'SVPT':x['w_svpt'],\
                         'FSTIN':x['w_1stIn'],\
                         'FSTWON':x['w_1stWon'],\
                         'SNDWON':x['w_2ndWon'],\
                         'BPSAVED':x['w_bpSaved'],\
                         'BPFACED':x['w_bpFaced']})
    
    loser = pd.Series({'DATE':x['tourney_date'],\
                       'TID':x['tourney_id'],\
                         'MID':x['match_num'],\
                         'PID':x['loser_id'],\
                         'HAND':x['loser_hand'],\
                         'HT':x['loser_ht'],\
                         'AGE':x['loser_age'],\
                         'RANKPTS':x['loser_rank_points'],\
                         'RANK':x['loser_rank'],\
                         'ACE':x['l_ace'],\
                         'DF':x['l_df'],\
                         'SVPT':x['l_svpt'],\
                         'FSTIN':x['l_1stIn'],\
                         'FSTWON':x['l_1stWon'],\
                         'SNDWON':x['l_2ndWon'],\
                         'BPSAVED':x['l_bpSaved'],\
                         'BPFACED':x['l_bpFaced']})
    
    if x['RES']==1:
        return winner
    else:
        return loser
                         
# mutatis mutandis for player 0.  (Note:  no need to rewrite this function if I can figure 
# out how to assign two outputs within an "apply" call.)      
def assign_player_0(x):
  
    winner = pd.Series({'DATE':x['tourney_date'],\
                        'TID':x['tourney_id'],\
                         'MID':x['match_num'],\
                         'PID':x['winner_id'],\
                         'HAND':x['winner_hand'],\
                         'HT':x['winner_ht'],\
                         'AGE':x['winner_age'],\
                         'RANKPTS':x['winner_rank_points'],\
                         'RANK':x['winner_rank'],\
                         'ACE':x['w_ace'],\
                         'DF':x['w_df'],\
                         'SVPT':x['w_svpt'],\
                         'FSTIN':x['w_1stIn'],\
                         'FSTWON':x['w_1stWon'],\
                         'SNDWON':x['w_2ndWon'],\
                         'BPSAVED':x['w_bpSaved'],\
                         'BPFACED':x['w_bpFaced']})
    
    loser = pd.Series({'DATE':x['tourney_date'],\
                       'TID':x['tourney_id'],\
                         'MID':x['match_num'],\
                         'PID':x['loser_id'],\
                         'HAND':x['loser_hand'],\
                         'HT':x['loser_ht'],\
                         'AGE':x['loser_age'],\
                         'RANKPTS':x['loser_rank_points'],\
                         'RANK':x['loser_rank'],\
                         'ACE':x['l_ace'],\
                         'DF':x['l_df'],\
                         'SVPT':x['l_svpt'],\
                         'FSTIN':x['l_1stIn'],\
                         'FSTWON':x['l_1stWon'],\
                         'SNDWON':x['l_2ndWon'],\
                         'BPSAVED':x['l_bpSaved'],\
                         'BPFACED':x['l_bpFaced']})
    
    if x['RES']==1:
        return loser
    else:
        return winner



In [409]:

    
matches_hist.insert(len(matches_hist.columns),'RES',features['RES'].values)
P1=matches_hist.apply(assign_player_1,axis=1)
P0=matches_hist.apply(assign_player_0,axis=1)

Features I and II: differences of ranks and rank points (RPTS)

We'll scale the rank point differences by the standard deviation.



In [ ]:

    
features.insert(len(features.columns), 'RANKPTS', P1['RANKPTS']-P0['RANKPTS'])
features.insert(len(features.columns), 'RANK', P1['RANK']-P0['RANK'])

features['RANKPTS'] = features['RANKPTS']/features['RANKPTS'].std()
features['RANK'] = features['RANK']/features['RANK'].std()



In [450]:

    
# define figure and axes
fig = plt.figure(figsize=(15,5))
ax0 = fig.add_subplot(121)
ax1 = fig.add_subplot(122)

ax0.hist(features.RANK.dropna())
ax0.set_title('Diff. in rank')
ax1.hist(features.RANKPTS.dropna())
ax1.set_title('Diff in rank pts')









    Out[450]:





<matplotlib.text.Text at 0x13fa47c18>

Feature III: differences of first serve winning percentage



In [426]:

    
P1.insert(len(P1.columns),'FSWPCT',P1['FSTWON']/P1['SVPT'])
P0.insert(len(P0.columns),'FSWPCT',P0['FSTWON']/P0['SVPT'])

P1_grouped = P1.groupby('PID')
P0_grouped = P0.groupby('PID')

def extract_features(group):
    mean_fswpct = group['FSWPCT'].mean()
    size = len(group)
    return pd.Series({'mean_fswpct':mean_fswpct,'size':size})

t1=P1_grouped.apply(extract_features).reset_index()
t0=P0_grouped.apply(extract_features).reset_index()

t2 = pd.merge(t1,t0,how='outer',on='PID')
t2 = t2.fillna(0)
t2['FSWPCT_HIST'] = (t2['mean_fswpct_x']*t2['size_x'] +\
                t2['mean_fswpct_y']*t2['size_y'])/(t2['size_x']+t2['size_y'])



In [428]:

    
P1=pd.merge(P1,t2[['PID','FSWPCT_HIST']],how='inner',on='PID')    
P0=pd.merge(P0,t2[['PID','FSWPCT_HIST']],how='inner',on='PID')



In [ ]:

    
features['FSWPCT']=P1['FSWPCT']-P0['FSWPCT']



In [451]:

    
plt.hist(features.FSWPCT.dropna())
plt.title('Diff. in first serve winning percentages')









    Out[451]:





<matplotlib.text.Text at 0x13f2a9d68>

Feature 4: Height differences



In [ ]:

    
features['HT'] = P1['HT']-P2['HT']
features['HT'] = features['HT']/features['HT'].std()



In [452]:

    
plt.hist(features.HT.dropna())
plt.title('Difference in height')









    Out[452]:





<matplotlib.text.Text at 0x13fdb0a58>

Feature 5: Age differences



In [ ]:

    
features['AGE'] = P1['AGE']-P2['AGE']
features['AGE'] = features['AGE']/features['AGE'].std()



In [456]:

    
plt.hist(features.AGE.dropna())
plt.title('Difference in age')









    Out[456]:





<matplotlib.text.Text at 0x12177f438>

Future Work

Proposed features:

percent of winning service returns (derivable from existing data)
percent of winning tie-breakers
percent of upsets (losing when higher ranked, winning when lower ranked)
percent of double faults per game
percent aces per game
percent head-to-head victories (against the same player)
advantage when serving
time since injury

Parameters to solve for:

Time discount factor over historical averages
Match weighting factor. Issues include:
-heads-to-heads
-common opponents
-court surface

Take-aways:

rank is not the whole picture
-Ex: David Nalbandian has gone 8-11 against Federer, in spite of being lower ranked
the data is messy
-incomplete matches
-missing data
-incorrect data
informative features need to be constructed
-infering injury
-head-to-head history



In [444]:

    
P1=pd.merge(P1,t2[['PID','SSWPCT_HIST']],how='inner',on='PID')    
P0=pd.merge(P0,t2[['PID','SSWPCT_HIST']],how='inner',on='PID')

surface	Carpet	Clay	Grass	Hard
surface
Carpet	1.000000	0.342744	0.295605	0.506449
Clay	0.342744	1.000000	0.363446	0.466821
Grass	0.295605	0.363446	1.000000	0.548772
Hard	0.506449	0.466821	0.548772	1.000000

	dfpg	acpg	tpw	wrp	wsp	wss	wfs	fsp	ssp	bps	complete
0	0.006432	0.301983	-0.017329	-0.040883	0.006915	-0.008153	-0.021032	-0.058883	0.006722	-0.011569	-0.023750
1	-0.024968	0.122376	0.011806	-0.015830	0.042181	0.018513	0.005151	-0.051550	0.023893	0.021771	0.005523
2	-0.046246	0.185565	0.030653	0.015456	0.052124	0.045170	0.029246	-0.001637	0.011653	0.033122	0.029887
3	0.022230	-0.097171	0.042182	0.060434	0.031264	0.044102	0.026328	0.022129	-0.010285	0.042440	0.049001
4	0.003079	0.428348	0.005904	-0.015185	0.028642	0.013616	0.043240	0.031019	-0.017860	0.060902	0.001187

	hand	age	height	rankpts	rank	tourney_id	match_num	tourney_date	key
0	0.0	-0.941438	-1.184513	0.293678	-3.651544	2010-1536	11	2010-05-09	0
1	0.0	-1.058447	0.296128	-0.136147	0.187259	2010-1536	7	2010-05-09	1
2	0.0	0.203960	-0.690966	0.360682	-0.499356	2010-1536	20	2010-05-09	2
3	0.0	1.050396	-0.197419	0.619075	-0.460344	2010-1536	22	2010-05-09	3
4	0.0	-1.521115	1.480641	-0.000356	0.007802	2010-1536	17	2010-05-09	4
5	1.0	-0.127744	0.197419	-0.183905	0.163851	2010-1536	9	2010-05-09	5
6	0.0	1.243085	0.987094	-0.002851	0.007802	2010-1536	13	2010-05-09	6
7	0.0	0.388062	1.283223	0.078409	-0.054617	2010-1536	15	2010-05-09	7
8	1.0	-1.114805	NaN	0.561694	-2.286116	2010-1536	21	2010-05-09	8
9	0.0	-0.953783	1.283223	-0.636182	0.249678	2010-1536	3	2010-05-09	9
10	0.0	0.738551	0.197419	-0.247701	0.296493	2010-1536	18	2010-05-09	10
11	0.0	-0.838385	-0.197419	0.007128	-0.031210	2010-1536	12	2010-05-09	11
12	0.0	-1.756743	-2.270317	-0.219189	0.288690	2010-1536	1	2010-05-09	12
13	-1.0	0.977937	-1.974189	0.290826	-0.522764	2010-1536	4	2010-05-09	13
14	0.0	-0.707421	1.480641	-0.020671	0.101432	2010-1536	24	2010-05-09	14
15	1.0	-0.053137	-0.493547	-0.314705	0.522764	2010-1536	16	2010-05-09	15
16	0.0	-0.580214	-0.000000	0.240573	-0.257481	2010-1536	2	2010-05-09	16
17	0.0	0.092856	-0.987094	0.085537	-0.039012	2010-1536	10	2010-05-09	17
18	0.0	0.426707	0.296128	0.194597	-0.444739	2010-1536	19	2010-05-09	18
19	0.0	-0.620469	-0.690966	-0.130088	0.280888	2010-1536	14	2010-05-09	19
20	0.0	1.233424	-0.789675	-0.467959	0.358912	2010-1536	6	2010-05-09	20
21	0.0	-1.247379	3.553539	0.512866	-0.944096	2010-1536	23	2010-05-09	21
22	0.0	-0.806717	0.987094	-0.226673	0.234073	2010-1536	8	2010-05-09	22
23	0.0	-0.923726	-0.987094	-0.028869	0.054617	2010-1536	5	2010-05-09	23
24	0.0	-0.028447	0.690966	3.404022	-0.351110	2010-1536	25	2010-05-09	24
25	0.0	0.417582	-0.493547	0.304013	-0.296493	2010-1536	34	2010-05-09	25
26	0.0	1.211954	-0.690966	0.390620	-0.163851	2010-1536	30	2010-05-09	26
27	-1.0	-0.675753	0.789675	0.293678	-0.553974	2010-1536	32	2010-05-09	27
28	0.0	-1.510917	-0.000000	1.684369	-0.366715	2010-1536	38	2010-05-09	28
29	1.0	0.103054	NaN	0.346069	-2.161277	2010-1536	28	2010-05-09	29
...	...	...	...	...	...	...	...	...	...
18418	1.0	0.082121	-0.493547	-1.420273	0.312098	2016-M052	30	2016-02-15	18649
18419	-1.0	0.646769	NaN	-0.429111	0.405727	2016-M052	29	2016-02-15	18650
18420	-1.0	0.476623	0.493547	1.313351	-0.218468	2016-M052	27	2016-02-15	18651
18421	-1.0	-0.932851	0.987094	0.000000	0.007802	2016-M052	26	2016-02-15	18652
18422	1.0	0.936071	NaN	-0.083399	0.366715	2016-M052	25	2016-02-15	18653
18423	0.0	2.239271	NaN	0.853589	-0.101432	2016-M052	28	2016-02-15	18654
18424	1.0	0.153507	-0.197419	-1.442014	0.374517	2016-M052	23	2016-02-15	18655
18425	0.0	0.208254	NaN	0.170361	-0.312098	2016-M052	21	2016-02-15	18656
18426	1.0	1.648859	NaN	0.270511	-2.286116	2016-M052	24	2016-02-15	18657
18427	1.0	1.726149	-0.690966	-0.018533	0.031210	2016-M052	22	2016-02-15	18658
18428	0.0	0.351563	0.690966	-0.364602	0.733430	2016-M052	18	2016-02-15	18659
18429	-1.0	-0.484138	NaN	0.026018	-0.093629	2016-M052	19	2016-02-15	18660
18430	0.0	0.205571	NaN	-0.466534	0.553974	2016-M052	17	2016-02-15	18661
18431	1.0	1.135737	-1.283223	1.228527	-0.319900	2016-M052	20	2016-02-15	18662
18432	-1.0	1.001016	NaN	1.468744	-0.475949	2016-M052	15	2016-02-15	18663
18433	-1.0	0.697759	NaN	-0.040630	0.148246	2016-M052	14	2016-02-15	18664
18434	0.0	0.619396	NaN	0.037779	-0.163851	2016-M052	2	2016-02-15	18665
18435	-1.0	0.167999	0.789675	0.001782	-0.007802	2016-M052	11	2016-02-15	18666
18436	1.0	1.787874	NaN	1.002923	-2.567004	2016-M052	16	2016-02-15	18667
18437	-1.0	-0.814768	NaN	-0.114050	0.436937	2016-M052	13	2016-02-15	18668
18438	0.0	-2.115821	NaN	-0.081973	0.335505	2016-M052	7	2016-02-15	18669
18439	-1.0	0.386451	0.493547	-0.240573	0.179456	2016-M052	6	2016-02-15	18670
18440	0.0	-0.421876	0.296128	-0.239860	0.210666	2016-M052	8	2016-02-15	18671
18441	0.0	-0.266222	-0.000000	-0.116901	0.491554	2016-M052	12	2016-02-15	18672
18442	0.0	0.586655	0.987094	0.084112	-0.452542	2016-M052	3	2016-02-15	18673
18443	-1.0	-0.991355	NaN	-0.662556	0.468147	2016-M052	1	2016-02-15	18674
18444	0.0	1.492131	NaN	-0.413429	0.335505	2016-M052	9	2016-02-15	18675
18445	0.0	-0.827650	NaN	0.101575	-0.608591	2016-M052	5	2016-02-15	18676
18446	-1.0	0.480380	0.789675	0.067717	-0.234073	2016-M052	4	2016-02-15	18677
18447	0.0	-2.651485	NaN	-1.515077	3.799791	2016-M052	10	2016-02-15	18678