Predicting Professional Tennis Match Outcomes
Author: Carl Toews
Date: August 8, 2017
Project Description: This project explores various machine learning techniques on professional tennis data. The data set was compiled by Jeff Sackmann (https://github.com/JeffSackmann/tennis_atp) and consists of Association of Tennis Professionals (ATP) match outcomes for all major ATP matches since 1968. In all there are over half a million records. Features change slightly over the years, but by 2017 include almost 50 elements, including rank, age, seed, and match statistics for both the winner and loser.
Technical specs: I've tested this notebook on Python 3.6.2 and MySQL 5.7.16.
Outline:
I. MySQL database setup
II. One-dimensional logistic regression demo: predicting with rank differences
III. $n$-dimensional logistic regression demo: predicting with other features
IV. SVM demo: age and height difference
V. TODOs
The data on Sackmann's page is in CSV format. While it's easy enough to read such data directly into a pandas data structure, one of my goals in this project is to get some practice with the pandas-MySQL interface. To set the stage for this work, I first need to transfer all the data from CSV files to a MySQL database. Since this process is involved, I have included it in a separate notebook called setup_mysql_database.
If the data is already in a MySQL database, skip this notebook, but if it is not, or you are not sure, run it. It will first check to see whether the CSV data has already been put into a MySQL database, and if it has not, it will create such a database. In order to run the notebook, you will need the username and password of a MySQL user that has permission to create databases and tables.
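For orientation, here is a minimal sketch (assuming the same credentials configured below) of the kind of existence check that notebook performs; this is an illustration, not the notebook's actual code:
In [ ]:
# hedged sketch: list the existing MySQL databases and look for `tennis`
from sqlalchemy import create_engine
eng = create_engine('mysql+mysqldb://testuser:test623@localhost/')
with eng.connect() as conn:
    dbs = [row[0] for row in conn.execute("SHOW DATABASES")]
print('tennis' in dbs)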
Once the MySQL database is in place, you can run the following cells, which establish the connectivity needed to execute the remainder of this notebook.
In [743]:
# name of database
db_name = "tennis"
# name of db user
username = "testuser"
# db password for db user
password = "test623"
# location of atp data files
atpfile_directory = "./data/tennis_atp-master/"
# location of odds data files
oddsfiles_directory = "./data/odds_data/"
In [744]:
import sqlalchemy # pandas-mysql interface library
import sqlalchemy.exc # exception handling
from sqlalchemy import create_engine # needed to define db interface
import sys # for defining behavior under errors
from IPython.core.debugger import Tracer
#%qtconsole
In [745]:
# create an engine for interacting with the MySQL database
connection = None
try:
    eng_str = 'mysql+mysqldb://' + username + ':' + password + '@localhost/' + db_name
    engine = create_engine(eng_str)
    connection = engine.connect()
    version = connection.execute("SELECT VERSION()")
    print("Database version : ")
    print(version.fetchone())
# report what went wrong if this fails
except sqlalchemy.exc.DatabaseError as e:
    print("Error %s:" % str(e))
    sys.exit(1)
# close the connection
finally:
    if connection:
        connection.close()
    else:
        print("Failed to create connection.")
We start out by performing logistic regression on rank alone, with the aim of seeing to what extent rank difference can be used as a predictive tool. Our work flow will involve SQL queries, pandas data frames, and numpy arrays, more or less as follows:
In [746]:
import numpy as np # numerical libraries
import scipy as sp
import pandas as pd # for data analysis
import pandas.io.sql as sql # for interfacing with MySQL database
from scipy import linalg # linear algebra libraries
from scipy import optimize
import matplotlib as mpl # a big library with plotting functionality
import matplotlib.pyplot as plt # a subset of matplotlib with most of the useful tools
import IPython as IP
%matplotlib inline
import pdb
In [ ]:
# extract from MySQL database info on rank points for both winner and loser, store in dataframe
with engine.begin() as connection:
    rawdata = pd.read_sql_query("""SELECT winner_rank_points, loser_rank_points FROM matches \
                                   WHERE tourney_date < '20150101' \
                                   AND tourney_date > '20110101' \
                                   AND winner_rank_points IS NOT NULL \
                                   AND loser_rank_points IS NOT NULL""", connection)
If the winner had a higher rank than the loser, the 'rank prediction' is considered a 'success'; otherwise it is a 'failure'. For analysis, we store only the absolute value of the rank difference, along with the outcome of the match. We also scale the data for numerical well-behavedness.
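As a toy illustration of this encoding (made-up numbers, not from the database):
In [ ]:
# two hypothetical matches: winner ranked 1200 points vs 800, and winner 500 vs 2000
rank_diff_example = np.array([1200 - 800, 500 - 2000])
print(rank_diff_example > 0)       # [ True False ]: one success, one failure
print(np.abs(rank_diff_example))   # [ 400 1500 ]: the feature we regress on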
In [ ]:
# winner rank minus loser rank
rank_diff = (rawdata['winner_rank_points'] - rawdata['loser_rank_points']).values
# index variable: True if higher ranked player won, False otherwise
y = (rank_diff > 0)
# final dataset with one column: absolute value of the rank difference
X = np.abs(rank_diff)
# for numerical well-behavedness, we need to scale the data
#X1=(X-np.mean(X,0))/np.std(X,axis=0)
X=X/np.std(X,axis=0)
In [ ]:
def sigmoid(z):
    '''
    Usage: sigmoid(z)
    Description: Computes the value of the sigmoid function for a scalar.
    For a vector or matrix, computes the value of the sigmoid function for each entry.
    '''
    return 1/(1+np.exp(-z))
The cost function is designed to take a regularization parameter lambda. For a non-regularized solution, lambda can be set equal to 0. The cost function returns both a cost and the gradient for any given value of parameters $\theta$.
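Concretely, for $n$ observations the regularized cost implemented below is

$$J(\theta) \;=\; \frac{1}{n}\Bigl[-y^{T}\log\bigl(\sigma(X\theta)\bigr) \;-\; (1-y)^{T}\log\bigl(1-\sigma(X\theta)\bigr) \;+\; \frac{\lambda}{2}\sum_{j}\theta_{j}^{2}\Bigr],$$

where $\sigma$ is the sigmoid above and the sum over $j$ omits the offset term when one is present.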
In [ ]:
# define a cost function
def costFunction(theta,X,y,lam,offset=True):
    '''
    Computes the cost and gradient for logistic regression.
    Input:
        theta (mx1 numpy array of parameters)
        X (nxm numpy array of feature values, first column all 1s)
        y (nx1 boolean array of outcomes, 1=higher ranked player won, 0 otherwise)
        lam (scalar: regularization parameter)
        offset (bool: True if first element of theta represents a translation, False otherwise)
    Output:
        cost (scalar value of cost)
        grad (gradient of the cost with respect to theta)
    '''
    # number of data points
    n = len(y)
    # number of parameters
    m = len(theta)
    # make sure vectors are column vectors for use of "np.dot"
    theta = theta.reshape(-1,1)
    y = y.reshape(-1,1)
    X = X.reshape(-1,1) if m==1 else X
    # input to sigmoid function will be a column vector
    z = np.dot(X,theta) if m > 1 else X*theta
    # indices of the parameters that get regularized (skip the offset if present)
    regterms = np.arange(1,m) if offset else np.arange(0,m)
    # cost function
    J = (1/n)*(-np.dot(y.transpose(),np.log(sigmoid(z))) - \
        np.dot((1-y.transpose()),np.log(1-sigmoid(z))) + \
        (lam/(2))*np.sum(theta[regterms]**2))
    # gradient (the offset term, if present, is not regularized)
    reggrad = np.insert(theta[regterms],0,0) if offset else theta.flatten()
    grad = (1/n)*np.sum((sigmoid(z) - y)*X,0) + (lam/n)*reggrad
    return np.squeeze(J), np.squeeze(grad)
Small test: make sure the cost function works.
In [ ]:
# check that cost function works
theta = np.array([1.0])
lam = 0
cost, grad = costFunction(theta, X, y*1,lam)
print("cost:", cost)
print("grad:", grad)
For diagnostic purposes, we define a callback function that will print information about the state and gradient as the optimization algorithm proceeds.
In [ ]:
def callbackF(theta):
    global NFeval
    global X
    global y
    global lam
    cost,grad = costFunction(theta,X,y*1,lam)
    print("%4d %3.6f %3.6f %3.6f" % \
          (NFeval, theta[0], cost, grad))
    NFeval+=1
In [ ]:
NFeval = 1
initial_theta = np.array([.1])
print("iter theta cost grad ")
res = sp.optimize.minimize(lambda t: costFunction(t,X,y*1,lam), initial_theta, method='CG',\
jac=True,options={'maxiter':100,'disp':True},callback=callbackF)
In [ ]:
# histogram empirical success probabilities by binned rank difference
hist, bin_edges = np.histogram(X,bins=100)
p = [np.sum(y[np.where((X>=bin_edges[i]) & (X<bin_edges[i+1]))[0]])/hist[i] for i in np.arange(len(bin_edges)-1)]
bar_pos = np.arange(len(p))
bar_width = np.diff(bin_edges)
plt.bar(bin_edges[0:-1], p, width=bar_width, align='edge', alpha=0.5)
# superimpose the fitted logistic curve
r = np.arange(X.min(),X.max(),.1)
s = 1/(1+np.exp(-res.x*r))
plt.plot(r,s)
We'll evaluate the success of our methods by comparing our predictions to those implied by the betting markets. In order to do this, we need to establish a 1-1 correspondence between matches in the `odds` database and matches in the `matches` database.
Data wrangling
Matching is not straightforward, unfortunately. The `odds` data contains the variables `Location` and `Tournament`, while the `matches` data contains `tourney_name`. But the names are not consistent across these datasets: for example, the tournament "French Open" in `odds` is called "Roland Garros" in `matches`. The `Location` variable in `odds` is generally a pretty close match to the `tourney_name` in `matches`, but not always. Moreover, the `date` variables differ slightly between these datasets: sometimes every match in a tournament is pegged with a single date (generally the start date), sometimes each match has the date it was actually played.
The following code attempts to work through these and related issues. There is a fair bit of ugly data wrangling involved: my solution is ultimately to build a lookup table connecting `Location` in `odds` to `tourney_name` in `matches`. The part of the table where there is an actual match between these variables is easy to build; the other part involves some manual inspection of the underlying CSV files. The code builds the easy part of the table, and then flags which matches still need to be identified manually. I focus on data in the 2010-2016 range.
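As a toy illustration of the lookup idea (made-up rows; the real table is read from tourney_lookup.csv below):
In [ ]:
# two locations that match the tournament name directly, one that does not
toy_lookup = pd.DataFrame({'o_name': ['doha', 'brisbane', 'paris'],
                           'm_name': ['doha', 'brisbane', 'roland garros']})
print(toy_lookup.loc[toy_lookup.o_name == 'paris', 'm_name'].iloc[0])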
Get the data from the appropriate timeframe
In [803]:
# focus on most recent data; exclude Davis Cup stuff
startdate = '20100101'
enddate = '20161231'
with engine.begin() as connection:
    odds = pd.read_sql_query("""SELECT * FROM odds \
                                WHERE DATE >= '""" + startdate + """' \
                                AND DATE <= '""" + enddate + """';""", connection)
with engine.begin() as connection:
    matches = pd.read_sql_query("""SELECT * FROM matches \
                                   WHERE tourney_date >= '""" + startdate + """' \
                                   AND tourney_date <= '""" + enddate + """' \
                                   AND tourney_name NOT LIKE 'Davis%%';""", connection)
# view results
IP.display.display(odds[0:3])
IP.display.display(matches[0:3])
Tidy up the strings (strip whitespace, convert to lowercase, replace dashes with spaces)
In [879]:
odds[['Location','Winner','Loser']] = \
odds[['Location','Winner','Loser']].apply(lambda x: x.str.strip().str.lower().str.replace('-',' '),axis=1)
matches[['tourney_name','winner_name','loser_name']] = \
matches[['tourney_name','winner_name','loser_name']].apply(lambda x: x.str.strip().str.lower().str.replace('-',' '),axis=1)
Group both `odds` and `matches` by unique tournament
In [805]:
g_matches = matches.groupby('tourney_id')
g_odds= odds.groupby(['ATP','fname'])
Extract features to assist in "tourney-matching".
In [806]:
def extract_odds_features(group):
    sizes = len(group)
    min_date = group['Date'].min()
    max_date = group['Date'].max()
    location = group['Location'].unique()[0]
    return pd.Series({'size': sizes,'min_date':min_date, 'max_date':max_date,'location':location})

def extract_matches_features(group):
    sizes = len(group)
    min_date = group['tourney_date'].min()
    max_date = group['tourney_date'].max()
    tourney_name = group['tourney_name'].unique()[0]
    return pd.Series({'size': sizes,'min_date':min_date, 'max_date':max_date,'tourney_name':tourney_name})
g_odds = g_odds.apply(extract_odds_features).reset_index()
g_matches = g_matches.apply(extract_matches_features).reset_index()
Define a lookup table connecting `Location` in `odds` to `tourney_name` in `matches`
In [807]:
tourney_lookup = pd.read_csv('tourney_lookup.csv')
print("Snapshot of lookup table:")
IP.display.display(tourney_lookup.sort_values('o_name')[15:25])
Define a function that takes as input a set of features from one tournament in `odds`, and tries to match it to one tournament in `matches`.
In [808]:
def get_tourney_ID(o_row):
    """
    function: get_tourney_ID(o_row)
    Input: row from dataframe g_odds
    Output: a Series object with two elements: 1) a match ID,
    and 2) a flag of True if the sizes of the two tournaments are identical
    """
    # calculate the difference in start/stop dates between this tournament and those in `matches`
    min_date_delta = np.abs(g_matches['min_date'] - o_row['min_date']).apply(lambda x: x.days)
    max_date_delta = np.abs(g_matches['max_date'] - o_row['max_date']).apply(lambda x: x.days)
    # find a list of candidate tournament names, based on lookup table
    mtchs = (tourney_lookup['o_name']==o_row['location'])
    if sum(mtchs)>0:
        m_name = tourney_lookup.loc[mtchs,'m_name']
    else:
        print('no match found for record {}'.format(o_row['location']))
        return pd.Series({'ID':'None','size':'NA'})
    # the "right" tournament has the right name, and reasonably close start or stop dates
    idx = ((min_date_delta <=3) | (max_date_delta <=1)) & (g_matches['tourney_name'].isin(m_name))
    record = g_matches.loc[idx,'tourney_id']
    # if there are no matches, print some diagnostic information and don't assign a match
    if len(record)<1:
        print("Warning: no match found for `odds` match {}, year {}".format(o_row.ATP, o_row.fname))
        print("min date delta: {}, max date delta: {}, g_matches: {}".format(np.min(min_date_delta), \
              np.min(max_date_delta), \
              g_matches.loc[g_matches['tourney_name'].isin(m_name),'tourney_name']))
        return pd.Series({'ID':'None','size':'NA'})
    # if there are too many matches, print a warning and don't assign a match
    elif (len(record)>1):
        print("Warning: multiple matches found for `odds` match {}".format(o_row.ATP))
        return pd.Series({'ID':'Multiple','size':'NA'})
    # otherwise, assign a match, and check if the sizes of the tournaments are consistent (a good double-check)
    else:
        size_flag = (g_matches.loc[idx,'size']==o_row['size'])
        return pd.Series({'ID':record.iloc[0],'size':size_flag.iloc[0]})
Perform the matching.
In [809]:
# add columns to g_odds to hold match ID and also info about size-correspondence
g_odds.insert(len(g_odds.columns),'ID','None')
g_odds.insert(len(g_odds.columns),'sizes_match','NA')
# perform the match
g_odds[['ID','sizes_match']] = g_odds.apply(get_tourney_ID,axis=1).values
Merge match numbers back into the bigger `odds` table if the previous routine produced no errors.
In [810]:
# merge on ATP and year
if sum(g_odds['sizes_match']==True) == len(g_odds):
    odds = pd.merge(g_odds[['ATP','fname','ID','size']],odds,how='inner',on=['ATP','fname'])
else:
    print("At least one tournament in `odds` is matched to a tournament in `matches` of a different size.")
In [811]:
matches = pd.merge(g_matches[['tourney_id','size']],matches,how='inner',on=['tourney_id'])
In [812]:
# match numbers to odds
odds.insert(5,'match_num',0)
grouped = odds[['ID','match_num']].groupby('ID')
odds['match_num'] = grouped.transform(lambda x: 1+np.arange(len(x)))
In [840]:
# add keys to both odds and match data
odds.insert(len(odds.columns),'key',np.arange(len(odds)))
matches.insert(len(matches.columns),'key',np.arange(len(matches)))
Assign a match_id number to each individual match
The `matches` dataset includes a unique match number for each match within a tournament. The `odds` dataset does not. To assign such numbers, we write a function that takes as input all the records for a given tournament from both `matches` and `odds`, tries to match player names, and returns either the match number (if the player name match was successful) or `nan` (if not).
<img src="./aux/example_of_name_matching_problems.png" width="500" height="500">
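A minimal sketch of the name-matching idea just described (toy inputs; the helper name and behavior here are illustrative, not the final implementation):
In [ ]:
# return the match number when the last names agree, nan otherwise
def toy_name_match(m_last, o_last, match_num):
    return match_num if m_last == o_last else np.nan
print(toy_name_match('federer', 'federer', 7))   # 7
print(toy_name_match('del potro', 'potro', 7))   # nan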
In [ ]:
g_matches = matches.groupby('tourney_id')
g_odds= odds.groupby(['ATP','fname'])
In [814]:
# create a lookup table to be able to match on rounds
m_rounds = ['R128','R64','R32','R16','QF','SF','F','RR']
o_rounds = ['1st Round','2nd Round','3rd Round','4th Round', \
'Quarterfinals','Semifinals','The Final','Round Robin']
round_lookup_small = pd.DataFrame({'m_rounds': m_rounds[2:-1],\
'o_rounds':o_rounds[0:2]+o_rounds[4:-1]})
round_lookup_medium = pd.DataFrame({'m_rounds': m_rounds[1:-1],\
'o_rounds':o_rounds[0:3]+o_rounds[4:-1]})
round_lookup_large = pd.DataFrame({'m_rounds': m_rounds[0:-1],\
'o_rounds':o_rounds[0:-1]})
round_lookup_RR = pd.DataFrame({'m_rounds':m_rounds[5:],\
'o_rounds':o_rounds[5:]})
In [815]:
# figure out how many discrete sizes there are
print("size in odds: ", odds['size'].unique())
print("size in matches: ", matches['size'].unique())
In [816]:
def map_rounds(x):
    cur_name = x['Round']
    t_size = x['size']
    if t_size in [27,31]:
        new_name = round_lookup_small.loc[round_lookup_small.o_rounds==cur_name,'m_rounds']
    elif t_size in [47,55]:
        new_name = round_lookup_medium.loc[round_lookup_medium.o_rounds==cur_name,'m_rounds']
    elif t_size in [95, 127]:
        new_name = round_lookup_large.loc[round_lookup_large.o_rounds==cur_name,'m_rounds']
    else:
        new_name = round_lookup_RR.loc[round_lookup_RR.o_rounds==cur_name,'m_rounds']
    return new_name.iloc[0]
In [817]:
# translate round identifiers appropriately
odds.insert(4,'round','TBD')
odds['round'] = odds.apply(map_rounds,axis=1).values
IP.display.display(odds[0:4])
A quick sanity check: are there the same number of matches in each set?
In [819]:
t1=odds.ID.drop_duplicates().sort_values()
t2=matches.tourney_id.drop_duplicates().sort_values()
m_sizes=matches.loc[matches.tourney_id.isin(t1),['tourney_id','size']].drop_duplicates()
o_sizes=odds.loc[odds.ID.isin(t2),['ID','size']].drop_duplicates()
comp = pd.merge(o_sizes,m_sizes,how='outer',left_on='ID',right_on='tourney_id')
print('sum of sizes of tournaments in odds: ', np.sum(o_sizes['size']))
print('sum of sizes of tournaments in matches: ', np.sum(m_sizes['size']))
In [825]:
matches = matches.loc[matches.tourney_id.isin(t1),:]
print("number of records in `odds`: ", len(odds))
print("number of records in `matches`: ", len(matches))
Now add player names to the mix, and try to match names, rounds, and match IDs.
In [882]:
# extract dataframe with player names split into discrete 'words'
m_players = pd.merge(matches.winner_name.str.split(pat=' ',expand=True), \
matches.loser_name.str.split(pat=' ',expand=True), \
how='inner',left_index=True, right_index=True,suffixes=('_W','_L'))
# join tournament, round, and match identifiers
m_players = pd.merge(matches[['tourney_id','match_num', 'round','key']], m_players,\
how='inner',left_index=True, right_index=True).sort_values(['tourney_id','round','1_W','1_L'])
# extract dataframe with player names split into discrete 'words'
o_players = pd.merge(odds.Winner.str.split(pat=' ',expand=True), \
odds.Loser.str.split(pat=' ',expand=True), \
how='inner',left_index=True, right_index=True,suffixes=('_W','_L'))
# join tournament and round identifiers
o_players = pd.merge(odds[['ID','round','match_num','key']], o_players,\
how='inner',left_index=True, right_index=True).sort_values(['ID','round','0_W','0_L'])
print("m_players: ")
IP.display.display(m_players[0:5])
print("o_players")
IP.display.display(o_players[0:5])
Match individual matches as follows: assuming there is a match on both tournament identifier and round, merge on the players' last names.
In [883]:
A = pd.merge(m_players[['tourney_id','round','key','1_W','1_L']],\
o_players[['ID','round','key','0_W','0_L']],how='inner',\
left_on=['tourney_id','round','1_W','1_L'],\
right_on=['ID','round','0_W','0_L'],suffixes=['_m','_o'])
m_extras = m_players.loc[~m_players.key.isin(A.key_m),:]
o_extras = o_players.loc[~o_players.key.isin(A.key_o),:]
IP.display.display(m_extras[0:10])
IP.display.display(o_extras[0:10])
Is it possible for the same player to play multiple times in a single round, for a single tournament?
In [871]:
m_extras.loc[m_extras[['tourney_id','round','0_W','1_W','2_W']].duplicated(keep=False),:]
In [899]:
def comp_str_lists(a,b):
    # return True if any element of a also appears in b
    for i in a:
        if i in b:
            return True
    return False
In [494]:
A = pd.merge(m_players[['tourney_id','round','key','1_W','1_L']],\
             o_players[['ID','round','key','0_W','0_L']],how='inner',\
             left_on=['tourney_id','round','1_W','1_L'],\
             right_on=['ID','round','0_W','0_L'])
B = pd.merge(m_players,o_players,how='inner',\
             left_on=['tourney_id','round','1_W','1_L'],\
             right_on=['ID','round','0_W','0_L'])
# match on winner last names only, and on loser last names only
wdf = pd.merge(m_players,o_players,how='inner',left_on=['tourney_id','round','1_W'],right_on=['ID','round','0_W'])
ldf = pd.merge(m_players,o_players,how='inner',left_on=['tourney_id','round','1_L'],right_on=['ID','round','0_L'])
IP.display.display(wdf[1:15])
IP.display.display(ldf[1:15])
In [515]:
wldf = pd.merge(wdf,ldf,how='outer',on=['tourney_id','match_num'])
IP.display.display(wldf.loc[wldf.tourney_id=='2010-1536'])
In [516]:
wldf.columns
In general, we'll try to match the last name in `odds` with the last name in `matches` (e.g. column `0_W` with column `1_W`). Long names are problematic. The following code flags names with more than two "words" and tries to build some structure to accommodate such names.
In [539]:
# identify potentially problematic names (generally, those of more than two 'words')
m_bad_winners = m_players.loc[pd.notnull(m_players.loc[:,'2_W']),\
['tourney_id','match_num','0_W','1_W','2_W','3_W','4_W']].drop_duplicates(subset=['0_W','1_W','2_W','3_W','4_W'])
m_bad_losers = m_players.loc[pd.notnull(m_players.loc[:,'2_L']),\
['tourney_id','match_num','0_L','1_L','2_L','3_L','4_L']].drop_duplicates(subset=['0_L','1_L','2_L','3_L','4_L'])
o_bad_winners = o_players.loc[pd.notnull(o_players.loc[:,'2_W']),\
['ID','0_W','1_W','2_W','3_W','4_W']].drop_duplicates(subset=['0_W','1_W','2_W','3_W','4_W'])
o_bad_losers = o_players.loc[pd.notnull(o_players.loc[:,'2_L']),\
['ID','0_L','1_L','2_L','3_L','4_L']].drop_duplicates(subset=['0_L','1_L','2_L','3_L','4_L'])
print("m_bad_winners")
IP.display.display(m_bad_winners)
print("o_bad_winners")
IP.display.display(o_bad_winners)
In [ ]:
temp=pd.merge(o_bad_winners,m_bad_winners,how="inner",left_on=['0_W','1_W'],right_on=['1_W','2_W'])
temp
In [ ]:
len(o_bad_losers)
In [ ]:
m_rounds = matches['round'].unique()
o_rounds = odds['Round'].unique()
print('m_rounds: ', m_rounds)
print('o_rounds: ',o_rounds)
In [ ]:
# eliminate "BR", since it corresponds to a match for which there are no odds
m_rounds = m_rounds[m_rounds!='BR']
m_rounds
In [ ]:
# look up the index of 'R16' in one of the round lookup tables above
t=round_lookup_large.index[round_lookup_large.m_rounds=='R16'].tolist()
t
In [ ]:
def build_round_idx(tourney_id,m_data,o_data):
    # TODO: not yet implemented
    pass
In [ ]:
def get_match_ID(tourney_id, m_data, o_data):
    """
    function: get_match_ID(tourney_id, m_data, o_data)
    Input: a tournament ID, plus the records for that tournament from `matches` and `odds`
    Output: match IDs from the corresponding tournament in `matches` (work in progress)
    """
    m = len(m_data)
    n = len(o_data)
    if (m!=n):
        print("Warning: different number of matches in tournament {}".format(tourney_id))
    o_WSplitNames = o_data.Winner.str.split(pat=' ',expand=True)
    o_LSplitNames = o_data.Loser.str.split(pat=' ',expand=True)
    # for each odds record, find the `matches` record whose winner and loser names
    # contain the first "word" of the odds winner and loser names
    records = [m_data.loc[m_data.winner_name.str.contains(o_WSplitNames.iloc[i,0]) &
                          m_data.loser_name.str.contains(o_LSplitNames.iloc[i,0]),'match_num']
               for i in range(n)]
    return records
Two-dimensional problem: Rank + Height
In [ ]:
# extract from MySQL database info on rank points and height for both winner and loser, store in dataframe
with engine.begin() as connection:
    rawdata = pd.read_sql_query("""SELECT winner_rank_points, loser_rank_points, winner_ht, loser_ht FROM matches \
                                   WHERE tourney_date < '20150101' \
                                   AND tourney_date > '20110101' \
                                   AND winner_rank_points IS NOT NULL \
                                   AND loser_rank_points IS NOT NULL \
                                   AND winner_ht IS NOT NULL \
                                   AND loser_ht IS NOT NULL""", connection)
In [ ]:
# winner rank minus loser rank
rank_diff = (rawdata['winner_rank_points'] - rawdata['loser_rank_points']).values
# winner height minus loser height
ht_diff = (rawdata['winner_ht']-rawdata['loser_ht']).values
# index variable: True if higher ranked player won, False otherwise
y = (rank_diff > 0)
# higher ranked height minus lower ranked height
rank_ht_diff = np.where(y==True, ht_diff,-ht_diff)
# final dataset with two cols: difference in rankings, high ranked height minus low ranked height
X = np.column_stack([np.abs(rank_diff), rank_ht_diff])
In [ ]:
# for numerical well-behavedness, we need to scale and center the data
X=(X-np.mean(X,axis=0))/np.std(X,axis=0)
# prepend column of 1s to X
X=np.insert(X,0,1,axis=1)
Balance the numbers of correct and incorrect predictions via oversampling.
In [ ]:
# number of "True" minus the number of "False"
nsamps = sum(y) - sum(~y)
# the smaller class can be associated with either "True" or "False"
smallclass = bool(nsamps<0)
# sample with replacement from data of the appropriate class
samps = X[np.random.choice(np.where(y==smallclass)[0],size=nsamps),:]
# augment the old data
X1 = np.concatenate((X,samps),axis=0)
y1 = np.concatenate((y,np.zeros(nsamps,dtype=bool)))
In [ ]:
# plot the normalized data
fig, ax = plt.subplots(1,1)
ax.plot(X1[y1,1],X1[y1,2],"ro")
ax.plot(X1[~y1,1],X1[~y1,2],"bo")
ax.set_xlabel('Rank difference')
ax.set_ylabel('Height')
ax.set_title('Higher-rank-wins as a function of rank difference and height')
ax.legend(['High rank wins','Low rank wins'])
To perform the regression, we'll need to define the sigmoid function and a cost function. The former can take a scalar, vector, or matrix, and return the elementwise value of
$$ \frac{1}{1+e^{-z}} $$
In [ ]:
def sigmoid(z):
    '''
    Usage: sigmoid(z)
    Description: Computes the value of the sigmoid function for a scalar.
    For a vector or matrix, computes the value of the sigmoid function for each entry.
    '''
    return 1/(1+np.exp(-z))
The cost function is designed to take a regularization parameter lambda. For a non-regularized solution, lambda can be set equal to 0. The cost function returns both a cost and the gradient for any given value of parameters $\theta$.
In [ ]:
# define a cost function
def costFunction(theta,X,y,lam):
    '''
    Computes the cost and gradient for logistic regression.
    Input:
        theta (3x1 vector of parameters)
        X (nx3 matrix of feature values, first column all 1s)
        y (nx1 binary vector of outcomes, 1=higher ranked player won, 0 otherwise)
        lam (scalar: regularization parameter)
    Output:
        cost (scalar value of cost)
        grad (gradient of the cost with respect to theta)
    '''
    # number of data points
    m = len(y)
    # make sure vectors are column vectors
    theta = theta.reshape(-1,1)
    y = y.reshape(-1,1)
    # input to sigmoid function will be a column vector
    z = np.dot(X,theta)
    # cost function (the offset theta[0] is not regularized)
    J = (1/m)*np.sum(np.dot(-y.transpose(),np.log(sigmoid(z))) - \
        np.dot((1-y.transpose()),np.log(1-sigmoid(z)))) + \
        (lam/(2*m))*np.sum(theta[1:]**2)
    # gradient
    regterm = np.insert(theta[1:],0,0)
    grad = (1/m)*np.sum((sigmoid(z) - y)*X,0) + (lam/m)*regterm
    return J, grad
Small test: make sure the cost function works.
In [ ]:
# check that cost function works
theta = np.array([1,2,3])
lam = 0
cost, grad = costFunction(theta, X1, y1*1,lam)
print("cost:", cost)
print("grad:", grad)
For diagnostic purposes, we define a callback function that will print information about the state and gradient as the optimization algorithm proceeds.
In [ ]:
def callbackF(theta):
    global NFeval
    global X1
    global y1
    global lam
    cost,grad = costFunction(theta,X1,y1*1,lam)
    print("%4d %3.6f %3.6f %3.6f %3.6f %3.6f %3.6f %3.6f" % \
          (NFeval, theta[0], theta[1], theta[2], cost, grad[0], grad[1], grad[2]))
    NFeval+=1
Finally, we run the optimization.
In [ ]:
# run optimization
NFeval = 1
# three parameters: offset, rank-difference coefficient, height coefficient
initial_theta = np.array([-5.,4.,3.])
#print("iter t1 t2 t3 cost grad1 grad2 grad3")
#res = sp.optimize.minimize(lambda t: costFunction(t,X1,y1*1,lam), initial_theta, method='CG',\
#                           jac=True,options={'maxiter':100,'disp':True}, callback=callbackF)
res = sp.optimize.minimize(lambda t: costFunction(t,X1,y1*1,lam), initial_theta, method='CG',\
                           jac=True,options={'maxiter':100,'disp':True})
To see how it did, we replot the data with the logistic classifier superimposed over the top.
In [ ]:
# plot the normalized data with regression line
theta = res.x
fig, ax = plt.subplots(1,1)
ax.plot(X1[y1,1],X1[y1,2],"ro",markerfacecolor="None")
ax.plot(X1[~y1,1],X1[~y1,2],"bo",markerfacecolor="None")
xplot = np.array([-3,3])
yplot = (-1/theta[2])*(theta[1]*xplot+theta[0])
ax.plot(xplot,yplot,'g',linewidth=2)
ax.set_xlabel('Rank difference')
ax.set_ylabel('Height')
ax.set_title('Higher-rank-wins as a function of rank difference and height')
ax.set_ylim((-5,5))
Also develop a quantitative measure of success: count the number of correct predictions, and compare to what would have been predicted by rank alone.
In [ ]:
rank_success_rate = np.sum(y)/len(y)
y_pred = (np.dot(X,theta)>0)
prediction_success_rate = np.sum(~(y^y_pred))/len(y)
print("prediction success rate: ", prediction_success_rate)
print("rank success rate: ", rank_success_rate)
Try to duplicate the above with sklearn's canned logistic algorithm
In [ ]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(C=1., solver='lbfgs')
t = lr.fit(X1,y1*1)
In [ ]:
t=lr.get_params()
In [ ]:
t=lr.decision_function(X)
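A quick hedged comparison (assuming the cells above have run): `lr.coef_` should roughly agree with the hand-rolled `res.x`, up to how the intercept is handled (X1 carries a column of 1s and sklearn also fits its own intercept), and `lr.score` gives training accuracy directly.
In [ ]:
# compare sklearn's fit to the conjugate-gradient solution above
print("sklearn intercept/coefficients:", lr.intercept_, lr.coef_)
print("hand-rolled theta:             ", res.x)
print("sklearn training accuracy:     ", lr.score(X1, y1*1))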
Comments:
The above procedure turns tennis match outcomes into training data by assigning 1s to matches in which the higher ranked opponent won.
This section uses some basic scikit-learn functionality to train an SVM classifier on age and height data. The example is again a trivial one, but defines a workflow and sets the stage for investigating more complex relations.
In [ ]:
# we'll use the SVM package in the scikit library
from sklearn import svm
After fitting the SVM classifier, we'll need some helper functions to form contour plots. These helper functions are borrowed from the scikit-learn documentation, http://scikit-learn.org/stable/auto_examples/svm/plot_iris.html#sphx-glr-auto-examples-svm-plot-iris-py.
In [ ]:
# produce a dense grid of points in rectangle around the data
def make_meshgrid(x, y, h=.02):
    """Create a mesh of points to plot in

    Parameters
    ----------
    x: data to base x-axis meshgrid on
    y: data to base y-axis meshgrid on
    h: stepsize for meshgrid, optional

    Returns
    -------
    xx, yy : ndarray
    """
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return xx, yy

# produce a contour plot with predicted outcomes from SVM classifier
def plot_contours(ax, clf, xx, yy, **params):
    """Plot the decision boundaries for a classifier.

    Parameters
    ----------
    ax: matplotlib axes object
    clf: a classifier
    xx: meshgrid ndarray
    yy: meshgrid ndarray
    params: dictionary of params to pass to contourf, optional
    """
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out
We'll run our test on a slightly different set of data than last time. Here, we'll still classify matches as `1` if the higher ranked player wins and `0` otherwise, but we'll focus on age and height as our predictive features.
In [ ]:
# extract from MySQL database info on rank points, age, and height for both winner and loser, store in dataframe
with engine.begin() as connection:
    rawdata = pd.read_sql_query("""SELECT winner_rank_points, loser_rank_points, winner_age, loser_age, winner_ht, loser_ht \
                                   FROM matches \
                                   WHERE tourney_date > '20170101' \
                                   AND winner_rank_points IS NOT NULL \
                                   AND loser_rank_points IS NOT NULL \
                                   AND winner_age IS NOT NULL \
                                   AND loser_age IS NOT NULL \
                                   AND winner_ht IS NOT NULL \
                                   AND loser_ht IS NOT NULL""", connection)
In [ ]:
# list of pairs connecting `matches` tourney_name to `odds` Location
tourney_pairs = \
[('Brisbane','Brisbane') ,
('Chennai','Chennai') ,
('Doha','Doha') ,
('Auckland','Auckland') ,
('Sydney','Sydney') ,
('Australian Open','Melbourne') ,
('Johannesburg','Johannesburg') ,
('Santiago','Santiago') ,
('Zagreb','Zagreb') ,
('Costa Do Sauipe','Costa Do Sauipe') ,
('Rotterdam','Rotterdam') ,
('San Jose','San Jose') ,
('Buenos Aires','Buenos Aires') ,
('Marseille','Marseille') ,
('Memphis','Memphis') ,
('Acapulco','Acapulco') ,
('Delray Beach','Delray Beach') ,
('Dubai','Dubai') ,
('Indian Wells Masters','Indian Wells') ,
('Miami Masters','Miami') ,
('Casablanca','Casablanca') ,
('Houston','Houston') ,
('Monte Carlo Masters','Monte Carlo') ,
('Barcelona','Barcelona') ,
('Rome Masters','Rome') ,
('Belgrade','Belgrade') ,
('Munich','Munich') ,
('Estoril','Estoril') ,
('Madrid Masters', 'Madrid') ,
('Roland Garros','Paris') ,
('Nice','Nice') ,
("Queen's Club",'Queens Club') ,
('Halle','Halle') ,
('s-Hertogenbosch',"'s-Hertogenbosch") ,
('Eastbourne','Eastbourne') ,
('Wimbledon','London') ,
('Newport','Newport') ,
('Stuttgart','Stuttgart') ,
('Bastad','Bastad') ,
('Hamburg','Hamburg') ,
('Atlanta','Atlanta') ,
('Los Angeles','Los Angeles') ,
('Gstaad', 'Gstaad') ,
('Washington','Washington') ,
('Umag', 'Umag') ,
('Canada Masters','Toronto') ,
('Canada Masters','Montreal') ,
('Cincinnati Masters','Cincinnati') ,
('US Open','New York') ,
('Metz','Metz') ,
('Bucharest','Bucharest') ,
('Kuala Lumpur','Kuala Lumpur') ,
('Bangkok', 'Bangkok') ,
('Tokyo','Tokyo') ,
('Beijing', 'Beijing') ,
('Moscow','Moscow') ,
('Shanghai Masters','Shanghai') ,
('Montpellier','Montpellier') ,
('Stockholm','Stockholm') ,
('Vienna','Vienna') ,
('St. Petersburg', 'St. Petersburg') ,
('Basel','Basel') ,
('Valencia','Valencia') ,
('Paris Masters','Paris') ,
('Tour Finals','London'),
('New Haven', 'New Haven'),
('Kitzbuhel','Kitzbuhel'),
('Winston-Salem','Winston-Salem'),
('Sao Paulo','Sao Paulo'),
('Bogota','Bogota'),
('Vina del Mar','Vina del Mar'),
('Santiago','Vina del Mar'),
('Estoril','Oeiras'),
('Power Horse Cup','Dusseldorf'),
('Rio de Janeiro','Rio de Janeiro'),
('Dusseldorf','Dusseldorf'),
('Shenzhen','Shenzhen'),
('Quito','Quito'),
('Istanbul','Istanbul'),
('Geneva','Geneva'),
('London','Queens Club'),
('Nottingham','Nottingham'),
('Sofia','Sofia'),
('Marrakech','Marrakech'),
('Los Cabos','Los Cabos'),
('Us Open','New York'),
('St.Petersburg','St. Petersburg'),
('Chengdu','Chengdu'),
('Antwerp','Antwerp'),
('London','London')]
# use the name pairs as columns in a dataframe
tourney_lookup = pd.DataFrame(tourney_pairs,columns = ['m_name','o_name'])
tourney_lookup.m_name = tourney_lookup.m_name.str.lower().str.strip()
tourney_lookup.o_name = tourney_lookup.o_name.str.lower().str.strip()
In [ ]:
# this nx2 array contains the differences in ages and the differences in height
X = pd.concat([rawdata.iloc[:,2]-rawdata.iloc[:,3], \
rawdata.iloc[:,4]-rawdata.iloc[:,5]], axis=1).values
# this nx1 binary array indicates whether the match was a "success" or a "failure", as predicted by ranking differences
y = (rawdata.iloc[:,0]-rawdata.iloc[:,1]).values > 0
In [ ]:
# for numerical well-behavedness, we need to scale and center the data
X=(X-np.mean(X,axis=0))/np.std(X,axis=0)
In [ ]:
# plot the normalized data
fig, ax = plt.subplots(1,1)
ax.plot(X[y,0],X[y,1],"ro")
ax.plot(X[~y,0],X[~y,1],"bo")
ax.set_xlabel('Age')
ax.set_ylabel('Height')
ax.set_title('Higher-rank-wins as a function of age and height')
We'll use the scikit-learn svm package to train an SVM classifier on this data. We'll plot the results as a contour graph.
In [ ]:
# find the SVM classifier
clf = svm.SVC()
clf.fit(X, y)
# generate a dense grid for producing a contour plot
X0, X1 = X[:, 0], X[:, 1]
xx, yy = make_meshgrid(X0, X1)
# feed the grid into the plot_contours routine
fig, ax = plt.subplots(1, 1)
plot_contours(ax, clf, xx, yy,
              cmap=plt.cm.coolwarm, alpha=0.8)
ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set_xlabel('Age difference')
ax.set_ylabel('Height difference')
ax.set_xticks(())
ax.set_yticks(())
ax.set_title('SVM classifier for height/age data')
Note that these features produce a fairly mixed set of points, so they are unlikely to be highly predictive. But it is interesting to note the pocket of blue in the lower left corner: it suggests that, all other things being equal, players who are younger and shorter are likely to fare worse than predicted.
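As a rough, hedged sanity check on that impression (assuming the cells above have run), we can compare the classifier's training accuracy to the rank-only baseline; note this is resubstitution accuracy, not a cross-validated estimate:
In [ ]:
# mean training accuracy of the SVM vs. the fraction of matches won by the higher-ranked player
print("SVM training accuracy: ", clf.score(X, y))
print("rank-only baseline:    ", np.sum(y)/len(y))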
In [ ]:
# name of database
db_name = "tennis"
# name of db user
username = "testuser"
# db password for db user
password = "test623"
# location of atp data files
atpfile_directory = "../data/tennis_atp-master/"
# location of odds data files
oddsfiles_directory = "../data/odds_data/"
#%%
#
# PACKAGES
#
import sqlalchemy # pandas-mysql interface library
import sqlalchemy.exc # exception handling
from sqlalchemy import create_engine # needed to define db interface
import glob # for file manipulation
import sys # for defining behavior under errors
#%%
#
# This cell tries to connect to the mysql database "db_name" with the login
# info supplied above. If it succeeds, it prints out the version number of
# mysql, if it fails, it exits gracefully.
#
# create an engine for interacting with the MySQL database
connection = None
try:
    eng_str = 'mysql+mysqldb://' + username + ':' + password + '@localhost/' + db_name
    engine = create_engine(eng_str)
    connection = engine.connect()
    version = connection.execute("SELECT VERSION()")
    print("Database version : ")
    print(version.fetchone())
# report what went wrong if this fails
except sqlalchemy.exc.DatabaseError as e:
    print("Error %s:" % str(e))
    sys.exit(1)
# close the connection
finally:
    if connection:
        connection.close()
    else:
        print("Failed to create connection.")
In [ ]:
# extract odds info from the MySQL database
with engine.begin() as connection:
    rawdata = pd.read_sql_query("""SELECT PSW, PSL, WRank, LRank FROM odds \
                                   WHERE PSW IS NOT NULL \
                                   AND PSL IS NOT NULL \
                                   AND WRank IS NOT NULL \
                                   AND LRank IS NOT NULL;""", connection)
In [ ]:
# stake one unit on the higher-ranked player in every match
investment = len(rawdata)
# bets that pay off: the better-ranked (numerically smaller WRank) player won
good_call_idx = (rawdata["LRank"]-rawdata["WRank"]>0)
winner_odds = rawdata["PSW"]
# total returned: the quoted odds times the stake, for each winning bet
gain = sum(winner_odds*good_call_idx)
roi = gain - investment
print("total invested: ", investment)
print("return on investment: ", roi)
In [ ]:
np.min(winner_odds)
In [ ]:
rawdata
This work is just the beginning of this analysis. The gold standard for professional tennis match prediction is to beat the betting market. There is odds data available for many of these matches, and the proper metric for predictive success is probably percentage increase in profit.
Further work includes the following: using `join` commands to extract more complex subsets of the data (e.g. court-type specific data, players of a certain origin, players with certain tournament play patterns, etc.), as sketched below.
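One hedged example of that direction (the `surface` column exists in Sackmann's CSV data, but whether it was carried into this database is an assumption):
In [ ]:
# court-type specific subset: matches played on clay
with engine.begin() as connection:
    clay = pd.read_sql_query("""SELECT winner_name, loser_name, tourney_name \
                                FROM matches \
                                WHERE surface = 'Clay';""", connection)
print(len(clay))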
In [ ]:
n=10
X = np.random.normal(0,1,(n,2))
y = np.random.choice(a=[False, True], size=n)
X1= np.insert(X,0,1,axis=1)
In [ ]:
n0=100
n1=100
m0 = 0
m1 = 0
s0 = 1
s1 = 1
X = np.concatenate((np.random.normal(m0,s0,(n0,2)), \
np.concatenate((np.random.normal(0,s1,size=(n1,1)), \
np.random.normal(m1,s1,size=(n1,1))),axis=1)),axis=0)
X1 = np.insert(X,0,1,axis=1)
y = np.concatenate((np.ones((n0),dtype=bool),np.zeros((n1),dtype=bool)))
In [ ]:
with engine.begin() as connection:
    rawdata = pd.read_sql_query("""SELECT DISTINCT tourney_date, winner_name, winner_rank_points, winner_rank, \
                                   loser_name, loser_rank_points, loser_rank FROM matches \
                                   WHERE tourney_date > '20150101' \
                                   AND tourney_date < '20160101' \
                                   AND tourney_name = "Wimbledon"; """, connection)
In [ ]:
with engine.begin() as connection:
    rawdata = pd.read_sql_query("""SELECT odds.Date, matches.tourney_date, odds.Tournament, \
                                   SUBSTRING(odds.winner,1,LOCATE(' ',odds.winner)) AS 'Winner', \
                                   SUBSTRING(odds.loser,1,LOCATE(' ',odds.loser)) AS 'Loser', \
                                   odds.PSW, odds.PSL, \
                                   matches.winner_rank_points, matches.loser_rank_points \
                                   FROM odds \
                                   INNER JOIN matches \
                                   ON odds.Date = matches.tourney_date \
                                   AND odds.Tournament = matches.tourney_name \
                                   AND odds.Tournament = "Wimbledon" \
                                   AND odds.Date > '20150101' \
                                   ; """, connection)
In [ ]:
with engine.begin() as connection:
    rawdata = pd.read_sql_query("""SELECT odds.Date, matches.tourney_date, odds.Tournament, \
                                   SUBSTRING(odds.winner,1,LOCATE(' ',odds.winner)), \
                                   SUBSTRING(odds.loser,1,LOCATE(' ',odds.loser)) \
                                   FROM odds \
                                   INNER JOIN matches \
                                   ON (odds.Date = matches.tourney_date \
                                   AND odds.Tournament = matches.tourney_name \
                                   AND odds.Tournament = "Wimbledon" \
                                   AND matches.winner_name REGEXP SUBSTRING(odds.winner,1,LOCATE(' ',odds.winner)) \
                                   AND matches.loser_name REGEXP SUBSTRING(odds.loser,1,LOCATE(' ',odds.loser))) \
                                   ; """, connection)
In [ ]:
with engine.begin() as connection:
    rawdata = pd.read_sql_query("""SELECT (matches.winner_name REGEXP SUBSTRING(odds.winner,1,LOCATE(' ',odds.winner))) AS 'Namematch', \
                                   SUBSTRING(odds.winner,1,LOCATE(' ',odds.winner)) AS 'Winner', \
                                   SUBSTRING(odds.loser,1,LOCATE(' ',odds.loser)) AS 'Loser', \
                                   matches.winner_name, matches.loser_name \
                                   FROM matches \
                                   INNER JOIN odds \
                                   ON (odds.Tournament="Wimbledon" AND \
                                   odds.Winner = "Tomic B.") \
                                   ; """, connection)
In [ ]:
# create an engine for interacting with the MySQL database
db_name = "test"
try:
    engine.execute("CREATE DATABASE IF NOT EXISTS " + db_name) # create db
    engine.execute("USE " + db_name) # select new db
# report what went wrong if this fails
except sqlalchemy.exc.DatabaseError as e:
    print("Error %s:" % str(e))
    sys.exit(1)
In [ ]:
with engine.begin() as connection:
    connection.execute("CREATE TABLE matches (tourney_name VARCHAR(256));")
    connection.execute("ALTER TABLE matches ADD COLUMN tourney_date DATE;")
    connection.execute("ALTER TABLE matches ADD COLUMN winner_name VARCHAR(256);")
    connection.execute("ALTER TABLE matches ADD COLUMN winner_rank_points SMALLINT UNSIGNED;")
    connection.execute("ALTER TABLE matches ADD COLUMN loser_name VARCHAR(256);")
    connection.execute("ALTER TABLE matches ADD COLUMN loser_rank_points SMALLINT UNSIGNED;")
In [ ]:
with engine.begin() as connection:
    query=("""CREATE TABLE odds
              (Tournament VARCHAR(256),
               Date DATE,
               Winner VARCHAR(256),
               Loser VARCHAR(256),
               PSW DECIMAL(5,3),
               PSL DECIMAL(5,3))
              ;""")
    connection.execute(query)
In [ ]:
with engine.begin() as connection:
    query=("""CREATE TABLE t1
              (name VARCHAR(256),
               id tinyint unsigned)
              ;""")
    connection.execute(query)
In [ ]:
with engine.begin() as connection:
    query=("""CREATE TABLE t2
              (name VARCHAR(256),
               id tinyint unsigned)
              ;""")
    connection.execute(query)
In [ ]:
with engine.begin() as connection:
    connection.execute("""insert into t1 (name, id) values \
                          ("Bob the Blob", 3), \
                          ("Pedro el Gato", 4), \
                          ("Josie Slatterly", 5), \
                          ("Urs Burs", 4);""")
In [ ]:
with engine.begin() as connection:
    connection.execute("""insert into t2 (name, id) values \
                          ("Blob B.", 3), \
                          ("Gato P.", 5), \
                          ("Slatterly J.", 5), \
                          ("Burs U.", 3);""")
In [ ]:
with engine.begin() as connection:
    rawdata = pd.read_sql_query("""SELECT t1.name, t1.id, t2.name, t2.id, \
                                   SUBSTRING(t2.name,1,LOCATE(' ',t2.name)) \
                                   FROM t1 \
                                   JOIN t2 \
                                   ON (t1.name REGEXP SUBSTRING(t2.name,1,LOCATE(' ',t2.name))) \
                                   ; """, connection)
In [ ]:
with engine.begin() as connection:
    connection.execute("""insert into matches values ('Wimbledon', '20150629', \
                          'Jarkko Nieminen', 564, 'Lleyton Hewitt', 118);""")
    connection.execute("""insert into matches values ('Wimbledon', '20150629', \
                          'Pierre Hugues Herbert', 353, 'Hyeon Chung', 79);""")
    connection.execute("""insert into matches values ('Wimbledon', '20160629', \
                          'Bernard Tomic', 1355, 'Jan Lennard Struff', 112);""")
In [ ]:
with engine.begin() as connection:
    connection.execute("""insert into odds values ('Wimbledon', '20150629', \
                          'Przysiezny M.','Ljubicic I.', 4.880, 1.230);""")
    connection.execute("""insert into odds values ('Wimbledon', '20150629', \
                          'Lopez F.','Levine J.', 1.270, 4.300);""")
    connection.execute("""insert into odds values ('Wimbledon', '20160621', \
                          'Lu Y.H.','Zeballos H.', 1.910, 2.020);""")
    connection.execute("""insert into odds values ('Wimbledon', '20160629', \
                          'Tomic B.','Struff J.L.', 1.910, 2.020);""")
In [ ]:
with engine.begin() as connection:
    connection.execute("""CREATE VIEW odds_names AS ( \
                          SELECT \
                          SUBSTRING(t2.name,1,LOCATE(' ',t2.name)) AS 'Name' \
                          FROM t2 \
                          );""")
In [ ]:
with engine.begin() as connection:
    connection.execute("""select * from t1 inner join odds_names on (t1.name regexp odds_names.Name)""")
In [ ]:
# number of "True" minus the number of "False"
nsamps = sum(y) - sum(~y)
# the smaller class can be associated with either "True" or "False"
smallclass = bool(nsamps<0)
# sample with replacement from data of the appropriate class
samps = X1[np.random.choice(np.where(y==smallclass)[0],size=nsamps)]
# augment the old data
X1 = np.concatenate((X,samps),axis=0)
y1 = np.concatenate((y,np.zeros(nsamps,dtype=bool)))
In [ ]:
import IPython
IPython.display.display(matches)
In [ ]:
startdate = '20100101'
enddate = '20171231'
with engine.begin() as connection:
    matchdata = pd.read_sql_query("""SELECT tourney_date AS Date, tourney_name AS Tournament, \
                                     winner_id AS WID, winner_name AS WName, winner_rank_points AS WPts, \
                                     loser_id AS LID, loser_name AS LName, loser_rank_points AS LPts \
                                     FROM matches \
                                     WHERE tourney_date < '""" + enddate + """' \
                                     AND tourney_date > '""" + startdate + """' \
                                     AND tourney_level = 'G' \
                                     AND winner_rank_points IS NOT NULL \
                                     AND loser_rank_points IS NOT NULL \
                                     ;""", connection)
    oddsdata = pd.read_sql_query("""SELECT Date, Tournament, \
                                    Winner, WPts, Loser, LPts, \
                                    PSW, PSL \
                                    FROM odds \
                                    WHERE Date < '""" + enddate + """' \
                                    AND Date > '""" + startdate + """' \
                                    AND Series = 'Grand Slam' \
                                    ;""", connection)