Predicting Professional Tennis Match Outcomes

Author: Carl Toews
Date: August 8, 2017

Project Description: This project explores various machine learning techniques on professional tennis data. The data set was compiled by Jeff Sackman (https://github.com/JeffSackmann/tennis_atp) and consists of the Association of Tennis Professionals (ATP) match outcomes for all major ATP matches since 1968. In all there are over half a million records. Features change slightly over the years, but by 2017 include almost 50 elements, including rank, age, seed, and match statistics for both the winner and loser.

Technical specs: I've tested this notebook on Python 3.6.2 and MySQL 5.7.16.

Outline:
I. MySQL database setup
II. One-dimensional logistic regression demo: predicting with rank differences
III. $n$-dimensional logistic regression demo: predicting with other features
III. SVM demo: age and height difference
IV. TODOs

I: MySQL database setup

The data on Sackman's page is in CSV format. While it's easy enough to read such data directly into a pandas data structure, one of my goals in this project is to get some practice with the Pandas-MySQL interface. To set the stage for this work, I first need to transfer all the data from CSV files to a MySQL database. Since this process is involved, I have included it in a separate notebook called setup_mysql_database.

If the data is already in a MySQL database, skip this notebook, but if it is not, or you are not sure, run it. It will first check to see whether the CSV data has already been put into a MySQL database, and if it has not, it will create such a database. In order to run the notebook, you will need the username and password of a MySQL user that has permission to create databases and tables.

Once the MySQL database is in place, you can run the following cells, which establish the connectivity needed to execute the remainder of this notebook.

Configure parameters

Adjust as necessary to accomodate different paths, directories, and usernames.


In [743]:
# name of database
db_name = "tennis"
# name of db user
username = "testuser"
# db password for db user
password = "test623"
# location of atp data files
atpfile_directory = "./data/tennis_atp-master/"
# location of odds data files
oddsfiles_directory = "./data/odds_data/"

Import packages

All import statements related to connectivity and error analysis are here.


In [744]:
import sqlalchemy # pandas-mysql interface library
import sqlalchemy.exc # exception handling
from   sqlalchemy import create_engine  # needed to define db interface
import sys # for defining behavior under errors
from IPython.core.debugger import Tracer
#%qtconsole

Establish database connection

Try to connect to the tennis database on the local mysql host. If successful, print out the MySQL version number, if unsuccessful, exit gracefully.


In [745]:
# create an engine for interacting with the MySQL database
try:
    eng_str = 'mysql+mysqldb://' + username + ':' + password + '@localhost/' + db_name
    engine = create_engine(eng_str)
    connection = engine.connect()
    version = connection.execute("SELECT VERSION()")
    print("Database version : ")
    print(version.fetchone())
    
# report what went wrong if this fails.    
except sqlalchemy.exc.DatabaseError as e:
    reason = e.message
    print("Error %s:" % (reason))
    sys.exit(1)

# close the connection
finally:            
    if connection:    
        connection.close()
    else:
        print("Failed to create connection.")


Database version : 
('5.7.16',)

II. One-dimensional Logistic regression: rank as predictor

We start out by performing logistic regression on rank alone, with the aim of seeing to what extent rank difference can be used as predictive tool. The development is similar to the one here:

  1. Clarke and Kyte, "Using official ratings to simulate major tennis tournaments", International Transactions in Operational Research, 2000.

Our work flow will involve SQL queries, pandas data frames, and numpy arrays, more or less as follows:

  1. use SQL query to extract data and store it in a pandas data frame
  2. use the pandas data frame to manipulate data and extract features
  3. extract preprocessed data into a numpy array for computational work

Import Statements

We will need some scientific computing libraries (scipy), some data handling libraries (pandas), and some plotting functionality (matplotlib)


In [746]:
import numpy as np # numerical libraries
import scipy as sp
import pandas as pd # for data analysis
import pandas.io.sql as sql # for interfacing with MySQL database
from scipy import linalg # linear algebra libraries
from scipy import optimize
import matplotlib as mpl # a big library with plotting functionality
import matplotlib.pyplot as plt # a subset of matplotlib with most of the useful tools
import IPython as IP
%matplotlib inline 
import pdb

Data extraction

We'll focus on data since 2011, since that is the data for which I have information on the betting markets. The MySQL query avoids any match for which one or another of the players has no rank points. We extract winner rank points and loser rank points.


In [ ]:
# extract from MySQL database info on rank points and height for both winner and loser, store in dataframe
with engine.begin() as connection:
    rawdata = pd.read_sql_query("""SELECT winner_rank_points, loser_rank_points FROM matches \
                            WHERE tourney_date < '20150101' \
                            AND tourney_date > '20110101'
                            AND winner_rank_points IS NOT NULL \
                            AND loser_rank_points IS NOT NULL""", connection)

Data processing

If the winner had a higher rank than the loser, the 'rank prediction' was considered a 'success', otherwise it is a 'failure'. For analysis, we store only the absolute value of the rank difference, along with the outcome of the match. We also scale the data for numerical well-behavedness.


In [ ]:
# winner rank minus loser rank
rank_diff = (rawdata['winner_rank_points'] - rawdata['loser_rank_points']).values
# index variable:  True if higher ranked player won, False otherwise
y = (rank_diff > 0)
# final dataset with two cols: difference in rankings, high ranked height minus low ranked height  
X = np.abs(rank_diff)
# for numerical well-behavedness, we need to scale and center the data
#X1=(X-np.mean(X,0))/np.std(X,axis=0)
X=X/np.std(X,axis=0)

Define helper functions

To perform the regression, we'll need to define the sigmoid function and a cost function. The former can take a scalar, vector, or matrix, and return the elementwise value of

$$ \frac{1}{1+e^{-z}} $$

In [ ]:
def sigmoid(z):
    '''
    Usage:  sigmoid(z)
    
    Description:  Computes value of sigmoid function for scalar.  
    For vector or matrix, computes values of sigmoid function for each entry.
    '''

    return 1/(1+np.exp(-z));

The cost function is designed to take a regularization parameter lambda. For a non-regularized solution, lambda can be set equal to 0. The cost function returns both a cost and the gradient for any given value of parameters $\theta$.


In [ ]:
# define a cost function
def costFunction(theta,X,y,lam,offset=True):
    '''
    Computes the cost and gradient for logistic regression.
    
    Input: 
           theta (mx1 numpy array of parameters)
           X (nxm numpy array of feature values, first column all 1s)
           y (nx1 boolean array of outcomes, 1=higher ranked player won, 0 otherwise)
           lam (scalar:  regularization paramter)
           offset (bool:  True of first element of theta represents a translation, False otherwise)
           
    Output:  
           cost (scalar value of cost)
    '''

    # number of data points
    n = len(y) 
    #number of parameters
    m = len(theta)   
 
    # make sure vectors are column vectors for use of "np.dot"
    theta = theta.reshape(-1,1)
    y = y.reshape(-1,1)
    X = X.reshape(-1,1) if m==1 else X
    
    # input to sigmoid function will be a column vector
    z = np.dot(X,theta) if len(theta) > 1 else X*theta
    
    # cost function
    regterms = np.arange(1,m) if offset else np.arange(0,m)
    #Tracer()() 
    J = (1/n)*(-np.dot(y.transpose(),np.log(sigmoid(z))) - \
                     np.dot((1-y.transpose()),np.log(1-sigmoid(z))) + \
                    (lam/(2))*np.sum(theta[regterms]**2))
  
    # gradient
    #Tracer()()
    reggrad = np.insert(theta[regterms],0,0)
    grad = (1/n)*np.sum((sigmoid(z) - y)*X,0) + (lam/n)*reggrad
    
    
    return np.squeeze(J), np.squeeze(grad)

Small test: make sure the cost function works.


In [ ]:
# check that cost function works
theta = np.array([1.0])
lam = 0
cost, grad = costFunction(theta, X, y*1,lam)
print("cost:", cost)
print("grad:", grad)

For diagnostic purposes, we define a callback function that will print information about the state and gradient as the optimization algorithm proceeds.


In [ ]:
def callbackF(theta):
    global NFeval
    global X
    global y
    global lam
    cost,grad = costFunction(theta,X,y*1,lam)
    print("%4d   %3.6f  %3.6f  %3.6f" % \
          (NFeval, theta, cost, grad)) 
    NFeval+=1

Optimize

Finally, we perform the logistic regression using scipy's built-in optimization.minimize function.


In [ ]:
NFeval = 1
initial_theta = np.array([.1])
print("iter      theta     cost      grad  ")
res = sp.optimize.minimize(lambda t:  costFunction(t,X,y*1,lam), initial_theta, method='CG',\
                           jac=True,options={'maxiter':100,'disp':True},callback=callbackF)

In [ ]:
# histogram empirical success probabilities by 
hist, bin_edges = np.histogram(X,bins=100)
p = [np.sum(y[np.where((X>=bin_edges[i]) & (X<bin_edges[i+1]))[0]])/hist[i] for i in np.arange(len(bin_edges)-1)]
bar_pos = np.arange(len(p))
bar_width = np.diff(bin_edges)
plt.bar(bin_edges[0:-1], p, width=bar_width, align='edge', alpha=0.5)
r = np.arange(X.min(),X.max(),.1)
#s = 1/(1+np.exp(-res.x*r))
s = 1/(1+np.exp(-res.x*r))
plt.plot(r,s)

Compare predictions to those from the betting odds

We'll evaluate the success of our methods by comparing our predictions to those implied by the betting markets. In order to do this, we need to establish a 1-1 correspondence between matches in the odds database and matches in the matches database.

Data wrangling

Matching is not straightforward, unfortunately. The odds data contains the variables Location and Tournament, while the matches data contains tourney_name. But the names are not consistent across these datasets: for example, the tournament "French Open" in odds is called "Roland Garros" in matches. The Location variable in odds is generally a pretty close match to the tourney_name in matches, but not always. Moreoever, the date variables differ slightly between these datasets: sometimes, every match in a tournament is pegged with a single date (generally the start date), sometimes each match has the date it was actually played.

The following code attempt to work through these and related issues. There is a fair bit of ugly data wrangling involved: my solution is ultimately to build a lookup table connecting Location in odds to tourney_name in matches. The part of the table where there is an actual match between these variables is easy to build; the other part involves some manual inspection of the underlying CSV files. The code builds the easy part of the table, and then flags which matches still need to be identified manually. I focus on data in the 2010-2016 range.


Get the data from the appropriate timefame


In [803]:
# focus on most recent data; exclude Davis Cup stuff
startdate = '20100101'
enddate = '20161231'

with engine.begin() as connection:
    odds = pd.read_sql_query("""SELECT * FROM odds \
                                WHERE DATE >= '""" + startdate + """' \
                                AND DATE <= '""" + enddate + """';""", connection)
    
with engine.begin() as connection:
    matches = pd.read_sql_query("""SELECT * FROM matches \
                                WHERE tourney_date >= '""" + startdate + """' \
                                AND tourney_date <= '""" + enddate + """' \
                                AND tourney_name NOT LIKE 'Davis%%';""", connection)
    
# view results
IP.display.display(odds[0:3])
IP.display.display(matches[0:3])


ATP Location Tournament Date Series Court Surface Round BestOf Winner ... LBL PSW PSL SJW SJL MaxW MaxL AvgW AvgL fname
0 1 Brisbane Brisbane International 2010-01-04 ATP250 Outdoor Hard 1st Round 3 Gasquet R. ... 2.375 1.526 2.740 1.500 2.500 NaN NaN NaN NaN 2010
1 1 Brisbane Brisbane International 2010-01-04 ATP250 Outdoor Hard 1st Round 3 Odesnik W. ... 1.571 2.140 1.813 2.250 1.571 NaN NaN NaN NaN 2010
2 1 Brisbane Brisbane International 2010-01-04 ATP250 Outdoor Hard 1st Round 3 Gicquel M. ... 2.100 1.676 2.360 1.615 2.200 NaN NaN NaN NaN 2010

3 rows × 43 columns

tourney_id tourney_name surface draw_size tourney_level tourney_date match_num winner_id winner_seed winner_entry ... l_ace l_df l_svpt l_1stIn l_1stWon l_2ndWon l_SvGms l_bpSaved l_bpFaced match_type
0 2010-301 Auckland Hard 28 A 2010-01-11 1 103752 NaN None ... 0.0 6.0 64.0 42.0 22.0 11.0 9.0 8.0 12.0 atp_matches
1 2010-301 Auckland Hard 28 A 2010-01-11 2 104545 NaN None ... 5.0 5.0 89.0 56.0 37.0 22.0 15.0 4.0 7.0 atp_matches
2 2010-301 Auckland Hard 28 A 2010-01-11 3 104338 7.0 None ... 4.0 2.0 53.0 29.0 18.0 8.0 7.0 5.0 9.0 atp_matches

3 rows × 50 columns

Tidy up the strings (strip whitespace, convert to lowercase, replace dashes by space)


In [879]:
odds[['Location','Winner','Loser']] = \
    odds[['Location','Winner','Loser']].apply(lambda x: x.str.strip().str.lower().str.replace('-',' '),axis=1)
matches[['tourney_name','winner_name','loser_name']] = \
    matches[['tourney_name','winner_name','loser_name']].apply(lambda x: x.str.strip().str.lower().str.replace('-',' '),axis=1)

Group both odds and matches by unique tournament


In [805]:
g_matches = matches.groupby('tourney_id')
g_odds= odds.groupby(['ATP','fname'])

Extract features to assist in "tourney-matching".


In [806]:
def extract_odds_features(group):
    sizes = len(group)
    min_date = group['Date'].min()
    max_date = group['Date'].max()
    location = group['Location'].unique()[0]
    return pd.Series({'size': sizes,'min_date':min_date, 'max_date':max_date,'location':location})

def extract_matches_features(group):
    sizes = len(group)
    min_date = group['tourney_date'].min()
    max_date = group['tourney_date'].max()
    tourney_name = group['tourney_name'].unique()[0]
    return pd.Series({'size': sizes,'min_date':min_date, 'max_date':max_date,'tourney_name':tourney_name})
    
g_odds = g_odds.apply(extract_odds_features).reset_index()
g_matches = g_matches.apply(extract_matches_features).reset_index()

Define lookup-table connecting Location in odds to tourney_name in matches


In [807]:
tourney_lookup = pd.read_csv('tourney_lookup.csv')
print("Snapshot of lookup table:")
IP.display.display(tourney_lookup.sort_values('o_name')[15:25])


Snapshot of lookup table:
m_name o_name
17 casablanca casablanca
18 chengdu chengdu
19 chennai chennai
20 cincinnati masters cincinnati
21 costa do sauipe costa do sauipe
22 delray beach delray beach
23 doha doha
24 dubai dubai
25 dusseldorf dusseldorf
58 power horse cup dusseldorf

Define a function that will take as input a set of features from one tournament in odds, and try to match it to one tournament in matches.


In [808]:
def get_tourney_ID(o_row):
    """
    function:  get_tourney_ID(o_row)
    Input:   row from dataframe g_odds
    Output:  a Series object with two elements: 1) a match ID,
                and 2), a flag of True if the sizes of the two tournmanets are identical
    """

    # calculate the diffence in start/stop dates between this tournament and those in `matches`.
    min_date_delta = np.abs(g_matches['min_date'] - o_row['min_date']).apply(lambda x: x.days)
    max_date_delta = np.abs(g_matches['max_date'] - o_row['max_date']).apply(lambda x: x.days)

    # find a list of candidate tournament names, based on lookup table 
    mtchs = (tourney_lookup['o_name']==o_row['location'])
    if sum(mtchs)>0:
        m_name = tourney_lookup.loc[mtchs,'m_name']
    else:
        print('no match found for record {}'.format(o_row['location']))
        return ['Nan','Nan']
    
    # the "right" tournament has the right name, and reasonable close start or stop dates
    idx = ((min_date_delta <=3) | (max_date_delta <=1)) & (g_matches['tourney_name'].isin(m_name))
    record = g_matches.loc[idx,'tourney_id']

    #print("ATP {}, {}, {}".format(o_row['ATP'],o_row['location'],o_row['fname']))
    #pdb.set_trace()
    
    # if there are no matches, print some diagnostic information and don't assign a match
    if len(record)<1:
        print("Warning:  no match found for `odds` match {}, year {}".format(o_row.ATP, o_row.fname))
        print("min date delta: {}, max date delta: {}, g_matches: {}".format(np.min(min_date_delta), \
                                                                             np.min(max_date_delta), \
                                                                             g_matches.loc[g_matches['tourney_name'].isin(m_name),'tourney_name'])) 
        return pd.Series({'ID':'None','size':'NA'})
    
    # if there are too many matches, print a warning and don't assign a match.
    elif (len(record)>1):
        print("Warning:  multiple matches found for `odds` match {}".format(o_row.ATP))
        return pd.Series({'ID':'Multiple','size':'NA'})

    # otherwise, assign a match, and check if the sizes of the matches are consistent (a good double-check)
    else:
        size_flag = (g_matches.loc[idx,'size']==o_row['size'])
        return pd.Series({'ID':record.iloc[0],'size':size_flag.iloc[0]})

Perform the matches.


In [809]:
# add columns to g_odds to hold match ID and also info about size-correspondence
g_odds.insert(len(g_odds.columns),'ID','None')
g_odds.insert(len(g_odds.columns),'sizes_match','NA')

# perform the match
g_odds[['ID','sizes_match']] = g_odds.apply(get_tourney_ID,axis=1).values

Merge match numbers back into bigger odds table if the previous routine produced no errors.


In [810]:
# merge on ATP and year
if sum(g_odds['sizes_match']==True) == len(g_odds):
    odds = pd.merge(g_odds[['ATP','fname','ID','size']],odds,how='inner',on=['ATP','fname'])
else:
    print("At least one tournament in `odds` is matched to a tournament in `matches` of a different size.")

In [811]:
matches = pd.merge(g_matches[['tourney_id','size']],matches,how='inner',on=['tourney_id'])

In [812]:
# match numbers to odds
odds.insert(5,'match_num',0)
grouped = odds[['ID','match_num']].groupby('ID')
odds['match_num'] = grouped.transform(lambda x: 1+np.arange(len(x)))

In [840]:
# add keys to both odds and match data
odds.insert(len(odds.columns),'key',np.arange(len(odds)))
matches.insert(len(matches.columns),'key',np.arange(len(matches)))

Assign a match_id number to each individual match

The matches dataset includes a unique match number for each match within a tournament. The odds dataset does not. To assign such numbers, we write a function that takes as input all the records for a given tournament from both matches and odds, tries to match player names, and returns either the match number (if the player name match was successful) or nan (if not.)

<img src="./aux/example_of_name_matching_problems.png",width=500,height=500>


In [245]:


In [ ]:
g_matches = matches.groupby('tourney_id')
g_odds= odds.groupby(['ATP','fname'])

In [814]:
# create a lookup table to be able to match on rounds
m_rounds = ['R128','R64','R32','R16','QF','SF','F','RR']
o_rounds = ['1st Round','2nd Round','3rd Round','4th Round', \
            'Quarterfinals','Semifinals','The Final','Round Robin']

round_lookup_small = pd.DataFrame({'m_rounds': m_rounds[2:-1],\
                                   'o_rounds':o_rounds[0:2]+o_rounds[4:-1]})

round_lookup_medium = pd.DataFrame({'m_rounds': m_rounds[1:-1],\
                                   'o_rounds':o_rounds[0:3]+o_rounds[4:-1]})

round_lookup_large = pd.DataFrame({'m_rounds': m_rounds[0:-1],\
                                   'o_rounds':o_rounds[0:-1]})

round_lookup_RR  = pd.DataFrame({'m_rounds':m_rounds[5:],\
                                 'o_rounds':o_rounds[5:]})

In [815]:
# figure out how many discrete sizes there are
print("size in odds:  ", odds['size'].unique())
print("size in matches:  ", matches['size'].unique())


size in odds:   [ 31  27 127  95  55  47  15]
size in matches:   [ 55  27  31  47  95 127  15  23  26  64]

In [816]:
def map_rounds(x):

    cur_name = x['Round']
    t_size = x['size']

    if t_size in [27,31]:
        new_name = round_lookup_small.loc[round_lookup_small.o_rounds==cur_name,'m_rounds']
    elif t_size in [47,55]:
        new_name = round_lookup_medium.loc[round_lookup_medium.o_rounds==cur_name,'m_rounds']
    elif t_size in [95, 127]:
        new_name = round_lookup_large.loc[round_lookup_large.o_rounds==cur_name,'m_rounds']
    else:
        new_name = round_lookup_RR.loc[round_lookup_RR.o_rounds==cur_name,'m_rounds'] 

    return new_name.iloc[0]

In [817]:
# translate round indentifier appropriately
odds.insert(4,'round','TBD')
odds['round'] = odds.apply(map_rounds,axis=1).values

IP.display.display(odds[0:4])


ATP fname ID size round Location match_num Tournament Date Series ... LBL PSW PSL SJW SJL MaxW MaxL AvgW AvgL key
0 1 2010 2010-339 31 R32 brisbane 1 Brisbane International 2010-01-04 ATP250 ... 2.375 1.526 2.740 1.500 2.500 NaN NaN NaN NaN 0
1 1 2010 2010-339 31 R32 brisbane 2 Brisbane International 2010-01-04 ATP250 ... 1.571 2.140 1.813 2.250 1.571 NaN NaN NaN NaN 1
2 1 2010 2010-339 31 R32 brisbane 3 Brisbane International 2010-01-04 ATP250 ... 2.100 1.676 2.360 1.615 2.200 NaN NaN NaN NaN 2
3 1 2010 2010-339 31 R32 brisbane 4 Brisbane International 2010-01-04 ATP250 ... 1.500 2.580 1.581 2.500 1.500 NaN NaN NaN NaN 3

4 rows × 48 columns

A quick sanity check: are there the same number of matches in each set?


In [819]:
t1=odds.ID.drop_duplicates().sort_values()
t2=matches.tourney_id.drop_duplicates().sort_values()
m_sizes=matches.loc[matches.tourney_id.isin(t1),['tourney_id','size']].drop_duplicates()
o_sizes=odds.loc[odds.ID.isin(t2),['ID','size']].drop_duplicates()
comp = pd.merge(o_sizes,m_sizes,how='outer',left_on='ID',right_on='tourney_id')
print('sum of sizes of tournaments in odds: ', np.sum(o_sizes['size']))
print('sum of sizes of tournaments in matches: ', np.sum(m_sizes['size']))


sum of sizes of tournaments in odds:  18448
sum of sizes of tournaments in matches:  18448

In [825]:
matches = matches.loc[matches.tourney_id.isin(t1),:]
print("number of records in `odds`:  ", len(odds))
print("number of records in `matches`:  ", len(matches))


number of records in `odds`:   18448
number of records in `matches`:   18448

Now add player names to the mix, and try to match names, rounds, and matchid's.


In [882]:
# extract dataframe with player names split into discrete 'words'
m_players = pd.merge(matches.winner_name.str.split(pat=' ',expand=True), \
                     matches.loser_name.str.split(pat=' ',expand=True), \
                    how='inner',left_index=True, right_index=True,suffixes=('_W','_L'))  

# join tournament, round, and match identifiers
m_players = pd.merge(matches[['tourney_id','match_num', 'round','key']], m_players,\
                    how='inner',left_index=True, right_index=True).sort_values(['tourney_id','round','1_W','1_L'])

# extract dataframe with player names split into discrete 'words'
o_players = pd.merge(odds.Winner.str.split(pat=' ',expand=True), \
                     odds.Loser.str.split(pat=' ',expand=True), \
                    how='inner',left_index=True, right_index=True,suffixes=('_W','_L')) 

# join tournament and round identifiers
o_players = pd.merge(odds[['ID','round','match_num','key']], o_players,\
                    how='inner',left_index=True, right_index=True).sort_values(['ID','round','0_W','0_L'])




print("m_players: ")
IP.display.display(m_players[0:5])
print("o_players")
IP.display.display(o_players[0:5])


m_players: 
tourney_id match_num round key 0_W 1_W 2_W 3_W 4_W 0_L 1_L 2_L 3_L 4_L
54 2010-1536 55 F 54 rafael nadal None None None roger federer None None None
50 2010-1536 51 QF 50 nicolas almagro None None None jurgen melzer None None None
48 2010-1536 49 QF 48 roger federer None None None ernests gulbis None None None
49 2010-1536 50 QF 49 david ferrer None None None andy murray None None None
51 2010-1536 52 QF 51 rafael nadal None None None gael monfils None None None
o_players
ID round match_num key 0_W 1_W 2_W 3_W 4_W 0_L 1_L 2_L 3_L 4_L
7722 2010-1536 F 55 7722 nadal r. None None None federer r. None None None
7716 2010-1536 QF 49 7716 almagro n. None None None melzer j. None None None
7718 2010-1536 QF 51 7718 federer r. None None None gulbis e. None None None
7719 2010-1536 QF 52 7719 ferrer d. None None None murray a. None None None
7717 2010-1536 QF 50 7717 nadal r. None None None monfils g. None None None

Match individual matches as follows: assuming there is a match on both tournament number and round, then

  1. First try to match both winner and loser names
  2. In the compliment of the result, try to match just winner name or just loser names.
  3. Merge 1. and 2., see what's left over.

In [883]:
A = pd.merge(m_players[['tourney_id','round','key','1_W','1_L']],\
             o_players[['ID','round','key','0_W','0_L']],how='inner',\
             left_on=['tourney_id','round','1_W','1_L'],\
             right_on=['ID','round','0_W','0_L'],suffixes=['_m','_o'])

m_extras = m_players.loc[~m_players.key.isin(A.key_m),:]
o_extras = o_players.loc[~o_players.key.isin(A.key_o),:]

IP.display.display(m_extras[0:10])
IP.display.display(o_extras[0:10])


tourney_id match_num round key 0_W 1_W 2_W 3_W 4_W 0_L 1_L 2_L 3_L 4_L
36 2010-1536 37 R32 36 guillermo garcia lopez None None jo wilfried tsonga None None
28 2010-1536 29 R32 28 andy murray None None None juan ignacio chela None None
6 2010-1536 7 R64 6 juan ignacio chela None None pablo cuevas None None None
11 2010-1536 12 R64 11 eduardo schwank None None None paul henri mathieu None None
57 2010-1720 3 R32 57 olivier rochus None None None juan martin del potro None
101 2010-2276 20 R16 101 illya marchenko None None None ivan dodig None None None
100 2010-2276 19 R16 100 jurgen melzer None None None olivier rochus None None None
82 2010-2276 1 R32 82 marin cilic None None None jan hajek None None None
85 2010-2276 4 R32 85 ivo karlovic None None None antonio veic None None None
127 2010-301 15 R16 127 michael lammer None None None juan carlos ferrero None None
ID round match_num key 0_W 1_W 2_W 3_W 4_W 0_L 1_L 2_L 3_L 4_L
7707 2010-1536 R32 40 7707 garcia lopez g. None None tsonga j.w. None None None
7705 2010-1536 R32 38 7705 murray a. None None None chela j.i. None None None
7680 2010-1536 R64 13 7680 chela j.i. None None None cuevas p. None None None
7679 2010-1536 R64 12 7679 schwank e. None None None mathieu p.h. None None None
15205 2010-1720 R32 11 15205 rochus o. None None None del potro j.m. None None
2293 2010-2276 R16 18 2293 cilic m. None None None hajek j. None None None
2294 2010-2276 R16 19 2294 karlovic i. None None None veic a. None None None
2289 2010-2276 R32 14 2289 marchenko i. None None None dodig i. None None None
2291 2010-2276 R32 16 2291 melzer j. None None None rochus o. None None None
627 2010-301 R16 13 627 lammer m. None None None ferrero j.c. None None None

Is it possible for the same player to play multiple times in a single round, for a single tournament?


In [871]:
m_extras.loc[m_extras[['tourney_id','round','0_W','1_W','2_W']].duplicated(keep=False),:]


Out[871]:
tourney_id match_num round key 0_W 1_W 2_W 3_W 4_W 0_L 1_L 2_L 3_L 4_L
5117 2011-605 4 RR 5063 jo wilfried tsonga None None mardy fish None None None
5116 2011-605 3 RR 5062 jo wilfried tsonga None None rafael nadal None None None
7780 2012-605 11 RR 7685 juan martin del potro None roger federer None None None
7777 2012-605 8 RR 7682 juan martin del potro None janko tipsarevic None None None

In [899]:
def comp_str_lists(a,b):
    for i in a:
        if i in b:
            return True        
    return False

In [494]:
A = pd.merge(m_players[['tourney_id','round','key','1_W','1_L']],\
             o_players[['ID','round','key','0_W','0_L']],how='inner',\
             left_on=['tourney_id','round','1_W','1_L'],\
             right_on=['ID','round','0_W','0_L'])



B = pd.merge(m_players,o_players,how='inner',\
             left_on=['tourney_id','round','1_W','1_L'],\
             right_on=['ID','round','0_W','0_L'])

ldf = pd.merge(m_players,o_players,how='inner',left_on=['tourney_id','round','1_L'],right_on=['ID','Round2','0_L'])
IP.display.display(wdf[1:15])
IP.display.display(ldf[1:15])


tourney_id match_num round 0_W_x 1_W_x 2_W_x 3_W_x 4_W_x 0_L_x 1_L_x ... 0_W_y 1_W_y 2_W_y 3_W_y 4_W_y 0_L_y 1_L_y 2_L_y 3_L_y 4_L_y
1 2010-1536 2 R64 leonardo mayer None None None julien benneteau ... mayer l. None None None benneteau j. None None None
2 2010-1536 3 R64 stanislas wawrinka None None None marcel granollers ... wawrinka s. None None None granollers m. None None None
3 2010-1536 4 R64 mikhail youzhny None None None lukas lacko ... youzhny m. None None None lacko l. None None None
4 2010-1536 5 R64 ernests gulbis None None None albert montanes ... gulbis e. None None None montanes a. None None None
5 2010-1536 6 R64 feliciano lopez None None None lukasz kubot ... lopez f. None None None kubot l. None None None
6 2010-1536 8 R64 victor hanescu None None None igor andreev ... hanescu v. None None None andreev i. None None None
7 2010-1536 10 R64 david ferrer None None None jeremy chardy ... ferrer d. None None None chardy j. None None None
8 2010-1536 11 R64 marcos baghdatis None None None marco chiudinelli ... baghdatis m. None None None chiudinelli m. None None None
9 2010-1536 12 R64 eduardo schwank None None None paul henri ... schwank e. None None None mathieu p.h. None None None
10 2010-1536 13 R64 ivo karlovic None None None evgeny korolev ... karlovic i. None None None korolev e. None None None
11 2010-1536 14 R64 jurgen melzer None None None kevin anderson ... melzer j. None None None anderson k. None None None
12 2010-1536 15 R64 mardy fish None None None michael russell ... fish m. None None None russell m. None None None
13 2010-1536 16 R64 thomaz bellucci None None None pere riba ... bellucci t. None None None riba p. None None None
14 2010-1536 17 R64 juan monaco None None None simon greul ... monaco j. None None None greul s. None None None

14 rows × 25 columns

tourney_id match_num round 0_W_x 1_W_x 2_W_x 3_W_x 4_W_x 0_L_x 1_L_x ... 0_W_y 1_W_y 2_W_y 3_W_y 4_W_y 0_L_y 1_L_y 2_L_y 3_L_y 4_L_y
1 2010-1536 2 R64 leonardo mayer None None None julien benneteau ... mayer l. None None None benneteau j. None None None
2 2010-1536 3 R64 stanislas wawrinka None None None marcel granollers ... wawrinka s. None None None granollers m. None None None
3 2010-1536 4 R64 mikhail youzhny None None None lukas lacko ... youzhny m. None None None lacko l. None None None
4 2010-1536 5 R64 ernests gulbis None None None albert montanes ... gulbis e. None None None montanes a. None None None
5 2010-1536 6 R64 feliciano lopez None None None lukasz kubot ... lopez f. None None None kubot l. None None None
6 2010-1536 7 R64 juan ignacio chela None None pablo cuevas ... chela j.i. None None None cuevas p. None None None
7 2010-1536 8 R64 victor hanescu None None None igor andreev ... hanescu v. None None None andreev i. None None None
8 2010-1536 9 R64 daniel munoz de la nava sam querrey ... munoz-de la nava d. None querrey s. None None None
9 2010-1536 10 R64 david ferrer None None None jeremy chardy ... ferrer d. None None None chardy j. None None None
10 2010-1536 11 R64 marcos baghdatis None None None marco chiudinelli ... baghdatis m. None None None chiudinelli m. None None None
11 2010-1536 13 R64 ivo karlovic None None None evgeny korolev ... karlovic i. None None None korolev e. None None None
12 2010-1536 14 R64 jurgen melzer None None None kevin anderson ... melzer j. None None None anderson k. None None None
13 2010-1536 15 R64 mardy fish None None None michael russell ... fish m. None None None russell m. None None None
14 2010-1536 16 R64 thomaz bellucci None None None pere riba ... bellucci t. None None None riba p. None None None

14 rows × 25 columns


In [515]:
wldf = pd.merge(wdf,ldf,how='outer',on=['tourney_id','match_num'])
IP.display.display(wldf.loc[wldf.tourney_id=='2010-1536'])


tourney_id match_num round_x 0_W_x_x 1_W_x_x 2_W_x_x 3_W_x_x 4_W_x_x 0_L_x_x 1_L_x_x ... 0_W_y_y 1_W_y_y 2_W_y_y 3_W_y_y 4_W_y_y 0_L_y_y 1_L_y_y 2_L_y_y 3_L_y_y 4_L_y_y
0 2010-1536 1 R64 benjamin becker None None None carlos moya ... becker b. None None None moya c. None None None
1 2010-1536 2 R64 leonardo mayer None None None julien benneteau ... mayer l. None None None benneteau j. None None None
2 2010-1536 3 R64 stanislas wawrinka None None None marcel granollers ... wawrinka s. None None None granollers m. None None None
3 2010-1536 4 R64 mikhail youzhny None None None lukas lacko ... youzhny m. None None None lacko l. None None None
4 2010-1536 5 R64 ernests gulbis None None None albert montanes ... gulbis e. None None None montanes a. None None None
5 2010-1536 6 R64 feliciano lopez None None None lukasz kubot ... lopez f. None None None kubot l. None None None
6 2010-1536 8 R64 victor hanescu None None None igor andreev ... hanescu v. None None None andreev i. None None None
7 2010-1536 10 R64 david ferrer None None None jeremy chardy ... ferrer d. None None None chardy j. None None None
8 2010-1536 11 R64 marcos baghdatis None None None marco chiudinelli ... baghdatis m. None None None chiudinelli m. None None None
9 2010-1536 12 R64 eduardo schwank None None None paul henri ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10 2010-1536 13 R64 ivo karlovic None None None evgeny korolev ... karlovic i. None None None korolev e. None None None
11 2010-1536 14 R64 jurgen melzer None None None kevin anderson ... melzer j. None None None anderson k. None None None
12 2010-1536 15 R64 mardy fish None None None michael russell ... fish m. None None None russell m. None None None
13 2010-1536 16 R64 thomaz bellucci None None None pere riba ... bellucci t. None None None riba p. None None None
14 2010-1536 17 R64 juan monaco None None None simon greul ... monaco j. None None None greul s. None None None
15 2010-1536 18 R64 nicolas almagro None None None viktor troicki ... almagro n. None None None troicki v. None None None
16 2010-1536 20 R64 philipp petzschner None None None fabio fognini ... petzschner p. None None None fognini f. None None None
17 2010-1536 21 R64 gael monfils None None None stephane robert ... monfils g. None None None robert s. None None None
18 2010-1536 22 R64 john isner None None None christophe rochus ... isner j. None None None rochus c. None None None
19 2010-1536 23 R64 santiago giraldo None None None philipp kohlschreiber ... giraldo s. None None None kohlschreiber p. None None None
20 2010-1536 24 R64 alexandr dolgopolov None None None andreas seppi ... dolgopolov o. None None None seppi a. None None None
21 2010-1536 25 R32 roger federer None None None benjamin becker ... federer r. None None None becker b. None None None
22 2010-1536 26 R32 stanislas wawrinka None None None leonardo mayer ... wawrinka s. None None None mayer l. None None None
23 2010-1536 27 R32 ernests gulbis None None None mikhail youzhny ... gulbis e. None None None youzhny m. None None None
24 2010-1536 28 R32 feliciano lopez None None None oscar hernandez ... lopez f. None None None hernandez o. None None None
25 2010-1536 29 R32 andy murray None None None juan ignacio ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
26 2010-1536 30 R32 victor hanescu None None None daniel munoz ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27 2010-1536 31 R32 david ferrer None None None marcos baghdatis ... ferrer d. None None None baghdatis m. None None None
28 2010-1536 32 R32 marin cilic None None None eduardo schwank ... cilic m. None None None schwank e. None None None
29 2010-1536 33 R32 fernando verdasco None None None ivo karlovic ... verdasco f. None None None karlovic i. None None None
30 2010-1536 34 R32 jurgen melzer None None None mardy fish ... melzer j. None None None fish m. None None None
31 2010-1536 35 R32 juan monaco None None None thomaz bellucci ... monaco j. None None None bellucci t. None None None
32 2010-1536 36 R32 nicolas almagro None None None robin soderling ... almagro n. None None None soderling r. None None None
33 2010-1536 38 R32 gael monfils None None None philipp petzschner ... monfils g. None None None petzschner p. None None None
34 2010-1536 39 R32 john isner None None None santiago giraldo ... isner j. None None None giraldo s. None None None
35 2010-1536 40 R32 rafael nadal None None None alexandr dolgopolov ... nadal r. None None None dolgopolov o. None None None
36 2010-1536 41 R16 roger federer None None None stanislas wawrinka ... federer r. None None None wawrinka s. None None None
37 2010-1536 42 R16 ernests gulbis None None None feliciano lopez ... gulbis e. None None None lopez f. None None None
38 2010-1536 43 R16 andy murray None None None victor hanescu ... murray a. None None None hanescu v. None None None
39 2010-1536 44 R16 david ferrer None None None marin cilic ... ferrer d. None None None cilic m. None None None
40 2010-1536 45 R16 jurgen melzer None None None fernando verdasco ... melzer j. None None None verdasco f. None None None
41 2010-1536 46 R16 nicolas almagro None None None juan monaco ... almagro n. None None None monaco j. None None None
42 2010-1536 47 R16 gael monfils None None None guillermo garcia ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
43 2010-1536 48 R16 rafael nadal None None None john isner ... nadal r. None None None isner j. None None None
44 2010-1536 49 QF roger federer None None None ernests gulbis ... federer r. None None None gulbis e. None None None
45 2010-1536 50 QF david ferrer None None None andy murray ... ferrer d. None None None murray a. None None None
46 2010-1536 51 QF nicolas almagro None None None jurgen melzer ... almagro n. None None None melzer j. None None None
47 2010-1536 52 QF rafael nadal None None None gael monfils ... nadal r. None None None monfils g. None None None
48 2010-1536 53 SF roger federer None None None david ferrer ... federer r. None None None ferrer d. None None None
49 2010-1536 54 SF rafael nadal None None None nicolas almagro ... nadal r. None None None almagro n. None None None
50 2010-1536 55 F rafael nadal None None None roger federer ... nadal r. None None None federer r. None None None
17434 2010-1536 7 NaN NaN NaN NaN NaN NaN NaN NaN ... chela j.i. None None None cuevas p. None None None
17435 2010-1536 9 NaN NaN NaN NaN NaN NaN NaN NaN ... munoz-de la nava d. None querrey s. None None None

53 rows × 48 columns


In [516]:
wldf.columns


Out[516]:
Index(['tourney_id', 'match_num', 'round_x', '0_W_x_x', '1_W_x_x', '2_W_x_x',
       '3_W_x_x', '4_W_x_x', '0_L_x_x', '1_L_x_x', '2_L_x_x', '3_L_x_x',
       '4_L_x_x', 'ID_x', 'Round2_x', '0_W_y_x', '1_W_y_x', '2_W_y_x',
       '3_W_y_x', '4_W_y_x', '0_L_y_x', '1_L_y_x', '2_L_y_x', '3_L_y_x',
       '4_L_y_x', 'round_y', '0_W_x_y', '1_W_x_y', '2_W_x_y', '3_W_x_y',
       '4_W_x_y', '0_L_x_y', '1_L_x_y', '2_L_x_y', '3_L_x_y', '4_L_x_y',
       'ID_y', 'Round2_y', '0_W_y_y', '1_W_y_y', '2_W_y_y', '3_W_y_y',
       '4_W_y_y', '0_L_y_y', '1_L_y_y', '2_L_y_y', '3_L_y_y', '4_L_y_y'],
      dtype='object')

In general, we'll try to match the last name in odds with the last name in matches (eg. column 0_W with column 1_W. Long names are problematic. The following code flags names with more than two "words" and tries to build some structure to accomodate such names.


In [539]:
# identify potentially problematic names (generally, those of more than two 'words')
m_bad_winners = m_players.loc[pd.notnull(m_players.loc[:,'2_W']),\
                              ['tourney_id','match_num','0_W','1_W','2_W','3_W','4_W']].drop_duplicates(subset=['0_W','1_W','2_W','3_W','4_W'])
m_bad_losers = m_players.loc[pd.notnull(m_players.loc[:,'2_L']),\
                             ['tourney_id','match_num','0_L','1_L','2_L','3_L','4_L']].drop_duplicates(subset=['0_L','1_L','2_L','3_L','4_L'])
o_bad_winners = o_players.loc[pd.notnull(o_players.loc[:,'2_W']),\
                              ['ID','0_W','1_W','2_W','3_W','4_W']].drop_duplicates(subset=['0_W','1_W','2_W','3_W','4_W'])
o_bad_losers = o_players.loc[pd.notnull(o_players.loc[:,'2_L']),\
                             ['ID','0_L','1_L','2_L','3_L','4_L']].drop_duplicates(subset=['0_L','1_L','2_L','3_L','4_L'])

print("m_bad_winners")
IP.display.display(m_bad_winners)

print("o_bad_winners")
IP.display.display(o_bad_winners)


m_bad_winners
tourney_id match_num 0_W 1_W 2_W 3_W 4_W
6 2010-1536 7 juan ignacio chela None None
8 2010-1536 9 daniel munoz de la nava
18 2010-1536 19 guillermo garcia lopez None None
64 2010-1720 10 thiemo de bakker None None
117 2010-301 5 jose rubin statham None None
231 2010-314 6 andreas haider maurer None None
234 2010-314 9 daniel gimeno traver None None
332 2010-321 18 juan carlos ferrero None None
350 2010-328 9 paul henri mathieu None None
611 2010-360 2 reda el amrani None None
656 2010-375 20 jo wilfried tsonga None None
665 2010-402 2 yen hsun lu None None
1464 2010-451 14 younes el aynaoui None None
1638 2010-5012 10 izak van der merwe None
1716 2010-505 2 ruben ramirez hidalgo None None
1754 2010-5053 9 alessio di mauro None None
1857 2010-520 23 edouard roger vasselin None None
2356 2010-580 48 juan martin del potro None
2822 2011-2276 8 alex bogomolov jr None None
3014 2011-316 2 jonathan dasnieres de veigy None
3075 2011-321 9 cedrik marcel stebe None None
3472 2011-404 1 rik de voest None None
4111 2011-440 6 jesse huta galung None None
4464 2011-506 6 juan pablo brzezicki None None
4803 2011-560 33 rogerio dutra silva None None
4821 2011-560 51 jean rene lisnard None None
5233 2011-6242 25 pierre ludovic duclos None None
6038 2012-360 8 sergio gutierrez ferrol None None
6136 2012-403 21 roberto bautista agut None None
6867 2012-468 5 inigo cervantes huegun None None
7333 2012-540 34 kenny de schepper None None
8448 2013-316 11 jan lennard struff None None
8704 2013-352 16 pierre hugues herbert None None
8747 2013-360 12 pablo carreno busta None None
8771 2013-375 9 arnau brugues davi None None
9802 2013-506 5 diego sebastian schwartzman None None
11365 2014-360 11 roberto carballes baena None None
13606 2015-315 5 john patrick smith None None
13967 2015-360 6 arthur de greef None None
15697 2015-6718 1 daniel elahi galan riveros None
15706 2015-6718 10 alejandro gomez gb42 None None
15708 2015-6718 12 adrian menendez maceiras None None
15901 2015-741 12 taylor harry fritz None None
o_bad_winners
ID 0_W 1_W 2_W 3_W 4_W
205 2010-891 de bakker t. None None
406 2010-451 el aynaoui y. None None
589 2016-0451 munoz-de la nava d. None
753 2015-301 de schepper k. None None
833 2011-338 del potro j.m. None None
1893 2010-5012 van der merwe i. None
1913 2011-5012 de voest r. None None
1923 2011-5012 huta galung j. None None
2146 2012-505 dutra silva r. None None
2152 2012-505 del bonis f. None None
2232 2015-7161 estrella burgos v. None None
2250 2016-7161 carballes baena r. None None
5681 2010-360 el amrani r. None None
5821 2015-360 de greef a. None None
6038 2016-0360 carreno busta p. None None
6387 2016-0410 bautista agut r. None None
7019 2010-5053 di mauro a. None None
11077 2011-316 dasnieres de veigy j. None
12695 2016-0319 munoz de la nava d.
14337 2011-560 dutra da silva r. None

In [ ]:
temp=pd.merge(o_bad_winners,m_bad_winners,how="inner",left_on=['0_W','1_W'],right_on=['1_W','2_W'])
temp

In [ ]:
len(o_bad_losers)

In [ ]:
m_rounds = matches['round'].unique()
o_rounds = odds['Round'].unique()
print('m_rounds:  ', m_rounds)
print('o_rounds:  ',o_rounds)

In [ ]:
# eliminate "BR", since it corresponds to a match for which there are no odds
m_rounds = m_rounds[m_rounds!='BR']
m_rounds

In [ ]:
t=round_lookup.index[round_lookup.m_rounds=='R16'].tolist()
t

In [ ]:
def build_round_idx(tourney_id,m_data,o_data):

In [ ]:
def get_match_ID(tourney_id,m_data, o_data):
    """
    function:  get_match_ID(odds_row)
    Input:  row from dataframe odds
    Output:  match ID from corresponding tournament in matches
    """

    
    
    if (m!=n):
        print("Warning:  different number of matches in tournament {}".format(match_id))
    
    o_WSplitNames = o_data.Winner.str.split(pat=' ',expand=True)
    o_LSplitNames = o_data.Loser.str.split(pat=' ',expand=True)
    
    record = m.loc[ (matches.tourney_id == odds_row.loc['ID']) & 
                                    matches.winner_name.str.contains(Winner[0]) &
                                    matches.loser_name.str.contains(Loser[0]),'match_num']
                         
    return record

Two-dimension problem: Rank + Height


In [ ]:
# extract from MySQL database info on rank points and height for both winner and loser, store in dataframe
with engine.begin() as connection:
    rawdata = pd.read_sql_query("""SELECT winner_rank_points, loser_rank_points, winner_ht, loser_ht FROM matches \
                            WHERE tourney_date < '20150101' \
                            AND tourney_date > '20110101'
                            AND winner_rank_points IS NOT NULL \
                            AND loser_rank_points IS NOT NULL \
                            AND winner_ht IS NOT NULL \
                            AND loser_ht IS NOT NULL""", connection)

In [ ]:
# winner rank minus loser rank
rank_diff = (rawdata['winner_rank_points'] - rawdata['loser_rank_points']).values
# winner height minus loser height
ht_diff = (rawdata['winner_ht']-rawdata['loser_ht']).values
# index variable:  True if higher ranked player won, False otherwise
y = (rank_diff > 0)
# higher ranked height minus lower ranked height
rank_ht_diff = np.where(y==True, ht_diff,-ht_diff)
# final dataset with two cols: difference in rankings, high ranked height minus low ranked height  
X = np.column_stack([np.abs(rank_diff), rank_ht_diff])

In [ ]:
# for numerical well-behavedness, we need to scale and center the data
X=(X-np.mean(X,axis=0))/np.std(X,axis=0)
# prepend column of 1s to X
X=np.insert(X,0,1,axis=1)

Balance the numbers of correct and incorrect predictions via oversampling.


In [ ]:
# number of "True" minus the number of "False"
nsamps = sum(y) - sum(~y)
# the smaller class can be associated with either "True" or "False"
smallclass = bool(nsamps<0)
# sample with replacement from data of the appropriate class
samps = X[np.random.choice(np.where(y==smallclass)[0],size=nsamps),:]
# augment the old data
X1 = np.concatenate((X,samps),axis=0)
y1 = np.concatenate((y,np.zeros(nsamps,dtype=bool)))

In [ ]:
# plot the normalized data
fig, ax = plt.subplots(1,1)
ax.plot(X1[y1,1],X1[y1,2],"ro")
ax.plot(X1[~y1,1],X1[~y1,2],"bo")
ax.set_xlabel('Rank difference')
ax.set_ylabel('Height')
ax.set_title('Higher-rank-wins as a function of rank difference and height')
ax.legend(['High rank wins','Low rank wins'])

To perform the regression, we'll need to define the sigmoid function and a cost function. The former can take a scalar, vector, or matrix, and return the elementwise value of

$$ \frac{1}{1+e^{-z}} $$

In [ ]:
def sigmoid(z):
    '''
    Usage:  sigmoid(z)
    
    Description:  Computes value of sigmoid function for scalar.  
    For vector or matrix, computes values of sigmoid function for each entry.
    '''

    return 1/(1+np.exp(-z));

The cost function is designed to take a regularization parameter lambda. For a non-regularized solution, lambda can be set equal to 0. The cost function returns both a cost and the gradient for any given value of parameters $\theta$.


In [ ]:
# define a cost function
def costFunction(theta,X,y,lam):
    '''
    Computes the cost and gradient for logistic regression.
    
    Input: 
           theta (3x1 vector of parameters)
           X (nx3 matrix of feature values, first column all 1s)
           y (nx1 binary vector of outcomes, 1=higher ranked player won, 0 otherwise)
           lam (scalar:  regularization paramter)
           
    Output:  
           cost (scalar value of cost)
    '''

    # number of data points
    m = len(y) 
    # make sure vectors are column vectors
    theta = theta.reshape(-1,1)
    y = y.reshape(-1,1)
    
    # input to sigmoid function will be a column vector
    z = np.dot(X,theta)
    
    # cost function
    J = (1/m)*np.sum(np.dot(-y.transpose(),np.log(sigmoid(z))) - \
                     np.dot((1-y.transpose()),np.log(1-sigmoid(z)))) + \
                    (lam/(2*m))*np.sum(theta[1:len(theta)+1]**2);
  
    # gradient
    regterm = np.insert(theta[1:len(theta)+1],0,0)
    grad = (1/m)*np.sum((sigmoid(z) - y)*X,0) + (lam/m)*regterm
    
    
    return J, grad

Small test: make sure the cost function works.


In [ ]:
# check that cost function works
theta = np.array([1,2,3])
lam = 0
cost, grad = costFunction(theta, X1, y1*1,lam)
print("cost:", cost)
print("grad:", grad)

For diagnostic purposes, we define a callback function that will print information about the state and gradient as the optimization algorithm proceeds.


In [ ]:
def callbackF(theta):
    global NFeval
    global X1
    global y1
    global lam
    cost,grad = costFunction(theta,X1,y1*1,lam)
    print("%4d   %3.6f   %3.6f   %3.6f   %3.6f   %3.6f  %3.6f  %3.6f" % \
          (NFeval, theta[0], theta[1], theta[2], cost, grad[0], grad[1], grad[2]))  
    NFeval+=1

Finally, we run the optimization.


In [ ]:
# run optimization
NFeval = 1
#initial_theta = np.array([-5,4,3])
initial_theta = np.array([1])
#print("iter      t1          t2         t3     cost      grad1       grad2       grad3")
#res = sp.optimize.minimize(lambda t:  costFunction(t,X1,y1*1,lam), initial_theta, method='CG',\
#                           jac=True,options={'maxiter':100,'disp':True}, callback=callbackF)   
res = sp.optimize.minimize(lambda t:  costFunction(t,X1,y1*1,lam), initial_theta, method='CG',\
                           jac=True,options={'maxiter':100,'disp':True})

To see how it did, we replot the data with the logistic classifier superimposed over the top.


In [ ]:
# plot the normalized data with regression line
theta = res.x
fig, ax = plt.subplots(1,1)
ax.plot(X1[y1,1],X1[y1,2],"ro",markerfacecolor="None")
ax.plot(X1[~y1,1],X1[~y1,2],"bo",markerfacecolor="None")
xplot = np.array([-3,3])
yplot = (-1/theta[2])*(theta[1]*xplot+theta[0])
ax.plot(xplot,yplot,'g',linewidth=2)
ax.set_xlabel('Rank difference')
ax.set_ylabel('Height')
ax.set_title('Higher-rank-wins as a function of age and height')
ax.set_ylim((-5,5))

Also develop a quantitative measure of success: count the number of correct predictions, and compare to what would have been predicted by rank alone.


In [ ]:
rank_success_rate = np.sum(y)/len(y)
y_pred = (np.dot(X,theta)>0)
prediction_success_rate = np.sum(~(y^y_pred))/len(y)
print("prediction success rate: ", prediction_success_rate)
print("rank success rate: ", rank_success_rate)

Try to duplicate the above with sklearn's canned logistic algorithm


In [ ]:
t=lr.get_params()

In [ ]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(C=1., solver='lbfgs')
t = lr.fit(X1,y1*1)

In [ ]:
t=lr.decision_function(X)

Comments:

The above procedure turns tennis match outcomes into training data by assigning 1s to matches in which the higher ranked opponent won.

TODO:

  1. investigate other methods of classifying matches as 1s or 0s
  2. investigate other combinations of features to use for the regression
  3. test feature choices by dividing data set into training, validation, and test sets

This section uses some basic scikit functionality to train an SVM classifier on rank and ace data. The example is again a trivial one, but defines a workflow and sets the stage for investigating more complex relations.


In [ ]:
# we'll use the SVM package in the scikit library
from sklearn import svm

After classifying the SVM classifier, we'll need some helper functions to form contour plots. These helper functions are borrowed from the scikit documentation, http://scikit-learn.org/stable/auto_examples/svm/plot_iris.html#sphx-glr-auto-examples-svm-plot-iris-py.


In [ ]:
# produce a dense grid of points in rectangle around the data
def make_meshgrid(x, y, h=.02):
    """Create a mesh of points to plot in

    Parameters
    ----------
    x: data to base x-axis meshgrid on
    y: data to base y-axis meshgrid on
    h: stepsize for meshgrid, optional

    Returns
    -------
    xx, yy : ndarray
    """
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return xx, yy

# produce a contour plot with predicted outcomes from SVM classifier
def plot_contours(ax, clf, xx, yy, **params):
    """Plot the decision boundaries for a classifier.

    Parameters
    ----------
    ax: matplotlib axes object
    clf: a classifier
    xx: meshgrid ndarray
    yy: meshgrid ndarray
    params: dictionary of params to pass to contourf, optional
    """
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out

We'll run our test on a slightly different set of data than last time. Here, we'll still classify matches as 1 if the higher ranked player wins and 0 otherwise, but we'll focus on age and height as our predictive features.


In [ ]:
# extract from MySQL database info on rank points and height for both winner and loser, store in dataframe
with engine.begin() as connection:
    rawdata = pd.read_sql_query("""SELECT winner_rank_points, loser_rank_points, winner_age, loser_age, winner_ht, loser_ht \
                            FROM matches \
                            WHERE tourney_date > '20170101' \
                            AND winner_rank_points IS NOT NULL \
                            AND loser_rank_points IS NOT NULL \
                            AND winner_age IS NOT NULL \
                            AND loser_age IS NOT NULL \
                            AND winner_ht IS NOT NULL \
                            AND loser_ht IS NOT NULL""", connection)

In [ ]:
# dictionary connecting `matches` tourney_name to `odds` Location
tourney_pairs =  \
[('Brisbane','Brisbane')  ,
('Chennai','Chennai')  ,
('Doha','Doha')  ,
('Auckland','Auckland')  ,
('Sydney','Sydney')  ,
('Australian Open','Melbourne')  ,
('Johannesburg','Johannesburg')  ,
('Santiago','Santiago')  ,
('Zagreb','Zagreb')  ,
('Costa Do Sauipe','Costa Do Sauipe')  ,
('Rotterdam','Rotterdam')  ,
('San Jose','San Jose')  ,
('Buenos Aires','Buenos Aires')  ,
('Marseille','Marseille')  ,
('Memphis','Memphis')  ,
('Acapulco','Acapulco')  ,
('Delray Beach','Delray Beach')  ,
('Dubai','Dubai')  ,
('Indian Wells Masters','Indian Wells')  ,
('Miami Masters','Miami')  ,
('Casablanca','Casablanca')  ,
('Houston','Houston')  ,
('Monte Carlo Masters','Monte Carlo')  ,
('Barcelona','Barcelona')  ,
('Rome Masters','Rome')  ,
('Belgrade','Belgrade')  ,
('Munich','Munich')  ,
('Estoril','Estoril')  ,
('Madrid Masters', 'Madrid')  ,
('Roland Garros','Paris')  ,
('Nice','Nice')  ,
("Queen's Club",'Queens Club')  ,
('Halle','Halle')  ,
('s-Hertogenbosch',"'s-Hertogenbosch")  ,
('Eastbourne','Eastbourne')  ,
('Wimbledon','London')  ,
('Newport','Newport')  ,
('Stuttgart','Stuttgart')  ,
('Bastad','Bastad')  ,
('Hamburg','Hamburg')  ,
('Atlanta','Atlanta')  ,
('Los Angeles','Los Angeles')  ,
('Gstaad', 'Gstaad')  ,
('Washington','Washington')  ,
('Umag', 'Umag')  ,
('Canada Masters','Toronto')  ,
('Canada Masters','Montreal')  ,
('Cincinnati Masters','Cincinnati')  ,
('US Open','New York')  ,
('Metz','Metz')  ,
('Bucharest','Bucharest')  ,
('Kuala Lumpur','Kuala Lumpur')  ,
('Bangkok', 'Bangkok')  ,
('Tokyo','Tokyo')  ,
('Beijing', 'Beijing')  ,
('Moscow','Moscow')  ,
('Shanghai Masters','Shanghai')  ,
('Montpellier','Montpellier')  ,
('Stockholm','Stockholm')  ,
('Vienna','Vienna')  ,
('St. Petersburg', 'St. Petersburg')  ,
('Basel','Basel')  ,
('Valencia','Valencia')  ,
('Paris Masters','Paris')  ,
('Tour Finals','London'),
('New Haven', 'New Haven'),
('Kitzbuhel','Kitzbuhel'),
('Winston-Salem','Winston-Salem'),
('Sao Paulo','Sao Paulo'),
('Bogota','Bogota'),
('Vina del Mar','Vina del Mar'),
('Santiago','Vina del Mar'),
('Estoril','Oeiras'),
('Power Horse Cup','Dusseldorf'),
('Rio de Janeiro','Rio de Janeiro'),
('Dusseldorf','Dusseldorf'),
('Shenzhen','Shenzhen'),
('Quito','Quito'),
('Istanbul','Istanbul'),
('Geneva','Geneva'),
('London','Queens Club'),
('Nottingham','Nottingham'),
('Sofia','Sofia'),
('Marrakech','Marrakech'),
('Los Cabos','Los Cabos'),
('Us Open','New York'),
('St.Petersburg','St. Petersburg'),
('Chengdu','Chengdu'),
('Antwerp','Antwerp'),
('London','London')]

# use dictionary keys and values as columns in a dataframe
tourney_lookup = pd.DataFrame(tourney_pairs,columns = ['m_name','o_name'])
tourney_lookup.m_name = tourney_lookup.m_name.str.lower().str.strip()
tourney_lookup.o_name = tourney_lookup.o_name.str.lower().str.strip()

In [ ]:
# this nx2 array contains the differences in ages and the differences in height
X = pd.concat([rawdata.iloc[:,2]-rawdata.iloc[:,3], \
               rawdata.iloc[:,4]-rawdata.iloc[:,5]], axis=1).values

# this nx1 binary array indicates whether the match was a "success" or a "failure", as predicted by ranking differences
y = (rawdata.iloc[:,0]-rawdata.iloc[:,1]).values > 0

In [ ]:
# for numerical well-behavedness, we need to scale and center the data
X=(X-np.mean(X,axis=0))/np.std(X,axis=0)

In [ ]:
# plot the normalized data 
fig, ax = plt.subplots(1,1)
ax.plot(X[y,0],X[y,1],"ro")
ax.plot(X[~y,0],X[~y,1],"bo")
ax.set_xlabel('Age')
ax.set_ylabel('Height')
ax.set_title('Higher-rank-wins as a function of age and height')

We'll use the scikit svm package to train an SVM classifier on this data. We'll plot the results as a contour graph.


In [ ]:
# find the SVM classifier
clf = svm.SVC()
clf.fit(X, y) 

# generate a dense grid for producing a contour plot
X0, X1 = X[:, 0], X[:, 1]
xx, yy = make_meshgrid(X0, X1)

# feed the grid into the plot_contours routinge
fig, ax = plt.subplots(1, 1)
plot_contours(ax, clf, xx, yy,
              cmap=plt.cm.coolwarm, alpha=0.8)
ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set_xlabel('Rank points')
ax.set_ylabel('First serve %')
ax.set_xticks(())
ax.set_yticks(())
ax.set_title('SVM classifier for height/age data')

Note that these features produce a fairly mixed set of points, so they are unlikely to be highly predictive. But it is interesting to note the pocket of blue in the lower left corner: it suggests that, all other being equal, players who are younger and shorter are likely to fair worse than predicted.

IV. Betting

Here we explore the profitability of our algorithms.


In [ ]:
# name of database
db_name = "tennis"
# name of db user
username = "testuser"
# db password for db user
password = "test623"
# location of atp data files
atpfile_directory = "../data/tennis_atp-master/"
# location of odds data files
oddsfiles_directory = "../data/odds_data/"

#%%

#
# PACKAGES
#

import sqlalchemy # pandas-mysql interface library
import sqlalchemy.exc # exception handling
from   sqlalchemy import create_engine  # needed to define db interface
import glob # for file manipulation
import sys # for defining behavior under errors

#%%

#
# This cell tries to connect to the mysql database "db_name" with the login
# info supplied above.  If it succeeds, it prints out the version number of 
# mysql, if it fails, it exits gracefully.
#

# create an engine for interacting with the MySQL database
try:
    eng_str = 'mysql+mysqldb://' + username + ':' + password + '@localhost/' + db_name
    engine = create_engine(eng_str)
    connection = engine.connect()
    version = connection.execute("SELECT VERSION()")
    print("Database version : ")
    print(version.fetchone())

# report what went wrong if this fails.    
except sqlalchemy.exc.DatabaseError as e:
    reason = e.message
    print("Error %s:" % (reason))
    sys.exit(1)

# close the connection
finally:            
    if connection:    
        connection.close()
    else:
        print("Failed to create connection.")

In [ ]:
# extract from MySQL database info odds
with engine.begin() as connection:
    rawdata = pd.read_sql_query("""SELECT PSW, PSL, WRank, LRank FROM odds \
                            WHERE PSW IS NOT NULL \
                            AND PSL IS NOT NULL \
                            AND WRank IS NOT NULL \
                            AND LRank IS NOT NULL;""", connection)

In [ ]:
investment = len(rawdata)
good_call_idx = (rawdata["LRank"]-rawdata["WRank"]>0)
winner_odds = rawdata["PSW"]
gain = sum(winner_odds*good_call_idx) 
roi = gain - investment

print("total invested:  ", investment)
print("return on investment:  ", roi)

In [ ]:
np.min(winner_odds)

In [ ]:
rawdata

This work is just the begining part of this analysis. The gold standard for professional tennis match prediction is to beat the betting market. There is odds data available for many of these matches, and the proper metric for predictive success is probably percentage increase in profit.

Further work includes the following:

  1. implement a neural net
  2. systematically investigate other combinations of features, including polynomial features
  3. use join commands to extract more complex subsets of the data (i.e. court-type specific data, players of a certain origin, players with certain tournament play patterns, etc.)
  4. benchmark algorithms against one another using expected winnings ideas

Scrap


In [ ]:
n=10
X = np.random.normal(0,1,(n,2))
y = np.random.choice(a=[False, True], size=n)
X1= np.insert(X,0,1,axis=1)

In [ ]:
n0=100
n1=100
m0 = 0
m1 = 0
s0 = 1
s1 = 1
X = np.concatenate((np.random.normal(m0,s0,(n0,2)), \
                    np.concatenate((np.random.normal(0,s1,size=(n1,1)), \
                                    np.random.normal(m1,s1,size=(n1,1))),axis=1)),axis=0)
X1 = np.insert(X,0,1,axis=1)
y = np.concatenate((np.ones((n0),dtype=bool),np.zeros((n1),dtype=bool)))

In [ ]:
with engine.begin() as connection:
    rawdata = pd.read_sql_query("""SELECT DISTINCT tourney_date, winner_name, winner_rank_points, winner_rank, \
                            loser_name, loser_rank_points, loser_rank FROM matches \
                            WHERE tourney_date > '20150101' \
                            AND tourney_date < '20160101' \
                            AND tourney_name = "Wimbledon"; """, connection)

In [ ]:
with engine.begin() as connection:
    rawdata = pd.read_sql_query("""SELECT odds.Date, matches.tourney_date, odds.Tournament, \
                            SUBSTRING(odds.winner,1,LOCATE(' ',odds.winner)) AS 'Winner', \
                            SUBSTRING(odds.loser,1,LOCATE(' ',odds.loser)) AS 'Loser', \
                            odds.PSW, odds.PSL, \
                            matches.winner_rank_points, matches.loser_rank_points \
                            FROM odds \
                            INNER JOIN matches \
                            ON odds.Date = matches.tourney_date \
                            AND odds.Tournament = matches.tourney_name \
                            AND odds.Tournament = "Wimbledon" \
                            AND odds.Date > '20150101' \
                            ; """, connection)

In [ ]:
with engine.begin() as connection:
    rawdata = pd.read_sql_query("""SELECT odds.Date, matches.tourney_date, odds.Tournament, \
                            SUBSTRING(odds.winner,1,LOCATE(' ',odds.winner)), \
                            SUBSTRING(odds.loser,1,LOCATE(' ',odds.loser)) \
                            FROM odds \
                            INNER JOIN matches \
                            ON (odds.Date = matches.tourney_date \
                            AND odds.Tournament = matches.tourney_name \
                            AND odds.Tournament = "Wimbledon" \
                            AND matches.winner_name REGEXP SUBSTRING(odds.winner,1,LOCATE(' ',odds.winner)) \
                            AND matches.loser_name REGEXP SUBSTRING(odds.loser,1,LOCATE(' ',odds.loser))) \
                            ; """, connection)

In [ ]:
with engine.begin() as connection:
    rawdata = pd.read_sql_query("""SELECT (matches.winner_name REGEXP SUBSTRING(odds.winner,1,LOCATE(' ',odds.winner))) AS 'Namematch', \
                            SUBSTRING(odds.winner,1,LOCATE(' ',odds.winner)) AS 'Winner', \
                            SUBSTRING(odds.loser,1,LOCATE(' ',odds.loser)) AS 'Loser', \
                            matches.winner_name, matches.loser_name \
                            FROM matches \
                            INNER JOIN odds \
                            ON (odds.Tournament="Wimbledon" AND \
                            odds.Winner = "Tomic B.") \
                            ; """, connection)

In [ ]:
# create an engine for interacting with the MySQL database
db_name = "test"
try:
    engine.execute("CREATE DATABASE IF NOT EXISTS " + db_name) #create db
    engine.execute("USE " + db_name) # select new db

    # report what went wrong if this fails.    
except sqlalchemy.exc.DatabaseError as e:
    reason = e.message
    print("Error %s:" % (reason))
    sys.exit(1)

In [ ]:
with engine.begin() as connection:
        connection.execute("CREATE TABLE matches (tourney_name VARCHAR(256));")
        connection.execute("ALTER TABLE matches ADD COLUMN tourney_date DATE;")
        connection.execute("ALTER TABLE matches ADD COLUMN winner_name VARCHAR(256);")
        connection.execute("ALTER TABLE matches ADD COLUMN winner_rank_points SMALLINT UNSIGNED;")
        connection.execute("ALTER TABLE matches ADD COLUMN loser_name VARCHAR(256);") 
        connection.execute("ALTER TABLE matches ADD COLUMN loser_rank_points SMALLINT UNSIGNED;")

In [ ]:
with engine.begin() as connection:
        query=("""CREATE TABLE odds 
        (Tournament	VARCHAR(256),
         Date DATE,
         Winner VARCHAR(256), 
         Loser VARCHAR(256), 
         PSW DECIMAL(5,3),
         PSL DECIMAL(5,3))
         ;""")
        connection.execute(query)

In [ ]:
with engine.begin() as connection:
        query=("""CREATE TABLE t1 
        (name VARCHAR(256),
         id tinyint unsigned)
         ;""")
        connection.execute(query)

In [ ]:
with engine.begin() as connection:
        query=("""CREATE TABLE t2 
        (name VARCHAR(256),
         id tinyint unsigned)
         ;""")
        connection.execute(query)

In [ ]:
with engine.begin() as connection:
    connection.execute("""insert into t1 (name, id) values \
    ("Bob the Blob", 3), \
    ("Pedro el Gato", 4), \
    ("Josie Slatterly", 5), \
    ("Urs Burs", 4);""")

In [ ]:
with engine.begin() as connection:
    connection.execute("""insert into t2 (name, id) values \
    ("Blob B.", 3), \
    ("Gato P.", 5), \
    ("Slatterly J.", 5), \
    ("Burs U.", 3);""")

In [ ]:
with engine.begin() as connection:
    rawdata = pd.read_sql_query("""SELECT t1.name, t1.id, t2.name, t2.id, \
                            SUBSTRING(t2.name,1,LOCATE(' ',t2.name)) \
                            FROM t1 \
                            JOIN t2 \
                            ON (t1.name REGEXP SUBSTRING(t2.name,1,LOCATE(' ',t2.name))) \
                            ; """, connection)

In [ ]:
with engine.begin() as connection:
    connection.execute("""insert into matches values ('Wimbledon', '20150629', \
    'Jarkko Nieminen', 564, 'Lleyton Hewitt', 118);""")
    connection.execute("""insert into matches values ('Wimbledon', '20150629', \
    'Pierre Hugues Herbert', 353, 'Hyeon Chung', 79);""")
    connection.execute("""insert into matches values ('Wimbledon', '20160629', \
    'Bernard Tomic', 1355, 'Jan Lennard Struff', 112);""")

In [ ]:
with engine.begin() as connection:
    connection.execute("""insert into odds values ('Wimbledon', '20150629', \
    'Przysiezny M.','Ljubicic I.', 4.880, 1.230);""")
    connection.execute("""insert into odds values ('Wimbledon', '20150629', \
    'Lopez F.','Levine J.', 1.270, 4.300);""")
    connection.execute("""insert into odds values ('Wimbledon', '2016-06-21', \
    'Lu Y.H.','Zeballos H.', 1.910, 2.020);""")
    connection.execute("""insert into odds values ('Wimbledon', '20160629', \
    'Tomic B.','Struff J.L.', 1.910, 2.020);""")

In [ ]:
with engine.begin() as connection:
    connection.execute("""CREATE VIEW odds_names AS (    
    SELECT \
    SUBSTRING(t2.name,1,LOCATE(' ',t2.name)) AS 'Name' \
    FROM t2
    );""")

In [ ]:
with engine.begin() as connection:
    connection.execute("""select * from t1 inner join on (t1.name regexp odds_names.Name)""")

Bibliography:


In [ ]:
# number of "True" minus the number of "False"
nsamps = sum(y) - sum(~y)
# the smaller class can be associated with either "True" or "False"
smallclass = bool(nsamps<0)
# sample with replacement from data of the appropriate class
samps = X1[np.random.choice(np.where(y==smallclass)[0],size=nsamps)]
# augment the old data
X1 = np.concatenate((X,samps),axis=0)
y1 = np.concatenate((y,np.zeros(nsamps,dtype=bool)))

In [ ]:
import IPython
IPython.display.display(matches)

In [ ]:


In [ ]:
startdate = '20100101'
enddate = '20171231'
with engine.begin() as connection:
    matchdata = pd.read_sql_query("""SELECT tourney_date AS Date, tourney_name AS Tournament, \
                            winner_id AS WID, winner_name AS WName, winner_rank_points AS WPts, \
                            loser_id AS LID, loser_name AS LName, loser_rank_points AS LPts \
                            FROM matches \
                            WHERE tourney_date < '""" + enddate + """' \
                            AND tourney_date > '""" + startdate + """'  \
                            AND tourney_level = 'G' \
                            AND winner_rank_points IS NOT NULL \
                            AND loser_rank_points IS NOT NULL
                            ;""", connection)
    
    oddsdata = pd.read_sql_query("""SELECT Date, Tournament, \
                            Winner, WPts, Loser, LPts, \
                            PSW, PSL
                            FROM odds \
                            WHERE Date < '""" + enddate + """' \
                            AND Date > '""" + startdate + """' \
                            AND Series = 'Grand Slam' \
                            ;""", connection)