Predicting Professional Tennis Match Outcomes
Author: Carl Toews
Date: August 8, 2017
Project Description: This project explores various machine learning techniques on professional tennis data. The data set was compiled by Jeff Sackmann (https://github.com/JeffSackmann/tennis_atp) and consists of Association of Tennis Professionals (ATP) match outcomes for all major ATP matches since 1968. In all there are over half a million records. Features change slightly over the years, but by 2017 include almost 50 elements, including rank, age, seed, and match statistics for both the winner and loser.
Technical specs: I've tested this notebook on Python 3.6.2 and MySQL 5.7.16.
Outline:
I. MySQL database setup
II. One-dimensional logistic regression demo: predicting with rank differences
III. $n$-dimensional logistic regression demo: predicting with other features
IV. SVM demo: age and height difference
V. TODOs
The data on Sackmann's page is in CSV format. While it's easy enough to read such data directly into a pandas data structure, one of my goals in this project is to get some practice with the pandas-MySQL interface. To set the stage for this work, I first need to transfer all the data from CSV files to a MySQL database. Since this process is involved, I have included it in a separate notebook called setup_mysql_database.
If the data is already in a MySQL database, skip this notebook, but if it is not, or you are not sure, run it. It will first check to see whether the CSV data has already been put into a MySQL database, and if it has not, it will create such a database. In order to run the notebook, you will need the username and password of a MySQL user that has permission to create databases and tables.
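For orientation, here is a minimal sketch (assuming the same credentials configured below) of the kind of existence check that notebook performs; this is an illustration, not the notebook's actual code:
In [ ]:
# hedged sketch: list the existing MySQL databases and look for `tennis`
from sqlalchemy import create_engine
eng = create_engine('mysql+mysqldb://testuser:test623@localhost/')
with eng.connect() as conn:
    dbs = [row[0] for row in conn.execute("SHOW DATABASES")]
print('tennis' in dbs)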
Once the MySQL database is in place, you can run the following cells, which establish the connectivity needed to execute the remainder of this notebook.
In [743]:
# name of database
db_name = "tennis"
# name of db user
username = "testuser"
# db password for db user
password = "test623"
# location of atp data files
atpfile_directory = "./data/tennis_atp-master/"
# location of odds data files
oddsfiles_directory = "./data/odds_data/"
In [744]:
import sqlalchemy # pandas-mysql interface library
import sqlalchemy.exc # exception handling
from sqlalchemy import create_engine # needed to define db interface
import sys # for defining behavior under errors
from IPython.core.debugger import Tracer
#%qtconsole
In [745]:
# create an engine for interacting with the MySQL database
connection = None
try:
    eng_str = 'mysql+mysqldb://' + username + ':' + password + '@localhost/' + db_name
    engine = create_engine(eng_str)
    connection = engine.connect()
    version = connection.execute("SELECT VERSION()")
    print("Database version : ")
    print(version.fetchone())
# report what went wrong if this fails
except sqlalchemy.exc.DatabaseError as e:
    print("Error %s:" % str(e))
    sys.exit(1)
# close the connection
finally:
    if connection:
        connection.close()
    else:
        print("Failed to create connection.")
We start out by performing logistic regression on rank alone, with the aim of seeing to what extent rank difference can be used as a predictive tool. Our work flow will involve SQL queries, pandas data frames, and numpy arrays, more or less as follows:
In [746]:
import numpy as np # numerical libraries
import scipy as sp
import pandas as pd # for data analysis
import pandas.io.sql as sql # for interfacing with MySQL database
from scipy import linalg # linear algebra libraries
from scipy import optimize
import matplotlib as mpl # a big library with plotting functionality
import matplotlib.pyplot as plt # a subset of matplotlib with most of the useful tools
import IPython as IP
%matplotlib inline
import pdb
In [ ]:
# extract from MySQL database info on rank points for both winner and loser, store in dataframe
with engine.begin() as connection:
    rawdata = pd.read_sql_query("""SELECT winner_rank_points, loser_rank_points FROM matches \
                                   WHERE tourney_date < '20150101' \
                                   AND tourney_date > '20110101' \
                                   AND winner_rank_points IS NOT NULL \
                                   AND loser_rank_points IS NOT NULL""", connection)
If the winner had a higher rank than the loser, the 'rank prediction' is considered a 'success'; otherwise it is a 'failure'. For analysis, we store only the absolute value of the rank difference, along with the outcome of the match. We also scale the data for numerical well-behavedness.
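As a toy illustration of this encoding (made-up numbers, not from the database):
In [ ]:
# two hypothetical matches: winner ranked 1200 points vs 800, and winner 500 vs 2000
rank_diff_example = np.array([1200 - 800, 500 - 2000])
print(rank_diff_example > 0)       # [ True False ]: one success, one failure
print(np.abs(rank_diff_example))   # [ 400 1500 ]: the feature we regress on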
In [ ]:
# winner rank minus loser rank
rank_diff = (rawdata['winner_rank_points'] - rawdata['loser_rank_points']).values
# index variable: True if higher ranked player won, False otherwise
y = (rank_diff > 0)
# final dataset with one column: absolute value of the rank difference
X = np.abs(rank_diff)
# for numerical well-behavedness, we need to scale the data
#X1=(X-np.mean(X,0))/np.std(X,axis=0)
X=X/np.std(X,axis=0)
In [ ]:
def sigmoid(z):
    '''
    Usage: sigmoid(z)
    Description: Computes the value of the sigmoid function for a scalar.
    For a vector or matrix, computes the value of the sigmoid function for each entry.
    '''
    return 1/(1+np.exp(-z))
The cost function is designed to take a regularization parameter lambda. For a non-regularized solution, lambda can be set equal to 0. The cost function returns both a cost and the gradient for any given value of parameters $\theta$.
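Concretely, for $n$ observations the regularized cost implemented below is

$$J(\theta) \;=\; \frac{1}{n}\Bigl[-y^{T}\log\bigl(\sigma(X\theta)\bigr) \;-\; (1-y)^{T}\log\bigl(1-\sigma(X\theta)\bigr) \;+\; \frac{\lambda}{2}\sum_{j}\theta_{j}^{2}\Bigr],$$

where $\sigma$ is the sigmoid above and the sum over $j$ omits the offset term when one is present.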
In [ ]:
# define a cost function
def costFunction(theta,X,y,lam,offset=True):
    '''
    Computes the cost and gradient for logistic regression.
    Input:
        theta (mx1 numpy array of parameters)
        X (nxm numpy array of feature values, first column all 1s)
        y (nx1 boolean array of outcomes, 1=higher ranked player won, 0 otherwise)
        lam (scalar: regularization parameter)
        offset (bool: True if first element of theta represents a translation, False otherwise)
    Output:
        cost (scalar value of cost)
        grad (gradient of the cost with respect to theta)
    '''
    # number of data points
    n = len(y)
    # number of parameters
    m = len(theta)
    # make sure vectors are column vectors for use of "np.dot"
    theta = theta.reshape(-1,1)
    y = y.reshape(-1,1)
    X = X.reshape(-1,1) if m==1 else X
    # input to sigmoid function will be a column vector
    z = np.dot(X,theta) if m > 1 else X*theta
    # indices of the parameters that get regularized (skip the offset if present)
    regterms = np.arange(1,m) if offset else np.arange(0,m)
    # cost function
    J = (1/n)*(-np.dot(y.transpose(),np.log(sigmoid(z))) - \
        np.dot((1-y.transpose()),np.log(1-sigmoid(z))) + \
        (lam/(2))*np.sum(theta[regterms]**2))
    # gradient (the offset term, if present, is not regularized)
    reggrad = np.insert(theta[regterms],0,0) if offset else theta.flatten()
    grad = (1/n)*np.sum((sigmoid(z) - y)*X,0) + (lam/n)*reggrad
    return np.squeeze(J), np.squeeze(grad)
Small test: make sure the cost function works.
In [ ]:
# check that cost function works
theta = np.array([1.0])
lam = 0
cost, grad = costFunction(theta, X, y*1,lam)
print("cost:", cost)
print("grad:", grad)
For diagnostic purposes, we define a callback function that will print information about the state and gradient as the optimization algorithm proceeds.
In [ ]:
def callbackF(theta):
    global NFeval
    global X
    global y
    global lam
    cost,grad = costFunction(theta,X,y*1,lam)
    print("%4d %3.6f %3.6f %3.6f" % \
          (NFeval, theta[0], cost, grad))
    NFeval+=1
In [ ]:
NFeval = 1
initial_theta = np.array([.1])
print("iter theta cost grad ")
res = sp.optimize.minimize(lambda t: costFunction(t,X,y*1,lam), initial_theta, method='CG',\
jac=True,options={'maxiter':100,'disp':True},callback=callbackF)
In [ ]:
# histogram empirical success probabilities by binned rank difference
hist, bin_edges = np.histogram(X,bins=100)
p = [np.sum(y[np.where((X>=bin_edges[i]) & (X<bin_edges[i+1]))[0]])/hist[i] for i in np.arange(len(bin_edges)-1)]
bar_pos = np.arange(len(p))
bar_width = np.diff(bin_edges)
plt.bar(bin_edges[0:-1], p, width=bar_width, align='edge', alpha=0.5)
# superimpose the fitted logistic curve
r = np.arange(X.min(),X.max(),.1)
s = 1/(1+np.exp(-res.x*r))
plt.plot(r,s)
We'll evaluate the success of our methods by comparing our predictions to those implied by the betting markets. In order to do this, we need to establish a 1-1 correspondence between matches in the `odds` database and matches in the `matches` database.
Data wrangling
Matching is not straightforward, unfortunately. The `odds` data contains the variables `Location` and `Tournament`, while the `matches` data contains `tourney_name`. But the names are not consistent across these datasets: for example, the tournament "French Open" in `odds` is called "Roland Garros" in `matches`. The `Location` variable in `odds` is generally a pretty close match to the `tourney_name` in `matches`, but not always. Moreover, the `date` variables differ slightly between these datasets: sometimes every match in a tournament is pegged with a single date (generally the start date), sometimes each match has the date it was actually played.
The following code attempts to work through these and related issues. There is a fair bit of ugly data wrangling involved: my solution is ultimately to build a lookup table connecting `Location` in `odds` to `tourney_name` in `matches`. The part of the table where there is an actual match between these variables is easy to build; the other part involves some manual inspection of the underlying CSV files. The code builds the easy part of the table, and then flags which matches still need to be identified manually. I focus on data in the 2010-2016 range.
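As a toy illustration of the lookup idea (made-up rows; the real table is read from tourney_lookup.csv below):
In [ ]:
# two locations that match the tournament name directly, one that does not
toy_lookup = pd.DataFrame({'o_name': ['doha', 'brisbane', 'paris'],
                           'm_name': ['doha', 'brisbane', 'roland garros']})
print(toy_lookup.loc[toy_lookup.o_name == 'paris', 'm_name'].iloc[0])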
Get the data from the appropriate timeframe
In [803]:
# focus on most recent data; exclude Davis Cup stuff
startdate = '20100101'
enddate = '20161231'
with engine.begin() as connection:
    odds = pd.read_sql_query("""SELECT * FROM odds \
                                WHERE DATE >= '""" + startdate + """' \
                                AND DATE <= '""" + enddate + """';""", connection)
with engine.begin() as connection:
    matches = pd.read_sql_query("""SELECT * FROM matches \
                                   WHERE tourney_date >= '""" + startdate + """' \
                                   AND tourney_date <= '""" + enddate + """' \
                                   AND tourney_name NOT LIKE 'Davis%%';""", connection)
# view results
IP.display.display(odds[0:3])
IP.display.display(matches[0:3])
Tidy up the strings (strip whitespace, convert to lowercase, replace dashes with spaces)
In [879]:
odds[['Location','Winner','Loser']] = \
odds[['Location','Winner','Loser']].apply(lambda x: x.str.strip().str.lower().str.replace('-',' '),axis=1)
matches[['tourney_name','winner_name','loser_name']] = \
matches[['tourney_name','winner_name','loser_name']].apply(lambda x: x.str.strip().str.lower().str.replace('-',' '),axis=1)
Group both `odds` and `matches` by unique tournament
In [805]:
g_matches = matches.groupby('tourney_id')
g_odds= odds.groupby(['ATP','fname'])
Extract features to assist in "tourney-matching".
In [806]:
def extract_odds_features(group):
    sizes = len(group)
    min_date = group['Date'].min()
    max_date = group['Date'].max()
    location = group['Location'].unique()[0]
    return pd.Series({'size': sizes,'min_date':min_date, 'max_date':max_date,'location':location})

def extract_matches_features(group):
    sizes = len(group)
    min_date = group['tourney_date'].min()
    max_date = group['tourney_date'].max()
    tourney_name = group['tourney_name'].unique()[0]
    return pd.Series({'size': sizes,'min_date':min_date, 'max_date':max_date,'tourney_name':tourney_name})
g_odds = g_odds.apply(extract_odds_features).reset_index()
g_matches = g_matches.apply(extract_matches_features).reset_index()
Define a lookup table connecting `Location` in `odds` to `tourney_name` in `matches`
In [807]:
tourney_lookup = pd.read_csv('tourney_lookup.csv')
print("Snapshot of lookup table:")
IP.display.display(tourney_lookup.sort_values('o_name')[15:25])
Define a function that takes as input a set of features from one tournament in `odds`, and tries to match it to one tournament in `matches`.
In [808]:
def get_tourney_ID(o_row):
    """
    function: get_tourney_ID(o_row)
    Input: row from dataframe g_odds
    Output: a Series object with two elements: 1) a match ID,
    and 2) a flag of True if the sizes of the two tournaments are identical
    """
    # calculate the difference in start/stop dates between this tournament and those in `matches`
    min_date_delta = np.abs(g_matches['min_date'] - o_row['min_date']).apply(lambda x: x.days)
    max_date_delta = np.abs(g_matches['max_date'] - o_row['max_date']).apply(lambda x: x.days)
    # find a list of candidate tournament names, based on lookup table
    mtchs = (tourney_lookup['o_name']==o_row['location'])
    if sum(mtchs)>0:
        m_name = tourney_lookup.loc[mtchs,'m_name']
    else:
        print('no match found for record {}'.format(o_row['location']))
        return pd.Series({'ID':'None','size':'NA'})
    # the "right" tournament has the right name, and reasonably close start or stop dates
    idx = ((min_date_delta <=3) | (max_date_delta <=1)) & (g_matches['tourney_name'].isin(m_name))
    record = g_matches.loc[idx,'tourney_id']
    # if there are no matches, print some diagnostic information and don't assign a match
    if len(record)<1:
        print("Warning: no match found for `odds` match {}, year {}".format(o_row.ATP, o_row.fname))
        print("min date delta: {}, max date delta: {}, g_matches: {}".format(np.min(min_date_delta), \
              np.min(max_date_delta), \
              g_matches.loc[g_matches['tourney_name'].isin(m_name),'tourney_name']))
        return pd.Series({'ID':'None','size':'NA'})
    # if there are too many matches, print a warning and don't assign a match
    elif (len(record)>1):
        print("Warning: multiple matches found for `odds` match {}".format(o_row.ATP))
        return pd.Series({'ID':'Multiple','size':'NA'})
    # otherwise, assign a match, and check if the sizes of the tournaments are consistent (a good double-check)
    else:
        size_flag = (g_matches.loc[idx,'size']==o_row['size'])
        return pd.Series({'ID':record.iloc[0],'size':size_flag.iloc[0]})
Perform the matching.
In [809]:
# add columns to g_odds to hold match ID and also info about size-correspondence
g_odds.insert(len(g_odds.columns),'ID','None')
g_odds.insert(len(g_odds.columns),'sizes_match','NA')
# perform the match
g_odds[['ID','sizes_match']] = g_odds.apply(get_tourney_ID,axis=1).values
Merge match numbers back into the bigger `odds` table if the previous routine produced no errors.
In [810]:
# merge on ATP and year
if sum(g_odds['sizes_match']==True) == len(g_odds):
    odds = pd.merge(g_odds[['ATP','fname','ID','size']],odds,how='inner',on=['ATP','fname'])
else:
    print("At least one tournament in `odds` is matched to a tournament in `matches` of a different size.")
In [811]:
matches = pd.merge(g_matches[['tourney_id','size']],matches,how='inner',on=['tourney_id'])
In [812]:
# match numbers to odds
odds.insert(5,'match_num',0)
grouped = odds[['ID','match_num']].groupby('ID')
odds['match_num'] = grouped.transform(lambda x: 1+np.arange(len(x)))
In [840]:
# add keys to both odds and match data
odds.insert(len(odds.columns),'key',np.arange(len(odds)))
matches.insert(len(matches.columns),'key',np.arange(len(matches)))
Assign a match_id number to each individual match
The `matches` dataset includes a unique match number for each match within a tournament. The `odds` dataset does not. To assign such numbers, we write a function that takes as input all the records for a given tournament from both `matches` and `odds`, tries to match player names, and returns either the match number (if the player name match was successful) or `nan` (if not).
<img src="./aux/example_of_name_matching_problems.png" width="500" height="500">
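A minimal sketch of the name-matching idea just described (toy inputs; the helper name and behavior here are illustrative, not the final implementation):
In [ ]:
# return the match number when the last names agree, nan otherwise
def toy_name_match(m_last, o_last, match_num):
    return match_num if m_last == o_last else np.nan
print(toy_name_match('federer', 'federer', 7))   # 7
print(toy_name_match('del potro', 'potro', 7))   # nan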
In [ ]:
g_matches = matches.groupby('tourney_id')
g_odds= odds.groupby(['ATP','fname'])
In [814]:
# create a lookup table to be able to match on rounds
m_rounds = ['R128','R64','R32','R16','QF','SF','F','RR']
o_rounds = ['1st Round','2nd Round','3rd Round','4th Round', \
'Quarterfinals','Semifinals','The Final','Round Robin']
round_lookup_small = pd.DataFrame({'m_rounds': m_rounds[2:-1],\
'o_rounds':o_rounds[0:2]+o_rounds[4:-1]})
round_lookup_medium = pd.DataFrame({'m_rounds': m_rounds[1:-1],\
'o_rounds':o_rounds[0:3]+o_rounds[4:-1]})
round_lookup_large = pd.DataFrame({'m_rounds': m_rounds[0:-1],\
'o_rounds':o_rounds[0:-1]})
round_lookup_RR = pd.DataFrame({'m_rounds':m_rounds[5:],\
'o_rounds':o_rounds[5:]})
In [815]:
# figure out how many discrete sizes there are
print("size in odds: ", odds['size'].unique())
print("size in matches: ", matches['size'].unique())
In [816]:
def map_rounds(x):
    cur_name = x['Round']
    t_size = x['size']
    if t_size in [27,31]:
        new_name = round_lookup_small.loc[round_lookup_small.o_rounds==cur_name,'m_rounds']
    elif t_size in [47,55]:
        new_name = round_lookup_medium.loc[round_lookup_medium.o_rounds==cur_name,'m_rounds']
    elif t_size in [95, 127]:
        new_name = round_lookup_large.loc[round_lookup_large.o_rounds==cur_name,'m_rounds']
    else:
        new_name = round_lookup_RR.loc[round_lookup_RR.o_rounds==cur_name,'m_rounds']
    return new_name.iloc[0]
In [817]:
# translate round identifiers appropriately
odds.insert(4,'round','TBD')
odds['round'] = odds.apply(map_rounds,axis=1).values
IP.display.display(odds[0:4])
A quick sanity check: are there the same number of matches in each set?
In [819]:
t1=odds.ID.drop_duplicates().sort_values()
t2=matches.tourney_id.drop_duplicates().sort_values()
m_sizes=matches.loc[matches.tourney_id.isin(t1),['tourney_id','size']].drop_duplicates()
o_sizes=odds.loc[odds.ID.isin(t2),['ID','size']].drop_duplicates()
comp = pd.merge(o_sizes,m_sizes,how='outer',left_on='ID',right_on='tourney_id')
print('sum of sizes of tournaments in odds: ', np.sum(o_sizes['size']))
print('sum of sizes of tournaments in matches: ', np.sum(m_sizes['size']))
In [825]:
matches = matches.loc[matches.tourney_id.isin(t1),:]
print("number of records in `odds`: ", len(odds))
print("number of records in `matches`: ", len(matches))
Now add player names to the mix, and try to match names, rounds, and match IDs.
In [882]:
# extract dataframe with player names split into discrete 'words'
m_players = pd.merge(matches.winner_name.str.split(pat=' ',expand=True), \
matches.loser_name.str.split(pat=' ',expand=True), \
how='inner',left_index=True, right_index=True,suffixes=('_W','_L'))
# join tournament, round, and match identifiers
m_players = pd.merge(matches[['tourney_id','match_num', 'round','key']], m_players,\
how='inner',left_index=True, right_index=True).sort_values(['tourney_id','round','1_W','1_L'])
# extract dataframe with player names split into discrete 'words'
o_players = pd.merge(odds.Winner.str.split(pat=' ',expand=True), \
odds.Loser.str.split(pat=' ',expand=True), \
how='inner',left_index=True, right_index=True,suffixes=('_W','_L'))
# join tournament and round identifiers
o_players = pd.merge(odds[['ID','round','match_num','key']], o_players,\
how='inner',left_index=True, right_index=True).sort_values(['ID','round','0_W','0_L'])
print("m_players: ")
IP.display.display(m_players[0:5])
print("o_players")
IP.display.display(o_players[0:5])
Match individual matches as follows: assuming there is a match on both tournament identifier and round, merge on the players' last names.
In [883]:
A = pd.merge(m_players[['tourney_id','round','key','1_W','1_L']],\
o_players[['ID','round','key','0_W','0_L']],how='inner',\
left_on=['tourney_id','round','1_W','1_L'],\
right_on=['ID','round','0_W','0_L'],suffixes=['_m','_o'])
m_extras = m_players.loc[~m_players.key.isin(A.key_m),:]
o_extras = o_players.loc[~o_players.key.isin(A.key_o),:]
IP.display.display(m_extras[0:10])
IP.display.display(o_extras[0:10])
Is it possible for the same player to play multiple times in a single round, for a single tournament?
In [871]:
m_extras.loc[m_extras[['tourney_id','round','0_W','1_W','2_W']].duplicated(keep=False),:]
In [899]:
def comp_str_lists(a,b):
    # return True if any element of a also appears in b
    for i in a:
        if i in b:
            return True
    return False
In [494]:
A = pd.merge(m_players[['tourney_id','round','key','1_W','1_L']],\
             o_players[['ID','round','key','0_W','0_L']],how='inner',\
             left_on=['tourney_id','round','1_W','1_L'],\
             right_on=['ID','round','0_W','0_L'])
B = pd.merge(m_players,o_players,how='inner',\
             left_on=['tourney_id','round','1_W','1_L'],\
             right_on=['ID','round','0_W','0_L'])
# match on winner last names only, and on loser last names only
wdf = pd.merge(m_players,o_players,how='inner',left_on=['tourney_id','round','1_W'],right_on=['ID','round','0_W'])
ldf = pd.merge(m_players,o_players,how='inner',left_on=['tourney_id','round','1_L'],right_on=['ID','round','0_L'])
IP.display.display(wdf[1:15])
IP.display.display(ldf[1:15])
In [515]:
wldf = pd.merge(wdf,ldf,how='outer',on=['tourney_id','match_num'])
IP.display.display(wldf.loc[wldf.tourney_id=='2010-1536'])
In [516]:
wldf.columns
In general, we'll try to match the last name in `odds` with the last name in `matches` (e.g. column `0_W` with column `1_W`). Long names are problematic. The following code flags names with more than two "words" and tries to build some structure to accommodate such names.
In [539]:
# identify potentially problematic names (generally, those of more than two 'words')
m_bad_winners = m_players.loc[pd.notnull(m_players.loc[:,'2_W']),\
['tourney_id','match_num','0_W','1_W','2_W','3_W','4_W']].drop_duplicates(subset=['0_W','1_W','2_W','3_W','4_W'])
m_bad_losers = m_players.loc[pd.notnull(m_players.loc[:,'2_L']),\
['tourney_id','match_num','0_L','1_L','2_L','3_L','4_L']].drop_duplicates(subset=['0_L','1_L','2_L','3_L','4_L'])
o_bad_winners = o_players.loc[pd.notnull(o_players.loc[:,'2_W']),\
['ID','0_W','1_W','2_W','3_W','4_W']].drop_duplicates(subset=['0_W','1_W','2_W','3_W','4_W'])
o_bad_losers = o_players.loc[pd.notnull(o_players.loc[:,'2_L']),\
['ID','0_L','1_L','2_L','3_L','4_L']].drop_duplicates(subset=['0_L','1_L','2_L','3_L','4_L'])
print("m_bad_winners")
IP.display.display(m_bad_winners)
print("o_bad_winners")
IP.display.display(o_bad_winners)
In [ ]:
temp=pd.merge(o_bad_winners,m_bad_winners,how="inner",left_on=['0_W','1_W'],right_on=['1_W','2_W'])
temp
In [ ]:
len(o_bad_losers)
In [ ]:
m_rounds = matches['round'].unique()
o_rounds = odds['Round'].unique()
print('m_rounds: ', m_rounds)
print('o_rounds: ',o_rounds)
In [ ]:
# eliminate "BR", since it corresponds to a match for which there are no odds
m_rounds = m_rounds[m_rounds!='BR']
m_rounds
In [ ]:
# look up the index of 'R16' in one of the round lookup tables above
t=round_lookup_large.index[round_lookup_large.m_rounds=='R16'].tolist()
t
In [ ]:
def build_round_idx(tourney_id,m_data,o_data):
    # TODO: not yet implemented
    pass
In [ ]:
def get_match_ID(tourney_id, m_data, o_data):
    """
    function: get_match_ID(tourney_id, m_data, o_data)
    Input: a tournament ID, plus the records for that tournament from `matches` and `odds`
    Output: match IDs from the corresponding tournament in `matches` (work in progress)
    """
    m = len(m_data)
    n = len(o_data)
    if (m!=n):
        print("Warning: different number of matches in tournament {}".format(tourney_id))
    o_WSplitNames = o_data.Winner.str.split(pat=' ',expand=True)
    o_LSplitNames = o_data.Loser.str.split(pat=' ',expand=True)
    # for each odds record, find the `matches` record whose winner and loser names
    # contain the first "word" of the odds winner and loser names
    records = [m_data.loc[m_data.winner_name.str.contains(o_WSplitNames.iloc[i,0]) &
                          m_data.loser_name.str.contains(o_LSplitNames.iloc[i,0]),'match_num']
               for i in range(n)]
    return records
Two-dimensional problem: Rank + Height
In [ ]:
# extract from MySQL database info on rank points and height for both winner and loser, store in dataframe
with engine.begin() as connection:
    rawdata = pd.read_sql_query("""SELECT winner_rank_points, loser_rank_points, winner_ht, loser_ht FROM matches \
                                   WHERE tourney_date < '20150101' \
                                   AND tourney_date > '20110101' \
                                   AND winner_rank_points IS NOT NULL \
                                   AND loser_rank_points IS NOT NULL \
                                   AND winner_ht IS NOT NULL \
                                   AND loser_ht IS NOT NULL""", connection)
In [ ]:
# winner rank minus loser rank
rank_diff = (rawdata['winner_rank_points'] - rawdata['loser_rank_points']).values
# winner height minus loser height
ht_diff = (rawdata['winner_ht']-rawdata['loser_ht']).values
# index variable: True if higher ranked player won, False otherwise
y = (rank_diff > 0)
# higher ranked height minus lower ranked height
rank_ht_diff = np.where(y==True, ht_diff,-ht_diff)
# final dataset with two cols: difference in rankings, high ranked height minus low ranked height
X = np.column_stack([np.abs(rank_diff), rank_ht_diff])
In [ ]:
# for numerical well-behavedness, we need to scale and center the data
X=(X-np.mean(X,axis=0))/np.std(X,axis=0)
# prepend column of 1s to X
X=np.insert(X,0,1,axis=1)
Balance the numbers of correct and incorrect predictions via oversampling.
In [ ]:
# number of "True" minus the number of "False"
nsamps = sum(y) - sum(~y)
# the smaller class can be associated with either "True" or "False"
smallclass = bool(nsamps<0)
# sample with replacement from data of the appropriate class
samps = X[np.random.choice(np.where(y==smallclass)[0],size=nsamps),:]
# augment the old data
X1 = np.concatenate((X,samps),axis=0)
y1 = np.concatenate((y,np.zeros(nsamps,dtype=bool)))
In [ ]:
# plot the normalized data
fig, ax = plt.subplots(1,1)
ax.plot(X1[y1,1],X1[y1,2],"ro")
ax.plot(X1[~y1,1],X1[~y1,2],"bo")
ax.set_xlabel('Rank difference')
ax.set_ylabel('Height')
ax.set_title('Higher-rank-wins as a function of rank difference and height')
ax.legend(['High rank wins','Low rank wins'])
To perform the regression, we'll need to define the sigmoid function and a cost function. The former can take a scalar, vector, or matrix, and return the elementwise value of
$$ \frac{1}{1+e^{-z}} $$
In [ ]:
def sigmoid(z):
    '''
    Usage: sigmoid(z)
    Description: Computes the value of the sigmoid function for a scalar.
    For a vector or matrix, computes the value of the sigmoid function for each entry.
    '''
    return 1/(1+np.exp(-z))
The cost function is designed to take a regularization parameter lambda. For a non-regularized solution, lambda can be set equal to 0. The cost function returns both a cost and the gradient for any given value of parameters $\theta$.
In [ ]:
# define a cost function
def costFunction(theta,X,y,lam):
    '''
    Computes the cost and gradient for logistic regression.
    Input:
        theta (3x1 vector of parameters)
        X (nx3 matrix of feature values, first column all 1s)
        y (nx1 binary vector of outcomes, 1=higher ranked player won, 0 otherwise)
        lam (scalar: regularization parameter)
    Output:
        cost (scalar value of cost)
        grad (gradient of the cost with respect to theta)
    '''
    # number of data points
    m = len(y)
    # make sure vectors are column vectors
    theta = theta.reshape(-1,1)
    y = y.reshape(-1,1)
    # input to sigmoid function will be a column vector
    z = np.dot(X,theta)
    # cost function (the offset theta[0] is not regularized)
    J = (1/m)*np.sum(np.dot(-y.transpose(),np.log(sigmoid(z))) - \
        np.dot((1-y.transpose()),np.log(1-sigmoid(z)))) + \
        (lam/(2*m))*np.sum(theta[1:]**2)
    # gradient
    regterm = np.insert(theta[1:],0,0)
    grad = (1/m)*np.sum((sigmoid(z) - y)*X,0) + (lam/m)*regterm
    return J, grad
Small test: make sure the cost function works.
In [ ]:
# check that cost function works
theta = np.array([1,2,3])
lam = 0
cost, grad = costFunction(theta, X1, y1*1,lam)
print("cost:", cost)
print("grad:", grad)
For diagnostic purposes, we define a callback function that will print information about the state and gradient as the optimization algorithm proceeds.
In [ ]:
def callbackF(theta):
    global NFeval
    global X1
    global y1
    global lam
    cost,grad = costFunction(theta,X1,y1*1,lam)
    print("%4d %3.6f %3.6f %3.6f %3.6f %3.6f %3.6f %3.6f" % \
          (NFeval, theta[0], theta[1], theta[2], cost, grad[0], grad[1], grad[2]))
    NFeval+=1
Finally, we run the optimization.
In [ ]:
# run optimization
NFeval = 1
# three parameters: offset, rank-difference coefficient, height coefficient
initial_theta = np.array([-5.,4.,3.])
#print("iter t1 t2 t3 cost grad1 grad2 grad3")
#res = sp.optimize.minimize(lambda t: costFunction(t,X1,y1*1,lam), initial_theta, method='CG',\
#                           jac=True,options={'maxiter':100,'disp':True}, callback=callbackF)
res = sp.optimize.minimize(lambda t: costFunction(t,X1,y1*1,lam), initial_theta, method='CG',\
                           jac=True,options={'maxiter':100,'disp':True})
To see how it did, we replot the data with the logistic classifier superimposed over the top.
In [ ]:
# plot the normalized data with regression line
theta = res.x
fig, ax = plt.subplots(1,1)
ax.plot(X1[y1,1],X1[y1,2],"ro",markerfacecolor="None")
ax.plot(X1[~y1,1],X1[~y1,2],"bo",markerfacecolor="None")
xplot = np.array([-3,3])
yplot = (-1/theta[2])*(theta[1]*xplot+theta[0])
ax.plot(xplot,yplot,'g',linewidth=2)
ax.set_xlabel('Rank difference')
ax.set_ylabel('Height')
ax.set_title('Higher-rank-wins as a function of rank difference and height')
ax.set_ylim((-5,5))
Also develop a quantitative measure of success: count the number of correct predictions, and compare to what would have been predicted by rank alone.
In [ ]:
rank_success_rate = np.sum(y)/len(y)
y_pred = (np.dot(X,theta)>0)
prediction_success_rate = np.sum(~(y^y_pred))/len(y)
print("prediction success rate: ", prediction_success_rate)
print("rank success rate: ", rank_success_rate)
Try to duplicate the above with sklearn's canned logistic algorithm
In [ ]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(C=1., solver='lbfgs')
t = lr.fit(X1,y1*1)
In [ ]:
t=lr.get_params()
In [ ]:
t=lr.decision_function(X)
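A quick hedged comparison (assuming the cells above have run): `lr.coef_` should roughly agree with the hand-rolled `res.x`, up to how the intercept is handled (X1 carries a column of 1s and sklearn also fits its own intercept), and `lr.score` gives training accuracy directly.
In [ ]:
# compare sklearn's fit to the conjugate-gradient solution above
print("sklearn intercept/coefficients:", lr.intercept_, lr.coef_)
print("hand-rolled theta:             ", res.x)
print("sklearn training accuracy:     ", lr.score(X1, y1*1))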
Comments:
The above procedure turns tennis match outcomes into training data by assigning 1s to matches in which the higher ranked opponent won.
This section uses some basic scikit-learn functionality to train an SVM classifier on age and height data. The example is again a trivial one, but defines a workflow and sets the stage for investigating more complex relations.
In [ ]:
# we'll use the SVM package in the scikit library
from sklearn import svm
After fitting the SVM classifier, we'll need some helper functions to form contour plots. These helper functions are borrowed from the scikit-learn documentation, http://scikit-learn.org/stable/auto_examples/svm/plot_iris.html#sphx-glr-auto-examples-svm-plot-iris-py.
In [ ]:
# produce a dense grid of points in rectangle around the data
def make_meshgrid(x, y, h=.02):
    """Create a mesh of points to plot in

    Parameters
    ----------
    x: data to base x-axis meshgrid on
    y: data to base y-axis meshgrid on
    h: stepsize for meshgrid, optional

    Returns
    -------
    xx, yy : ndarray
    """
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return xx, yy

# produce a contour plot with predicted outcomes from SVM classifier
def plot_contours(ax, clf, xx, yy, **params):
    """Plot the decision boundaries for a classifier.

    Parameters
    ----------
    ax: matplotlib axes object
    clf: a classifier
    xx: meshgrid ndarray
    yy: meshgrid ndarray
    params: dictionary of params to pass to contourf, optional
    """
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out
We'll run our test on a slightly different set of data than last time. Here, we'll still classify matches as `1` if the higher ranked player wins and `0` otherwise, but we'll focus on age and height as our predictive features.
In [ ]:
# extract from MySQL database info on rank points, age, and height for both winner and loser, store in dataframe
with engine.begin() as connection:
    rawdata = pd.read_sql_query("""SELECT winner_rank_points, loser_rank_points, winner_age, loser_age, winner_ht, loser_ht \
                                   FROM matches \
                                   WHERE tourney_date > '20170101' \
                                   AND winner_rank_points IS NOT NULL \
                                   AND loser_rank_points IS NOT NULL \
                                   AND winner_age IS NOT NULL \
                                   AND loser_age IS NOT NULL \
                                   AND winner_ht IS NOT NULL \
                                   AND loser_ht IS NOT NULL""", connection)
In [ ]:
# list of pairs connecting `matches` tourney_name to `odds` Location
tourney_pairs = \
[('Brisbane','Brisbane') ,
('Chennai','Chennai') ,
('Doha','Doha') ,
('Auckland','Auckland') ,
('Sydney','Sydney') ,
('Australian Open','Melbourne') ,
('Johannesburg','Johannesburg') ,
('Santiago','Santiago') ,
('Zagreb','Zagreb') ,
('Costa Do Sauipe','Costa Do Sauipe') ,
('Rotterdam','Rotterdam') ,
('San Jose','San Jose') ,
('Buenos Aires','Buenos Aires') ,
('Marseille','Marseille') ,
('Memphis','Memphis') ,
('Acapulco','Acapulco') ,
('Delray Beach','Delray Beach') ,
('Dubai','Dubai') ,
('Indian Wells Masters','Indian Wells') ,
('Miami Masters','Miami') ,
('Casablanca','Casablanca') ,
('Houston','Houston') ,
('Monte Carlo Masters','Monte Carlo') ,
('Barcelona','Barcelona') ,
('Rome Masters','Rome') ,
('Belgrade','Belgrade') ,
('Munich','Munich') ,
('Estoril','Estoril') ,
('Madrid Masters', 'Madrid') ,
('Roland Garros','Paris') ,
('Nice','Nice') ,
("Queen's Club",'Queens Club') ,
('Halle','Halle') ,
('s-Hertogenbosch',"'s-Hertogenbosch") ,
('Eastbourne','Eastbourne') ,
('Wimbledon','London') ,
('Newport','Newport') ,
('Stuttgart','Stuttgart') ,
('Bastad','Bastad') ,
('Hamburg','Hamburg') ,
('Atlanta','Atlanta') ,
('Los Angeles','Los Angeles') ,
('Gstaad', 'Gstaad') ,
('Washington','Washington') ,
('Umag', 'Umag') ,
('Canada Masters','Toronto') ,
('Canada Masters','Montreal') ,
('Cincinnati Masters','Cincinnati') ,
('US Open','New York') ,
('Metz','Metz') ,
('Bucharest','Bucharest') ,
('Kuala Lumpur','Kuala Lumpur') ,
('Bangkok', 'Bangkok') ,
('Tokyo','Tokyo') ,
('Beijing', 'Beijing') ,
('Moscow','Moscow') ,
('Shanghai Masters','Shanghai') ,
('Montpellier','Montpellier') ,
('Stockholm','Stockholm') ,
('Vienna','Vienna') ,
('St. Petersburg', 'St. Petersburg') ,
('Basel','Basel') ,
('Valencia','Valencia') ,
('Paris Masters','Paris') ,
('Tour Finals','London'),
('New Haven', 'New Haven'),
('Kitzbuhel','Kitzbuhel'),
('Winston-Salem','Winston-Salem'),
('Sao Paulo','Sao Paulo'),
('Bogota','Bogota'),
('Vina del Mar','Vina del Mar'),
('Santiago','Vina del Mar'),
('Estoril','Oeiras'),
('Power Horse Cup','Dusseldorf'),
('Rio de Janeiro','Rio de Janeiro'),
('Dusseldorf','Dusseldorf'),
('Shenzhen','Shenzhen'),
('Quito','Quito'),
('Istanbul','Istanbul'),
('Geneva','Geneva'),
('London','Queens Club'),
('Nottingham','Nottingham'),
('Sofia','Sofia'),
('Marrakech','Marrakech'),
('Los Cabos','Los Cabos'),
('Us Open','New York'),
('St.Petersburg','St. Petersburg'),
('Chengdu','Chengdu'),
('Antwerp','Antwerp'),
('London','London')]
# use the name pairs as columns in a dataframe
tourney_lookup = pd.DataFrame(tourney_pairs,columns = ['m_name','o_name'])
tourney_lookup.m_name = tourney_lookup.m_name.str.lower().str.strip()
tourney_lookup.o_name = tourney_lookup.o_name.str.lower().str.strip()
In [ ]:
# this nx2 array contains the differences in ages and the differences in height
X = pd.concat([rawdata.iloc[:,2]-rawdata.iloc[:,3], \
rawdata.iloc[:,4]-rawdata.iloc[:,5]], axis=1).values
# this nx1 binary array indicates whether the match was a "success" or a "failure", as predicted by ranking differences
y = (rawdata.iloc[:,0]-rawdata.iloc[:,1]).values > 0
In [ ]:
# for numerical well-behavedness, we need to scale and center the data
X=(X-np.mean(X,axis=0))/np.std(X,axis=0)
In [ ]:
# plot the normalized data
fig, ax = plt.subplots(1,1)
ax.plot(X[y,0],X[y,1],"ro")
ax.plot(X[~y,0],X[~y,1],"bo")
ax.set_xlabel('Age')
ax.set_ylabel('Height')
ax.set_title('Higher-rank-wins as a function of age and height')
We'll use the scikit-learn svm package to train an SVM classifier on this data. We'll plot the results as a contour graph.
In [ ]:
# find the SVM classifier
clf = svm.SVC()
clf.fit(X, y)
# generate a dense grid for producing a contour plot
X0, X1 = X[:, 0], X[:, 1]
xx, yy = make_meshgrid(X0, X1)
# feed the grid into the plot_contours routine
fig, ax = plt.subplots(1, 1)
plot_contours(ax, clf, xx, yy,
              cmap=plt.cm.coolwarm, alpha=0.8)
ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set_xlabel('Age difference')
ax.set_ylabel('Height difference')
ax.set_xticks(())
ax.set_yticks(())
ax.set_title('SVM classifier for height/age data')
Note that these features produce a fairly mixed set of points, so they are unlikely to be highly predictive. But it is interesting to note the pocket of blue in the lower left corner: it suggests that, all other things being equal, players who are younger and shorter are likely to fare worse than predicted.
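As a rough, hedged sanity check on that impression (assuming the cells above have run), we can compare the classifier's training accuracy to the rank-only baseline; note this is resubstitution accuracy, not a cross-validated estimate:
In [ ]:
# mean training accuracy of the SVM vs. the fraction of matches won by the higher-ranked player
print("SVM training accuracy: ", clf.score(X, y))
print("rank-only baseline:    ", np.sum(y)/len(y))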
In [ ]:
# name of database
db_name = "tennis"
# name of db user
username = "testuser"
# db password for db user
password = "test623"
# location of atp data files
atpfile_directory = "../data/tennis_atp-master/"
# location of odds data files
oddsfiles_directory = "../data/odds_data/"
#%%
#
# PACKAGES
#
import sqlalchemy # pandas-mysql interface library
import sqlalchemy.exc # exception handling
from sqlalchemy import create_engine # needed to define db interface
import glob # for file manipulation
import sys # for defining behavior under errors
#%%
#
# This cell tries to connect to the mysql database "db_name" with the login
# info supplied above. If it succeeds, it prints out the version number of
# mysql, if it fails, it exits gracefully.
#
# create an engine for interacting with the MySQL database
connection = None
try:
    eng_str = 'mysql+mysqldb://' + username + ':' + password + '@localhost/' + db_name
    engine = create_engine(eng_str)
    connection = engine.connect()
    version = connection.execute("SELECT VERSION()")
    print("Database version : ")
    print(version.fetchone())
# report what went wrong if this fails
except sqlalchemy.exc.DatabaseError as e:
    print("Error %s:" % str(e))
    sys.exit(1)
# close the connection
finally:
    if connection:
        connection.close()
    else:
        print("Failed to create connection.")
In [ ]:
# extract odds info from the MySQL database
with engine.begin() as connection:
    rawdata = pd.read_sql_query("""SELECT PSW, PSL, WRank, LRank FROM odds \
                                   WHERE PSW IS NOT NULL \
                                   AND PSL IS NOT NULL \
                                   AND WRank IS NOT NULL \
                                   AND LRank IS NOT NULL;""", connection)
In [ ]:
# stake one unit on the higher-ranked player in every match
investment = len(rawdata)
# bets that pay off: the better-ranked (numerically smaller WRank) player won
good_call_idx = (rawdata["LRank"]-rawdata["WRank"]>0)
winner_odds = rawdata["PSW"]
# total returned: the quoted odds times the stake, for each winning bet
gain = sum(winner_odds*good_call_idx)
roi = gain - investment
print("total invested: ", investment)
print("return on investment: ", roi)
In [ ]:
np.min(winner_odds)
In [ ]:
rawdata
This work is just the beginning of this analysis. The gold standard for professional tennis match prediction is to beat the betting market. There is odds data available for many of these matches, and the proper metric for predictive success is probably percentage increase in profit.
Further work includes the following: using `join` commands to extract more complex subsets of the data (e.g. court-type specific data, players of a certain origin, players with certain tournament play patterns, etc.), as sketched below.
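One hedged example of that direction (the `surface` column exists in Sackmann's CSV data, but whether it was carried into this database is an assumption):
In [ ]:
# court-type specific subset: matches played on clay
with engine.begin() as connection:
    clay = pd.read_sql_query("""SELECT winner_name, loser_name, tourney_name \
                                FROM matches \
                                WHERE surface = 'Clay';""", connection)
print(len(clay))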
In [ ]:
n=10
X = np.random.normal(0,1,(n,2))
y = np.random.choice(a=[False, True], size=n)
X1= np.insert(X,0,1,axis=1)
In [ ]:
n0=100
n1=100
m0 = 0
m1 = 0
s0 = 1
s1 = 1
X = np.concatenate((np.random.normal(m0,s0,(n0,2)), \
np.concatenate((np.random.normal(0,s1,size=(n1,1)), \
np.random.normal(m1,s1,size=(n1,1))),axis=1)),axis=0)
X1 = np.insert(X,0,1,axis=1)
y = np.concatenate((np.ones((n0),dtype=bool),np.zeros((n1),dtype=bool)))
In [ ]:
with engine.begin() as connection:
    rawdata = pd.read_sql_query("""SELECT DISTINCT tourney_date, winner_name, winner_rank_points, winner_rank, \
                                   loser_name, loser_rank_points, loser_rank FROM matches \
                                   WHERE tourney_date > '20150101' \
                                   AND tourney_date < '20160101' \
                                   AND tourney_name = "Wimbledon"; """, connection)
In [ ]:
with engine.begin() as connection:
    rawdata = pd.read_sql_query("""SELECT odds.Date, matches.tourney_date, odds.Tournament, \
                                   SUBSTRING(odds.winner,1,LOCATE(' ',odds.winner)) AS 'Winner', \
                                   SUBSTRING(odds.loser,1,LOCATE(' ',odds.loser)) AS 'Loser', \
                                   odds.PSW, odds.PSL, \
                                   matches.winner_rank_points, matches.loser_rank_points \
                                   FROM odds \
                                   INNER JOIN matches \
                                   ON odds.Date = matches.tourney_date \
                                   AND odds.Tournament = matches.tourney_name \
                                   AND odds.Tournament = "Wimbledon" \
                                   AND odds.Date > '20150101' \
                                   ; """, connection)
In [ ]:
with engine.begin() as connection:
    rawdata = pd.read_sql_query("""SELECT odds.Date, matches.tourney_date, odds.Tournament, \
                                   SUBSTRING(odds.winner,1,LOCATE(' ',odds.winner)), \
                                   SUBSTRING(odds.loser,1,LOCATE(' ',odds.loser)) \
                                   FROM odds \
                                   INNER JOIN matches \
                                   ON (odds.Date = matches.tourney_date \
                                   AND odds.Tournament = matches.tourney_name \
                                   AND odds.Tournament = "Wimbledon" \
                                   AND matches.winner_name REGEXP SUBSTRING(odds.winner,1,LOCATE(' ',odds.winner)) \
                                   AND matches.loser_name REGEXP SUBSTRING(odds.loser,1,LOCATE(' ',odds.loser))) \
                                   ; """, connection)
In [ ]:
with engine.begin() as connection:
    rawdata = pd.read_sql_query("""SELECT (matches.winner_name REGEXP SUBSTRING(odds.winner,1,LOCATE(' ',odds.winner))) AS 'Namematch', \
                                   SUBSTRING(odds.winner,1,LOCATE(' ',odds.winner)) AS 'Winner', \
                                   SUBSTRING(odds.loser,1,LOCATE(' ',odds.loser)) AS 'Loser', \
                                   matches.winner_name, matches.loser_name \
                                   FROM matches \
                                   INNER JOIN odds \
                                   ON (odds.Tournament="Wimbledon" AND \
                                   odds.Winner = "Tomic B.") \
                                   ; """, connection)
In [ ]:
# create an engine for interacting with the MySQL database
db_name = "test"
try:
    engine.execute("CREATE DATABASE IF NOT EXISTS " + db_name) # create db
    engine.execute("USE " + db_name) # select new db
# report what went wrong if this fails
except sqlalchemy.exc.DatabaseError as e:
    print("Error %s:" % str(e))
    sys.exit(1)
In [ ]:
with engine.begin() as connection:
    connection.execute("CREATE TABLE matches (tourney_name VARCHAR(256));")
    connection.execute("ALTER TABLE matches ADD COLUMN tourney_date DATE;")
    connection.execute("ALTER TABLE matches ADD COLUMN winner_name VARCHAR(256);")
    connection.execute("ALTER TABLE matches ADD COLUMN winner_rank_points SMALLINT UNSIGNED;")
    connection.execute("ALTER TABLE matches ADD COLUMN loser_name VARCHAR(256);")
    connection.execute("ALTER TABLE matches ADD COLUMN loser_rank_points SMALLINT UNSIGNED;")
In [ ]:
with engine.begin() as connection:
    query=("""CREATE TABLE odds
              (Tournament VARCHAR(256),
               Date DATE,
               Winner VARCHAR(256),
               Loser VARCHAR(256),
               PSW DECIMAL(5,3),
               PSL DECIMAL(5,3))
              ;""")
    connection.execute(query)
In [ ]:
with engine.begin() as connection:
    query=("""CREATE TABLE t1
              (name VARCHAR(256),
               id tinyint unsigned)
              ;""")
    connection.execute(query)
In [ ]:
with engine.begin() as connection:
    query=("""CREATE TABLE t2
              (name VARCHAR(256),
               id tinyint unsigned)
              ;""")
    connection.execute(query)
In [ ]:
with engine.begin() as connection:
    connection.execute("""insert into t1 (name, id) values \
                          ("Bob the Blob", 3), \
                          ("Pedro el Gato", 4), \
                          ("Josie Slatterly", 5), \
                          ("Urs Burs", 4);""")
In [ ]:
with engine.begin() as connection:
    connection.execute("""insert into t2 (name, id) values \
                          ("Blob B.", 3), \
                          ("Gato P.", 5), \
                          ("Slatterly J.", 5), \
                          ("Burs U.", 3);""")
In [ ]:
with engine.begin() as connection:
    rawdata = pd.read_sql_query("""SELECT t1.name, t1.id, t2.name, t2.id, \
                                   SUBSTRING(t2.name,1,LOCATE(' ',t2.name)) \
                                   FROM t1 \
                                   JOIN t2 \
                                   ON (t1.name REGEXP SUBSTRING(t2.name,1,LOCATE(' ',t2.name))) \
                                   ; """, connection)
In [ ]:
with engine.begin() as connection:
    connection.execute("""insert into matches values ('Wimbledon', '20150629', \
                          'Jarkko Nieminen', 564, 'Lleyton Hewitt', 118);""")
    connection.execute("""insert into matches values ('Wimbledon', '20150629', \
                          'Pierre Hugues Herbert', 353, 'Hyeon Chung', 79);""")
    connection.execute("""insert into matches values ('Wimbledon', '20160629', \
                          'Bernard Tomic', 1355, 'Jan Lennard Struff', 112);""")
In [ ]:
with engine.begin() as connection:
    connection.execute("""insert into odds values ('Wimbledon', '20150629', \
                          'Przysiezny M.','Ljubicic I.', 4.880, 1.230);""")
    connection.execute("""insert into odds values ('Wimbledon', '20150629', \
                          'Lopez F.','Levine J.', 1.270, 4.300);""")
    connection.execute("""insert into odds values ('Wimbledon', '20160621', \
                          'Lu Y.H.','Zeballos H.', 1.910, 2.020);""")
    connection.execute("""insert into odds values ('Wimbledon', '20160629', \
                          'Tomic B.','Struff J.L.', 1.910, 2.020);""")
In [ ]:
with engine.begin() as connection:
    connection.execute("""CREATE VIEW odds_names AS ( \
                          SELECT \
                          SUBSTRING(t2.name,1,LOCATE(' ',t2.name)) AS 'Name' \
                          FROM t2 \
                          );""")
In [ ]:
with engine.begin() as connection:
    connection.execute("""select * from t1 inner join odds_names on (t1.name regexp odds_names.Name)""")
In [ ]:
# number of "True" minus the number of "False"
nsamps = sum(y) - sum(~y)
# the smaller class can be associated with either "True" or "False"
smallclass = bool(nsamps<0)
# sample with replacement from data of the appropriate class
samps = X1[np.random.choice(np.where(y==smallclass)[0],size=nsamps)]
# augment the old data
X1 = np.concatenate((X,samps),axis=0)
y1 = np.concatenate((y,np.zeros(nsamps,dtype=bool)))
In [ ]:
import IPython
IPython.display.display(matches)
In [ ]:
startdate = '20100101'
enddate = '20171231'
with engine.begin() as connection:
    matchdata = pd.read_sql_query("""SELECT tourney_date AS Date, tourney_name AS Tournament, \
                                     winner_id AS WID, winner_name AS WName, winner_rank_points AS WPts, \
                                     loser_id AS LID, loser_name AS LName, loser_rank_points AS LPts \
                                     FROM matches \
                                     WHERE tourney_date < '""" + enddate + """' \
                                     AND tourney_date > '""" + startdate + """' \
                                     AND tourney_level = 'G' \
                                     AND winner_rank_points IS NOT NULL \
                                     AND loser_rank_points IS NOT NULL \
                                     ;""", connection)
    oddsdata = pd.read_sql_query("""SELECT Date, Tournament, \
                                    Winner, WPts, Loser, LPts, \
                                    PSW, PSL \
                                    FROM odds \
                                    WHERE Date < '""" + enddate + """' \
                                    AND Date > '""" + startdate + """' \
                                    AND Series = 'Grand Slam' \
                                    ;""", connection)