Merging "odds" and "player" data

Author: Carl Toews
File: merge_datasets.ipynb

Description:

An obvious metric for assessing the quality of learning algorithms is to compare their profitability against on-line betting markets. The notebook setup_mysql.ipynb loads into a MySQL database files from two distinct sources, one containing characteristics of the players (good for learning), the other containing odds for each match (good for assessing the learning). Each record from the latter presumably corresponds to a unique record from the former.

Unfortunately, there are a couple of issues that make establishing this correspondence difficult, including each of the following:

Incorrect match dates
Incorrect spellings of player names
Listing the winner as the loser, and vice versa.
Inconsistent tournament names
Inconsistent tournament ID numbers

This notebook records some of the methods I used to establish the correspondence. It ultimately produces two DataFrames, one called matches (with player characteristics) and one called odds (with betting characteristics) of the same size and a one-to-one numerical key uniquely identifying each match across both datasets.

Examples of bad data

Bad player names



In [1]:

    
import IPython as IP
IP.display.Image("example_of_name_matching_problems_mod.png",width=400,height=200,embed=True)









    Out[1]:

Bad match dates



In [2]:

    
IP.display.Image("../aux/bad_csv_data_mod.png",width=500,height=500,embed=True)









    Out[2]:

Setup MySQL connection



In [1]:

    
# name of database
db_name = "tennis"
# name of db user
username = "testuser"
# db password for db user
password = "test623"
# location of atp data files
atpfile_directory = "../data/tennis_atp-master/"
# location of odds data files
oddsfiles_directory = "../data/odds_data/"
# we'll read and write pickle files here
pickle_dir = '../pickle_files/'

All import statements here.



In [2]:

    
import sqlalchemy # pandas-mysql interface library
import sqlalchemy.exc # exception handling
from   sqlalchemy import create_engine  # needed to define db interface
import sys # for defining behavior under errors
import numpy as np # numerical libraries
import scipy as sp
import pandas as pd # for data analysis
import pandas.io.sql as sql # for interfacing with MySQL database
import matplotlib as mpl # a big library with plotting functionality
import matplotlib.pyplot as plt # a subset of matplotlib with most of the useful tools
import IPython as IP
%matplotlib inline 
import pdb
#%qtconsole

Try to connect to the tennis database on the local mysql host. If successful, print out the MySQL version number, if unsuccessful, exit gracefully.



In [3]:

    
# create an engine for interacting with the MySQL database
try:
    eng_str = 'mysql+mysqldb://' + username + ':' + password + '@localhost/' + db_name
    engine = create_engine(eng_str)
    connection = engine.connect()
    version = connection.execute("SELECT VERSION()")
    print("Database version : ")
    print(version.fetchone())
    
# report what went wrong if this fails.    
except sqlalchemy.exc.DatabaseError as e:
    reason = e.message
    print("Error %s:" % (reason))
    sys.exit(1)

# close the connection
finally:            
    if connection:    
        connection.close()
    else:
        print("Failed to create connection.")









    



Database version : 
('5.7.16',)

Extract data from MySQL database

We'll focus on cleaning the seven years of data between 2010 and 2016.



In [4]:

    
# focus on most recent data; exclude Davis Cup stuff
startdate = '20100101'
enddate = '20161231'

with engine.begin() as connection:
    odds = pd.read_sql_query("""SELECT * FROM odds \
                                WHERE DATE >= '""" + startdate + """' \
                                AND DATE <= '""" + enddate + """';""", connection)
    
with engine.begin() as connection:
    matches = pd.read_sql_query("""SELECT * FROM matches \
                                WHERE tourney_date >= '""" + startdate + """' \
                                AND tourney_date <= '""" + enddate + """' \
                                AND tourney_name NOT LIKE 'Davis%%';""", connection)
    
# view results
IP.display.display(odds[['ATP','Location','Tournament','Date','Round',
                         'Winner','Loser']].sort_values('Date')[0:5])
IP.display.display(matches[['tourney_id','tourney_name','tourney_date','round',
                            'winner_name','loser_name']].sort_values('tourney_date')[0:5])









    







  
    
      
      ATP
      Location
      Tournament
      Date
      Round
      Winner
      Loser
    
  
  
    
      0
      1
      Brisbane
      Brisbane International
      2010-01-04
      1st Round
      Gasquet R.
      Nieminen J.
    
    
      34
      2
      Chennai
      Chennai Open
      2010-01-04
      1st Round
      Tipsarevic J.
      Moya C.
    
    
      33
      2
      Chennai
      Chennai Open
      2010-01-04
      1st Round
      Giraldo S.
      Phau B.
    
    
      32
      2
      Chennai
      Chennai Open
      2010-01-04
      1st Round
      Hajek J.
      Greul S.
    
    
      31
      2
      Chennai
      Chennai Open
      2010-01-04
      1st Round
      Cilic M.
      Kunitsyn I.
    
  








    







  
    
      
      tourney_id
      tourney_name
      tourney_date
      round
      winner_name
      loser_name
    
  
  
    
      1411
      2010-339
      Brisbane
      2010-01-03
      R32
      Andy Roddick
      Peter Luczak
    
    
      1440
      2010-339
      Brisbane
      2010-01-03
      SF
      Radek Stepanek
      Gael Monfils
    
    
      1439
      2010-339
      Brisbane
      2010-01-03
      SF
      Andy Roddick
      Tomas Berdych
    
    
      1438
      2010-339
      Brisbane
      2010-01-03
      QF
      Radek Stepanek
      Wayne Odesnik
    
    
      1437
      2010-339
      Brisbane
      2010-01-03
      QF
      Gael Monfils
      James Blake

Note some issues already:

Date in matches are often pegged to the start date of the tournament, while dates in odds are often tied to the match itself. Thus time-ordering leads to different results.
The variable tourney_name in matches often corresponds to the variable Location in odds (but not always!)
Rounds are denoted in different ways. "1st round" in odds can match to R32, R64, or R128 on matches.
Name formats are different.
While matches has a unique tournament ID for each tournament, odds recycles tournament ids each year.

Before doing any serious data processing, tidy up the strings (strip whitespace, convert to lowercase, replace dashes by space)



In [5]:

    
odds[['Location','Winner','Loser']] = \
    odds[['Location','Winner','Loser']].\
    apply(lambda x: x.str.strip().str.lower().str.replace('-',' '),axis=1)
matches[['tourney_name','winner_name','loser_name']] = \
    matches[['tourney_name','winner_name','loser_name']].\
    apply(lambda x: x.str.strip().str.lower().str.replace('-',' '),axis=1)

Step I: pair unique tournaments from each set

Each unique tournament has an identifying key in both odds and matches, but the keys are different, and not one to one. Our first task is to associate unique tournaments with one another. We'll start by grouping both odds and matches by unique tournament number.



In [6]:

    
# matches tournament identifiers are unique
g_matches = matches.groupby('tourney_id')
# odds tournament identifiers are recycled every year
g_odds= odds.groupby(['ATP','fname'])

For each tournament (group), extract features to assist in "tourney-matching", including the number of matches in the tournament, the maxium and minimum listed dates, the location (in odds) and the tournament name (in matches.)



In [7]:

    
def extract_odds_features(group):
    sizes = len(group)
    min_date = group['Date'].min()
    max_date = group['Date'].max()
    location = group['Location'].unique()[0]
    return pd.Series({'size': sizes,'min_date':min_date,\
                      'max_date':max_date,'location':location})

def extract_matches_features(group):
    sizes = len(group)
    min_date = group['tourney_date'].min()
    max_date = group['tourney_date'].max()
    tourney_name = group['tourney_name'].unique()[0]
    return pd.Series({'size': sizes,'min_date':min_date,\
                      'max_date':max_date,'tourney_name':tourney_name})
    
g_odds = g_odds.apply(extract_odds_features).reset_index()
g_matches = g_matches.apply(extract_matches_features).reset_index()

Define lookup-table connecting Location in odds to tourney_name in matches. Establishing this matching is not totally straightforward: in many cases, the two columns are identical, but in others they are not, with no obvious connection. (Eg. "Roland Garros" is the tourney_name listed for the French Open in matches, while it's Location is "Paris".) Of the 100 or so difference tennis tournaments, about 60 have a direct correspondence, and the others need some massaging. The derived lookup table is the product of considerable grubby data wrangling.



In [8]:

    
tourney_lookup = pd.read_pickle(pickle_dir + 'tourney_lookup.pkl')
print("Snapshot of lookup table:")
IP.display.display(tourney_lookup.sort_values('o_name')[15:25])









    



Snapshot of lookup table:






    







  
    
      
      m_name
      o_name
    
  
  
    
      17
      casablanca
      casablanca
    
    
      18
      chengdu
      chengdu
    
    
      19
      chennai
      chennai
    
    
      20
      cincinnati masters
      cincinnati
    
    
      21
      costa do sauipe
      costa do sauipe
    
    
      22
      delray beach
      delray beach
    
    
      23
      doha
      doha
    
    
      24
      dubai
      dubai
    
    
      25
      dusseldorf
      dusseldorf
    
    
      58
      power horse cup
      dusseldorf

Define a function that will take as input a set of features from one tournament in odds, and try to match it to one tournament in matches. The algorithm make sure that dates are roughly the same, and that there is a correspondence between Location and tourney_name.



In [9]:

    
def get_tourney_ID(o_row):
    """
    function:  get_tourney_ID(o_row)
    Input:   row from dataframe g_odds
    Output:  a Series object with two elements: 1) a match ID,
                and 2), a flag of True if the sizes of the two tournmanets are identical
    """

    # calculate the diffence in start/stop dates between this tournament and those in `matches`.
    min_date_delta = np.abs(g_matches['min_date'] - o_row['min_date']).apply(lambda x: x.days)
    max_date_delta = np.abs(g_matches['max_date'] - o_row['max_date']).apply(lambda x: x.days)

    # find a list of candidate tournament names, based on lookup table 
    mtchs = (tourney_lookup['o_name']==o_row['location'])
    if sum(mtchs)>0:
        m_name = tourney_lookup.loc[mtchs,'m_name']
    else:
        print('no match found for record {}'.format(o_row['location']))
        return ['Nan','Nan']
    
    # the "right" tournament has the right name, and reasonable close start or stop dates
    idx = ((min_date_delta <=3) | (max_date_delta <=1)) & (g_matches['tourney_name'].isin(m_name))
    record = g_matches.loc[idx,'tourney_id']
    
    # if there are no matches, print some diagnostic information and don't assign a match
    if len(record)<1:
        print("Warning:  no match found for `odds` match {}, year {}".format(o_row.ATP, o_row.fname))
        print("min date delta: {}, max date delta: {}, g_matches: {}".format(np.min(min_date_delta), \
                                                                             np.min(max_date_delta), \
                                                                             g_matches.loc[g_matches['tourney_name'].isin(m_name),'tourney_name'])) 
        return pd.Series({'ID':'None','size':'NA'})
    
    # if there are too many matches, print a warning and don't assign a match.
    elif (len(record)>1):
        print("Warning:  multiple matches found for `odds` match {}".format(o_row.ATP))
        return pd.Series({'ID':'Multiple','size':'NA'})

    # otherwise, assign a match, and check if the sizes of the matches are consistent (a good double-check)
    else:
        size_flag = (g_matches.loc[idx,'size']==o_row['size'])
        return pd.Series({'ID':record.iloc[0],'size':size_flag.iloc[0]})

Perform the matches.



In [10]:

    
# add columns to g_odds to hold match ID and also info about size-correspondence
g_odds.insert(len(g_odds.columns),'ID','None')
g_odds.insert(len(g_odds.columns),'sizes_match','NA')

# perform the match
g_odds[['ID','sizes_match']] = g_odds.apply(get_tourney_ID,axis=1).values

Merge match numbers back into bigger odds table, checking that the sizes of each "matched" tournament are the same.



In [11]:

    
# add "size" columns to both dataframes
odds = pd.merge(g_odds[['ATP','fname','ID','size']],odds,how='inner',on=['ATP','fname'])
matches = pd.merge(g_matches[['tourney_id','size']],matches,how='inner',on=['tourney_id'])

# sum the sizes
if sum(g_odds['sizes_match']==True) != len(g_odds):
    print("Warning:  at least one tournament in `odds` is matched to a \
    tournament in `matches` of a different size.")
else:
    print("Sizes seem to match up.")









    



Sizes seem to match up.

Step II: Pair matches within each tournament.

To help in this process, we first insert an integer index key into both odds and matches. Also assign a match number to each match within a tournament in odds.



In [12]:

    
# for each tournament, label match numbers from 1 to n_tourneys
odds.insert(5,'match_num',0)
grouped = odds[['ID','match_num']].groupby('ID')
odds['match_num'] = grouped.transform(lambda x: 1+np.arange(len(x)))

# add keys to both odds and match data
odds.insert(len(odds.columns),'key',np.arange(len(odds)))
matches.insert(len(matches.columns),'key',np.arange(len(matches)))

We'll use round information to help with the task of pairing individual matches . To use round information effectively, we need to establish a correspondence between round signifiers in the two datasets. The exact correspondence depends on how many tennis matches are in each tournmanet.



In [13]:

    
# figure out how many discrete sizes there are
print("size in odds:  ", odds['size'].unique())
print("size in matches:  ", matches['size'].unique())
print("unique round designators in odds:  ", odds.Round.unique())
print("unique round designators in matches:  ", matches['round'].unique())









    



size in odds:   [ 31  27 127  95  55  47  15]
size in matches:   [ 55  27  31  47  95 127  15  23  26  64]
unique round designators in odds:   ['1st Round' '2nd Round' 'Quarterfinals' 'Semifinals' 'The Final'
 '3rd Round' '4th Round' 'Round Robin']
unique round designators in matches:   ['R64' 'R32' 'R16' 'QF' 'SF' 'F' 'R128' 'RR' 'BR']



In [14]:

    
# create a lookup table to be able to match on rounds
m_rounds = ['R128','R64','R32','R16','QF','SF','F','RR']
o_rounds = ['1st Round','2nd Round','3rd Round','4th Round', \
            'Quarterfinals','Semifinals','The Final','Round Robin']

round_lookup_small = pd.DataFrame({'m_rounds': m_rounds[2:-1],\
                                   'o_rounds':o_rounds[0:2]+o_rounds[4:-1]})

round_lookup_medium = pd.DataFrame({'m_rounds': m_rounds[1:-1],\
                                   'o_rounds':o_rounds[0:3]+o_rounds[4:-1]})

round_lookup_large = pd.DataFrame({'m_rounds': m_rounds[0:-1],\
                                   'o_rounds':o_rounds[0:-1]})

round_lookup_RR  = pd.DataFrame({'m_rounds':m_rounds[5:],\
                                 'o_rounds':o_rounds[5:]})

With the round lookup tables defind, define a function that takes a row of odds, and figures how to map its round information to the round information in matches.



In [15]:

    
def map_rounds(x):

    cur_name = x['Round']
    t_size = x['size']

    if t_size in [27,31]:
        new_name = round_lookup_small.loc[round_lookup_small.o_rounds==cur_name,'m_rounds']
    elif t_size in [47,55]:
        new_name = round_lookup_medium.loc[round_lookup_medium.o_rounds==cur_name,'m_rounds']
    elif t_size in [95, 127]:
        new_name = round_lookup_large.loc[round_lookup_large.o_rounds==cur_name,'m_rounds']
    else:
        new_name = round_lookup_RR.loc[round_lookup_RR.o_rounds==cur_name,'m_rounds'] 

    return new_name.iloc[0]

We'll apply that mapping to each row of the odds dataframe.



In [16]:

    
# translate round indentifier appropriately
odds.insert(4,'round','TBD')
odds['round'] = odds.apply(map_rounds,axis=1).values

IP.display.display(odds[0:4])
IP.display.display(matches[0:4])









    







  
    
      
      ATP
      fname
      ID
      size
      round
      Location
      match_num
      Tournament
      Date
      Series
      ...
      LBL
      PSW
      PSL
      SJW
      SJL
      MaxW
      MaxL
      AvgW
      AvgL
      key
    
  
  
    
      0
      1
      2010
      2010-339
      31
      R32
      brisbane
      1
      Brisbane International
      2010-01-04
      ATP250
      ...
      2.375
      1.526
      2.740
      1.500
      2.500
      NaN
      NaN
      NaN
      NaN
      0
    
    
      1
      1
      2010
      2010-339
      31
      R32
      brisbane
      2
      Brisbane International
      2010-01-04
      ATP250
      ...
      1.571
      2.140
      1.813
      2.250
      1.571
      NaN
      NaN
      NaN
      NaN
      1
    
    
      2
      1
      2010
      2010-339
      31
      R32
      brisbane
      3
      Brisbane International
      2010-01-04
      ATP250
      ...
      2.100
      1.676
      2.360
      1.615
      2.200
      NaN
      NaN
      NaN
      NaN
      2
    
    
      3
      1
      2010
      2010-339
      31
      R32
      brisbane
      4
      Brisbane International
      2010-01-04
      ATP250
      ...
      1.500
      2.580
      1.581
      2.500
      1.500
      NaN
      NaN
      NaN
      NaN
      3
    
  

4 rows × 48 columns







    







  
    
      
      tourney_id
      size
      tourney_name
      surface
      draw_size
      tourney_level
      tourney_date
      match_num
      winner_id
      winner_seed
      ...
      l_df
      l_svpt
      l_1stIn
      l_1stWon
      l_2ndWon
      l_SvGms
      l_bpSaved
      l_bpFaced
      match_type
      key
    
  
  
    
      0
      2010-1536
      55
      madrid masters
      Clay
      56
      M
      2010-05-09
      1
      103794
      NaN
      ...
      2.0
      46.0
      25.0
      11.0
      8.0
      7.0
      4.0
      9.0
      atp_matches
      0
    
    
      1
      2010-1536
      55
      madrid masters
      Clay
      56
      M
      2010-05-09
      2
      104919
      NaN
      ...
      7.0
      110.0
      71.0
      52.0
      19.0
      17.0
      1.0
      4.0
      atp_matches
      1
    
    
      2
      2010-1536
      55
      madrid masters
      Clay
      56
      M
      2010-05-09
      3
      104527
      15.0
      ...
      1.0
      62.0
      33.0
      17.0
      11.0
      10.0
      3.0
      9.0
      atp_matches
      2
    
    
      3
      2010-1536
      55
      madrid masters
      Clay
      56
      M
      2010-05-09
      4
      104022
      10.0
      ...
      1.0
      73.0
      45.0
      29.0
      12.0
      12.0
      5.0
      10.0
      atp_matches
      3
    
  

4 rows × 52 columns

Before going on, a quick sanity check: is the set of matches in matches that are in odds the same size as the set of matches in odds that are in matches? Does the size column have the correct data?



In [17]:

    
t1=odds.ID.drop_duplicates().sort_values()
t2=matches.tourney_id.drop_duplicates().sort_values()
m_sizes=matches.loc[matches.tourney_id.isin(t1),['tourney_id','size']].drop_duplicates()
o_sizes=odds.loc[odds.ID.isin(t2),['ID','size']].drop_duplicates()
#comp = pd.merge(o_sizes,m_sizes,how='outer',left_on='ID',right_on='tourney_id')
print('sum of sizes of tournaments in odds: ', np.sum(o_sizes['size']))
print('sum of sizes of tournaments in matches: ', np.sum(m_sizes['size']))









    



sum of sizes of tournaments in odds:  18448
sum of sizes of tournaments in matches:  18448

Looks good. Now pare down the matches dataframe to contain only records in odds.



In [18]:

    
matches = matches.loc[matches.tourney_id.isin(t1),:]
print("number of records in `odds`:  ", len(odds))
print("number of records in `matches`:  ", len(matches))









    



number of records in `odds`:   18448
number of records in `matches`:   18448

To do player name matching, we split each name, and attempt to match on substrings.



In [19]:

    
# extract dataframe with player names split into discrete 'words'
m_players = pd.merge(matches.winner_name.str.split(pat=' ',expand=True), \
                     matches.loser_name.str.split(pat=' ',expand=True), \
                    how='inner',left_index=True, right_index=True,suffixes=('_W','_L'))  

# add on tournament, round, and match identifiers
m_players = pd.merge(matches[['tourney_id','match_num', 'round','key']], m_players,\
                    how='inner',left_index=True, right_index=True).sort_values(['tourney_id','round','1_W','1_L'])

# extract dataframe with player names split into discrete 'words'
o_players = pd.merge(odds.Winner.str.split(pat=' ',expand=True), \
                     odds.Loser.str.split(pat=' ',expand=True), \
                    how='inner',left_index=True, right_index=True,suffixes=('_W','_L')) 

# add on tournament and round identifiers
o_players = pd.merge(odds[['ID','round','match_num','key']], o_players,\
                    how='inner',left_index=True, right_index=True).sort_values(['ID','round','0_W','0_L'])

print("m_players: ")
IP.display.display(m_players[0:5])
print("o_players")
IP.display.display(o_players[0:5])









    



m_players: 






    







  
    
      
      tourney_id
      match_num
      round
      key
      0_W
      1_W
      2_W
      3_W
      4_W
      0_L
      1_L
      2_L
      3_L
      4_L
    
  
  
    
      54
      2010-1536
      55
      F
      54
      rafael
      nadal
      None
      None
      None
      roger
      federer
      None
      None
      None
    
    
      50
      2010-1536
      51
      QF
      50
      nicolas
      almagro
      None
      None
      None
      jurgen
      melzer
      None
      None
      None
    
    
      48
      2010-1536
      49
      QF
      48
      roger
      federer
      None
      None
      None
      ernests
      gulbis
      None
      None
      None
    
    
      49
      2010-1536
      50
      QF
      49
      david
      ferrer
      None
      None
      None
      andy
      murray
      None
      None
      None
    
    
      51
      2010-1536
      52
      QF
      51
      rafael
      nadal
      None
      None
      None
      gael
      monfils
      None
      None
      None
    
  








    



o_players






    







  
    
      
      ID
      round
      match_num
      key
      0_W
      1_W
      2_W
      3_W
      4_W
      0_L
      1_L
      2_L
      3_L
      4_L
    
  
  
    
      7722
      2010-1536
      F
      55
      7722
      nadal
      r.
      None
      None
      None
      federer
      r.
      None
      None
      None
    
    
      7716
      2010-1536
      QF
      49
      7716
      almagro
      n.
      None
      None
      None
      melzer
      j.
      None
      None
      None
    
    
      7718
      2010-1536
      QF
      51
      7718
      federer
      r.
      None
      None
      None
      gulbis
      e.
      None
      None
      None
    
    
      7719
      2010-1536
      QF
      52
      7719
      ferrer
      d.
      None
      None
      None
      murray
      a.
      None
      None
      None
    
    
      7717
      2010-1536
      QF
      50
      7717
      nadal
      r.
      None
      None
      None
      monfils
      g.
      None
      None
      None

Match individual matches as follows: assuming there is a match on both tournament number and round, then

First try to match both winner and loser names
In the compliment of the result, try to match just winner name or just loser names.
Merge 1. and 2., see what's left over.



In [20]:

    
# try for an exact match on last names of both winner and loser
A = pd.merge(m_players[['tourney_id','round','key','1_W','1_L']],\
             o_players[['ID','round','key','0_W','0_L']],how='inner',\
             left_on=['tourney_id','round','1_W','1_L'],\
             right_on=['ID','round','0_W','0_L'],suffixes=['_m','_o'])

m_extras = m_players.loc[~m_players.key.isin(A.key_m),:]
o_extras = o_players.loc[~o_players.key.isin(A.key_o),:]

print("A total of {} matches down.  {} remain.".format(len(A),len(m_extras)))









    



A total of 16821 matches down.  1627 remain.



In [21]:

    
A.key_o.unique().size









    Out[21]:





16821

Take a quick peak at the remaining names to get a sense for what the issues are



In [22]:

    
IP.display.display(m_extras[0:10])
IP.display.display(o_extras[0:10])









    







  
    
      
      tourney_id
      match_num
      round
      key
      0_W
      1_W
      2_W
      3_W
      4_W
      0_L
      1_L
      2_L
      3_L
      4_L
    
  
  
    
      36
      2010-1536
      37
      R32
      36
      guillermo
      garcia
      lopez
      None
      None
      jo
      wilfried
      tsonga
      None
      None
    
    
      28
      2010-1536
      29
      R32
      28
      andy
      murray
      None
      None
      None
      juan
      ignacio
      chela
      None
      None
    
    
      6
      2010-1536
      7
      R64
      6
      juan
      ignacio
      chela
      None
      None
      pablo
      cuevas
      None
      None
      None
    
    
      11
      2010-1536
      12
      R64
      11
      eduardo
      schwank
      None
      None
      None
      paul
      henri
      mathieu
      None
      None
    
    
      57
      2010-1720
      3
      R32
      57
      olivier
      rochus
      None
      None
      None
      juan
      martin
      del
      potro
      None
    
    
      101
      2010-2276
      20
      R16
      101
      illya
      marchenko
      None
      None
      None
      ivan
      dodig
      None
      None
      None
    
    
      100
      2010-2276
      19
      R16
      100
      jurgen
      melzer
      None
      None
      None
      olivier
      rochus
      None
      None
      None
    
    
      82
      2010-2276
      1
      R32
      82
      marin
      cilic
      None
      None
      None
      jan
      hajek
      None
      None
      None
    
    
      85
      2010-2276
      4
      R32
      85
      ivo
      karlovic
      None
      None
      None
      antonio
      veic
      None
      None
      None
    
    
      127
      2010-301
      15
      R16
      127
      michael
      lammer
      None
      None
      None
      juan
      carlos
      ferrero
      None
      None
    
  








    







  
    
      
      ID
      round
      match_num
      key
      0_W
      1_W
      2_W
      3_W
      4_W
      0_L
      1_L
      2_L
      3_L
      4_L
    
  
  
    
      7707
      2010-1536
      R32
      40
      7707
      garcia
      lopez
      g.
      None
      None
      tsonga
      j.w.
      None
      None
      None
    
    
      7705
      2010-1536
      R32
      38
      7705
      murray
      a.
      None
      None
      None
      chela
      j.i.
      None
      None
      None
    
    
      7680
      2010-1536
      R64
      13
      7680
      chela
      j.i.
      None
      None
      None
      cuevas
      p.
      None
      None
      None
    
    
      7679
      2010-1536
      R64
      12
      7679
      schwank
      e.
      None
      None
      None
      mathieu
      p.h.
      None
      None
      None
    
    
      15205
      2010-1720
      R32
      11
      15205
      rochus
      o.
      None
      None
      None
      del
      potro
      j.m.
      None
      None
    
    
      2293
      2010-2276
      R16
      18
      2293
      cilic
      m.
      None
      None
      None
      hajek
      j.
      None
      None
      None
    
    
      2294
      2010-2276
      R16
      19
      2294
      karlovic
      i.
      None
      None
      None
      veic
      a.
      None
      None
      None
    
    
      2289
      2010-2276
      R32
      14
      2289
      marchenko
      i.
      None
      None
      None
      dodig
      i.
      None
      None
      None
    
    
      2291
      2010-2276
      R32
      16
      2291
      melzer
      j.
      None
      None
      None
      rochus
      o.
      None
      None
      None
    
    
      627
      2010-301
      R16
      13
      627
      lammer
      m.
      None
      None
      None
      ferrero
      j.c.
      None
      None
      None

Now match across all substrings in a name. To do so, we need a function that will return True if there is a match between one or more strings in two string lists.



In [23]:

    
def comp_str_lists(a,b):
    """checks to see if any of the strings in list a are also in list b"""
    for i in a:
        if i in b:
            return True        
    return False

We also need a function that will take each row of odds, and try to find a match for some appropriate subchunk of matches.



In [24]:

    
def comp_all_cols(o_row):
    """
    input:  row of o_players
    output: 
    """
    
    m_chunk = m_extras.loc[(m_extras.tourney_id==o_row['ID']) & (m_extras['round']==o_row['round'])]
    o_winner = list(o_row[['0_W','1_W','2_W','3_W','4_W']].dropna())
    o_loser = list(o_row[['0_L','1_L','2_L','3_L','4_L']].dropna())
    
    pairing = []
    if len(m_chunk)==0:
        print("warning:  no match/round pairing found for o_row key {}".format(o_row['key']))
        return 0
    
    for i, m_row in m_chunk.iterrows():
        
        m_winner = list(m_row[['0_W','1_W','2_W','3_W','4_W']].dropna())
        m_loser = list(m_row[['0_L','1_L','2_L','3_L','4_L']].dropna())
        
        pairing.append(comp_str_lists(o_winner,m_winner) & (comp_str_lists(o_loser,o_loser)))
    
    
    if sum(pairing) == 1:
        m_row = m_chunk.iloc[np.array(pairing),:]
        return pd.Series({'key_o':o_row['key'],'key_m':m_row['key'].iloc[0]})

    elif sum(pairing)<1:
        print("warning:  no name matches for o_row key {}".format(o_row['key']))
        return 0
    
    else:
        print("warning:  multiple name matches for o_row key {}".format(o_row['key']))
        return 0



In [25]:

    
new_matches = o_extras.apply(comp_all_cols,axis=1)









    



warning:  no name matches for o_row key 2293
warning:  no name matches for o_row key 2294
warning:  no name matches for o_row key 2289
warning:  no name matches for o_row key 2291
warning:  no name matches for o_row key 2694
warning:  no name matches for o_row key 2691
warning:  no name matches for o_row key 2896
warning:  no name matches for o_row key 2897
warning:  no name matches for o_row key 2894
warning:  no name matches for o_row key 2893
warning:  no name matches for o_row key 3943
warning:  no name matches for o_row key 8349
warning:  no name matches for o_row key 8407
warning:  no name matches for o_row key 1097
warning:  no name matches for o_row key 11325
warning:  no name matches for o_row key 11319
warning:  no name matches for o_row key 11314
warning:  multiple name matches for o_row key 18283
warning:  multiple name matches for o_row key 18288
warning:  no name matches for o_row key 15263
warning:  no name matches for o_row key 2157
warning:  no name matches for o_row key 2152
warning:  no name matches for o_row key 3158
warning:  no name matches for o_row key 1267
warning:  multiple name matches for o_row key 18304
warning:  multiple name matches for o_row key 18301

Update match list, and check progress.



In [26]:

    
new_matches = new_matches.loc[(new_matches.key_m!=0)&(new_matches.key_o!=0),:]
A = pd.concat([A[['key_m','key_o']],new_matches])

m_extras = m_players.loc[~m_players.key.isin(A.key_m),:]
o_extras = o_players.loc[~o_players.key.isin(A.key_o),:]

print("A total of {} matches down.  {} remain.".format(len(A),len(m_extras)))









    



A total of 18422 matches down.  26 remain.

Take a peak at the remaining names and see what the problems are.



In [27]:

    
IP.display.display(m_extras.sort_values('0_W')[0:10])
IP.display.display(o_extras.sort_values('1_L')[0:10])









    







  
    
      
      tourney_id
      match_num
      round
      key
      0_W
      1_W
      2_W
      3_W
      4_W
      0_L
      1_L
      2_L
      3_L
      4_L
    
  
  
    
      1257
      2010-424
      1
      R32
      1257
      andy
      roddick
      None
      None
      None
      ryler
      deheart
      None
      None
      None
    
    
      1585
      2010-499
      19
      R16
      1585
      benjamin
      becker
      None
      None
      None
      wayne
      odesnik
      None
      None
      None
    
    
      1861
      2010-520
      27
      R128
      1861
      carsten
      ball
      None
      None
      None
      philipp
      petzschner
      None
      None
      None
    
    
      1902
      2010-520
      68
      R64
      1902
      fabio
      fognini
      None
      None
      None
      gael
      monfils
      None
      None
      None
    
    
      7129
      2012-506
      15
      R32
      7129
      federico
      delbonis
      None
      None
      None
      frederico
      gil
      None
      None
      None
    
    
      7069
      2012-505
      9
      R32
      7069
      federico
      delbonis
      None
      None
      None
      nicolas
      massu
      None
      None
      None
    
    
      7078
      2012-505
      18
      R16
      7078
      federico
      delbonis
      None
      None
      None
      thomaz
      bellucci
      None
      None
      None
    
    
      3070
      2011-321
      4
      R32
      3070
      federico
      delbonis
      None
      None
      None
      florian
      mayer
      None
      None
      None
    
    
      3084
      2011-321
      18
      R16
      3084
      federico
      delbonis
      None
      None
      None
      sergiy
      stakhovsky
      None
      None
      None
    
    
      3091
      2011-321
      25
      QF
      3091
      federico
      delbonis
      None
      None
      None
      pavol
      cervenak
      None
      None
      None
    
  








    







  
    
      
      ID
      round
      match_num
      key
      0_W
      1_W
      2_W
      3_W
      4_W
      0_L
      1_L
      2_L
      3_L
      4_L
    
  
  
    
      2294
      2010-2276
      R16
      19
      2294
      karlovic
      i.
      None
      None
      None
      veic
      a.
      None
      None
      None
    
    
      3943
      2010-499
      R16
      20
      3943
      odesnik
      w.
      None
      None
      None
      becker
      b.
      None
      None
      None
    
    
      8349
      2010-520
      R128
      24
      8349
      petzschner
      p.
      None
      None
      None
      ball
      c.
      None
      None
      None
    
    
      2893
      2010-424
      R32
      15
      2893
      kohlschreiber
      p.
      None
      None
      None
      sela
      d.
      None
      None
      None
    
    
      8407
      2010-520
      R64
      82
      8407
      monfils
      g.
      None
      None
      None
      fognini
      f.
      None
      None
      None
    
    
      1267
      2012-580
      R128
      21
      1267
      darcis
      s.
      None
      None
      None
      serra
      f.
      None
      None
      None
    
    
      3158
      2012-506
      R32
      13
      3158
      del
      bonis
      f.
      None
      None
      gil
      f.
      None
      None
      None
    
    
      11314
      2011-321
      R32
      14
      11314
      del
      bonis
      f.
      None
      None
      mayer
      f.
      None
      None
      None
    
    
      2289
      2010-2276
      R32
      14
      2289
      marchenko
      i.
      None
      None
      None
      dodig
      i.
      None
      None
      None
    
    
      2691
      2010-407
      R32
      14
      2691
      soderling
      r.
      None
      None
      None
      sijsling
      i.
      None
      None
      None

Some rounds are wrong. Try re-matching both winner and loser last names, without insisting on round information.



In [28]:

    
B = pd.merge(m_extras[['tourney_id','key','1_W','1_L']],\
             o_extras[['ID','round','key','0_W','0_L']],how='inner',\
             left_on=['tourney_id','1_W','1_L'],\
             right_on=['ID','0_W','0_L'],suffixes=['_m','_o'])
A = pd.concat([A,B[['key_m','key_o']]])

m_extras = m_players.loc[~m_players.key.isin(A.key_m),:]
o_extras = o_players.loc[~o_players.key.isin(A.key_o),:]

print("A total of {} matches down.  {} remain.".format(len(A),len(m_extras)))

IP.display.display(m_extras[0:4])
IP.display.display(o_extras[0:4])









    



A total of 18431 matches down.  17 remain.






    







  
    
      
      tourney_id
      match_num
      round
      key
      0_W
      1_W
      2_W
      3_W
      4_W
      0_L
      1_L
      2_L
      3_L
      4_L
    
  
  
    
      1272
      2010-424
      16
      R32
      1272
      fernando
      verdasco
      None
      None
      None
      yen
      hsun
      lu
      None
      None
    
    
      1585
      2010-499
      19
      R16
      1585
      benjamin
      becker
      None
      None
      None
      wayne
      odesnik
      None
      None
      None
    
    
      1861
      2010-520
      27
      R128
      1861
      carsten
      ball
      None
      None
      None
      philipp
      petzschner
      None
      None
      None
    
    
      1902
      2010-520
      68
      R64
      1902
      fabio
      fognini
      None
      None
      None
      gael
      monfils
      None
      None
      None
    
  








    







  
    
      
      ID
      round
      match_num
      key
      0_W
      1_W
      2_W
      3_W
      4_W
      0_L
      1_L
      2_L
      3_L
      4_L
    
  
  
    
      2897
      2010-424
      R16
      19
      2897
      verdasco
      f.
      None
      None
      None
      lu
      y.h.
      None
      None
      None
    
    
      3943
      2010-499
      R16
      20
      3943
      odesnik
      w.
      None
      None
      None
      becker
      b.
      None
      None
      None
    
    
      8349
      2010-520
      R128
      24
      8349
      petzschner
      p.
      None
      None
      None
      ball
      c.
      None
      None
      None
    
    
      8407
      2010-520
      R64
      82
      8407
      monfils
      g.
      None
      None
      None
      fognini
      f.
      None
      None
      None

That solved some. Now try matching unusual names, ignorning rounds. This involves slightly modifying the comparison function.



In [29]:

    
def comp_all_cols_no_rounds(o_row):
    """
    input:  row of o_players
    output: 
    """
    
    m_chunk = m_extras.loc[(m_extras.tourney_id==o_row['ID'])]
    o_winner = list(o_row[['0_W','1_W','2_W','3_W','4_W']].dropna())
    o_loser = list(o_row[['0_L','1_L','2_L','3_L','4_L']].dropna())
    
    pairing = []
    if len(m_chunk)==0:
        print("warning:  no match/round pairing found for o_row key {}".format(o_row['key']))
        return 0
    
    for i, m_row in m_chunk.iterrows():
        
        m_winner = list(m_row[['0_W','1_W','2_W','3_W','4_W']].dropna())
        m_loser = list(m_row[['0_L','1_L','2_L','3_L','4_L']].dropna())
        
        pairing.append(comp_str_lists(o_winner,m_winner) & (comp_str_lists(o_loser,m_loser)))
    
    
    if sum(pairing) == 1:
        m_row = m_chunk.iloc[np.array(pairing),:]
        return pd.Series({'key_o':o_row['key'],'key_m':m_row['key'].iloc[0]})
    elif sum(pairing)<1:
        print("warning:  no name matches for o_row key {}".format(o_row['key']))
        return pd.Series({'key_o':0,'key_m':0})
    else:
        print("warning:  multiple name matches for o_row key {}".format(o_row['key']))
        print(m_chunk.iloc[np.array(pairing),:])
        return pd.Series({'key_o':0,'key_m':0})



In [30]:

    
new_matches = o_extras.apply(comp_all_cols_no_rounds,axis=1)
new_matches = new_matches.loc[(new_matches.key_m!=0)&(new_matches.key_o!=0),:]
A = pd.concat([A[['key_m','key_o']],new_matches])









    



warning:  no name matches for o_row key 3943
warning:  no name matches for o_row key 8349
warning:  no name matches for o_row key 8407
warning:  no name matches for o_row key 1097
warning:  no name matches for o_row key 11325
warning:  no name matches for o_row key 11319
warning:  no name matches for o_row key 11314
warning:  no name matches for o_row key 15263
warning:  no name matches for o_row key 2157
warning:  no name matches for o_row key 2152
warning:  no name matches for o_row key 3158
warning:  no name matches for o_row key 1267

Still some errors. Let's take a look.



In [31]:

    
m_extras = m_players.loc[~m_players.key.isin(A.key_m),:]
o_extras = o_players.loc[~o_players.key.isin(A.key_o),:]

print("A total of {} matches down.  {} remain.".format(len(A),len(m_extras)))

IP.display.display(m_extras)
IP.display.display(o_extras)









    



A total of 18436 matches down.  12 remain.






    







  
    
      
      tourney_id
      match_num
      round
      key
      0_W
      1_W
      2_W
      3_W
      4_W
      0_L
      1_L
      2_L
      3_L
      4_L
    
  
  
    
      1585
      2010-499
      19
      R16
      1585
      benjamin
      becker
      None
      None
      None
      wayne
      odesnik
      None
      None
      None
    
    
      1861
      2010-520
      27
      R128
      1861
      carsten
      ball
      None
      None
      None
      philipp
      petzschner
      None
      None
      None
    
    
      1902
      2010-520
      68
      R64
      1902
      fabio
      fognini
      None
      None
      None
      gael
      monfils
      None
      None
      None
    
    
      2410
      2010-580
      102
      R32
      2410
      lukasz
      kubot
      None
      None
      None
      mikhail
      youzhny
      None
      None
      None
    
    
      3091
      2011-321
      25
      QF
      3091
      federico
      delbonis
      None
      None
      None
      pavol
      cervenak
      None
      None
      None
    
    
      3084
      2011-321
      18
      R16
      3084
      federico
      delbonis
      None
      None
      None
      sergiy
      stakhovsky
      None
      None
      None
    
    
      3070
      2011-321
      4
      R32
      3070
      federico
      delbonis
      None
      None
      None
      florian
      mayer
      None
      None
      None
    
    
      5971
      2012-341
      15
      R16
      5971
      ivo
      karlovic
      None
      None
      None
      marcel
      granollers
      None
      None
      None
    
    
      7078
      2012-505
      18
      R16
      7078
      federico
      delbonis
      None
      None
      None
      thomaz
      bellucci
      None
      None
      None
    
    
      7069
      2012-505
      9
      R32
      7069
      federico
      delbonis
      None
      None
      None
      nicolas
      massu
      None
      None
      None
    
    
      7129
      2012-506
      15
      R32
      7129
      federico
      delbonis
      None
      None
      None
      frederico
      gil
      None
      None
      None
    
    
      7653
      2012-580
      38
      R128
      7653
      florent
      serra
      None
      None
      None
      steve
      darcis
      None
      None
      None
    
  








    







  
    
      
      ID
      round
      match_num
      key
      0_W
      1_W
      2_W
      3_W
      4_W
      0_L
      1_L
      2_L
      3_L
      4_L
    
  
  
    
      3943
      2010-499
      R16
      20
      3943
      odesnik
      w.
      None
      None
      None
      becker
      b.
      None
      None
      None
    
    
      8349
      2010-520
      R128
      24
      8349
      petzschner
      p.
      None
      None
      None
      ball
      c.
      None
      None
      None
    
    
      8407
      2010-520
      R64
      82
      8407
      monfils
      g.
      None
      None
      None
      fognini
      f.
      None
      None
      None
    
    
      1097
      2010-580
      R32
      105
      1097
      youzhny
      m.
      None
      None
      None
      kubot
      l.
      None
      None
      None
    
    
      11325
      2011-321
      QF
      25
      11325
      del
      bonis
      f.
      None
      None
      cervenak
      p.
      None
      None
      None
    
    
      11319
      2011-321
      R16
      19
      11319
      del
      bonis
      f.
      None
      None
      stakhovsky
      s.
      None
      None
      None
    
    
      11314
      2011-321
      R32
      14
      11314
      del
      bonis
      f.
      None
      None
      mayer
      f.
      None
      None
      None
    
    
      15263
      2012-341
      R16
      15
      15263
      granollers
      m.
      None
      None
      None
      karlovic
      i.
      None
      None
      None
    
    
      2157
      2012-505
      R16
      17
      2157
      del
      bonis
      f.
      None
      None
      bellucci
      t.
      None
      None
      None
    
    
      2152
      2012-505
      R32
      12
      2152
      del
      bonis
      f.
      None
      None
      massu
      n.
      None
      None
      None
    
    
      3158
      2012-506
      R32
      13
      3158
      del
      bonis
      f.
      None
      None
      gil
      f.
      None
      None
      None
    
    
      1267
      2012-580
      R128
      21
      1267
      darcis
      s.
      None
      None
      None
      serra
      f.
      None
      None
      None

Two big problems. One is 'delbonis' vs. 'del bonis'.



In [32]:

    
# solve the delbonis problem
o_extras.loc[o_extras['1_W']=='bonis',('0_W','1_W')]  = ['delbonis',None]

B = pd.merge(m_extras[['tourney_id','key','1_W','1_L']],\
             o_extras[['ID','round','key','0_W','0_L']],how='inner',\
             left_on=['tourney_id','1_W','1_L'],\
             right_on=['ID','0_W','0_L'],suffixes=['_m','_o'])
A = pd.concat([A,B[['key_m','key_o']]])

m_extras = m_players.loc[~m_players.key.isin(A.key_m),:]
o_extras = o_players.loc[~o_players.key.isin(A.key_o),:]

print("A total of {} matches down.  {} remain.".format(len(A),len(m_extras)))

IP.display.display(m_extras)
IP.display.display(o_extras)









    



A total of 18442 matches down.  6 remain.






    



/Users/ctoews/anaconda/envs/py3_6_2/lib/python3.6/site-packages/pandas/core/indexing.py:517: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s






    







  
    
      
      tourney_id
      match_num
      round
      key
      0_W
      1_W
      2_W
      3_W
      4_W
      0_L
      1_L
      2_L
      3_L
      4_L
    
  
  
    
      1585
      2010-499
      19
      R16
      1585
      benjamin
      becker
      None
      None
      None
      wayne
      odesnik
      None
      None
      None
    
    
      1861
      2010-520
      27
      R128
      1861
      carsten
      ball
      None
      None
      None
      philipp
      petzschner
      None
      None
      None
    
    
      1902
      2010-520
      68
      R64
      1902
      fabio
      fognini
      None
      None
      None
      gael
      monfils
      None
      None
      None
    
    
      2410
      2010-580
      102
      R32
      2410
      lukasz
      kubot
      None
      None
      None
      mikhail
      youzhny
      None
      None
      None
    
    
      5971
      2012-341
      15
      R16
      5971
      ivo
      karlovic
      None
      None
      None
      marcel
      granollers
      None
      None
      None
    
    
      7653
      2012-580
      38
      R128
      7653
      florent
      serra
      None
      None
      None
      steve
      darcis
      None
      None
      None
    
  








    







  
    
      
      ID
      round
      match_num
      key
      0_W
      1_W
      2_W
      3_W
      4_W
      0_L
      1_L
      2_L
      3_L
      4_L
    
  
  
    
      3943
      2010-499
      R16
      20
      3943
      odesnik
      w.
      None
      None
      None
      becker
      b.
      None
      None
      None
    
    
      8349
      2010-520
      R128
      24
      8349
      petzschner
      p.
      None
      None
      None
      ball
      c.
      None
      None
      None
    
    
      8407
      2010-520
      R64
      82
      8407
      monfils
      g.
      None
      None
      None
      fognini
      f.
      None
      None
      None
    
    
      1097
      2010-580
      R32
      105
      1097
      youzhny
      m.
      None
      None
      None
      kubot
      l.
      None
      None
      None
    
    
      15263
      2012-341
      R16
      15
      15263
      granollers
      m.
      None
      None
      None
      karlovic
      i.
      None
      None
      None
    
    
      1267
      2012-580
      R128
      21
      1267
      darcis
      s.
      None
      None
      None
      serra
      f.
      None
      None
      None

What remains is exclusively a mismatch of winner and loser. First sort and match keys, then check who really won and correct the data.



In [33]:

    
m_extras = m_extras.sort_values(['tourney_id','1_W'])
o_extras = o_extras.sort_values(['ID','0_L'])

dregs=pd.DataFrame(list(zip(m_extras['key'].values, \
                            o_extras['key'].values)),\
                        columns=['key_m','key_o'])

A = pd.concat([A,dregs[['key_m','key_o']]])

To see who really won, calculate who played the most rounds.



In [34]:

    
def find_winner(o_row):
    ID = o_row['ID']
    # nominal winner in `odds`
    name1 = o_row['0_W']
    # nominal loser in `odds`
    name2 = o_row['0_L']
    # number of rounds played by nominal winner
    rnds1 = len(odds.loc[(odds.ID==ID) & \
                         (odds.Winner.str.contains(name1) | \
                          odds.Loser.str.contains(name1)),:])
    # number of rounds played by nominal loser
    rnds2 = len(odds.loc[(odds.ID==ID) & \
                         (odds.Winner.str.contains(name2) | \
                          odds.Loser.str.contains(name2)),:])
    # if nominal winner played more rounds, `odds` is right and `matches` is wrong
    if rnds1>rnds2:
        print('Winner:  ', name1)
        return 'm'
    # otherwise, `odds` is wrong and `matches` is right.
    elif rnds1<rnds2:
        print('Winner:  ', name2)
        return 'o'
    else:
        print("function find_winner:  ambigous outcome")
        return np.nan



In [35]:

    
mistake_idx = o_extras.apply(find_winner,axis=1)









    



Winner:   becker
Winner:   ball
Winner:   fognini
Winner:   kubot
Winner:   karlovic
Winner:   serra

Correct mistakes any mistakes in either table.



In [36]:

    
# fix messed up `odds` records
o_errs = o_extras.loc[mistake_idx.values=='o',:]
if len(o_errs)!=0:
    temp = odds.loc[odds.key.isin(o_errs['key']),'Winner']
    odds.loc[odds.key.isin(o_errs['key']),'Winner']=\
        odds.loc[odds.key.isin(o_errs['key']),'Loser'].values
    odds.loc[odds.key.isin(o_errs['key']),'Loser']=temp.values

# fix messed up `matches` records
m_errs = m_extras.loc[mistake_idx.values=='m',:]
if len(m_errs)!=0:
    temp = matches.loc[matches.key.isin(o_errs['key']),'winner_name']
    matches.loc[matches.key.isin(o_errs['key']),'winner_name']=\
        matches.loc[matches.key.isin(o_errs['key']),'loser_name'].values
    matches.loc[matches.key.isin(o_errs['key']),'loser_name']=temp.values



In [37]:

    
#sanity check
print("odds has {} records".format(len(odds)))
print("our lookup table is of size {}".format(len(A)))
print("the table has {} unique keys for `matches`".format(len(A.key_m.unique())))
print("the table has {} unique keys for `odds`".format(len(A.key_o.unique())))









    



odds has 18448 records
our lookup table is of size 18448
the table has 18448 unique keys for `matches`
the table has 18448 unique keys for `odds`

Now we can assign a single key for both odds and matches which corresponds on the match level. We can also standardize match numbers within tournaments.



In [38]:

    
# take the key originally assigned to `matches` to be the main key
A.rename(columns = {'key_m':'key'},inplace=True)
# change name of `odds` key to match that in `A`
odds.rename(columns = {'key':'key_o'},inplace=True)
# add `matches` key to `odds`, and get rid of `odds` key
odds = pd.merge(odds,A,how='inner',on='key_o')
del odds['key_o']
# use the `odds` match numbers on `matches`
matches = matches.rename(columns={'match_num':'match_num_old'})
matches = pd.merge(matches,odds[['match_num','key']],how='inner',on='key')

Finally, we save our cleansed and matched datasets.



In [39]:

    
matches.to_pickle(pickle_dir + 'matches.pkl')
odds.to_pickle(pickle_dir + 'odds.pkl')



In [ ]:

	ATP	Location	Tournament	Date	Round	Winner	Loser
0	1	Brisbane	Brisbane International	2010-01-04	1st Round	Gasquet R.	Nieminen J.
34	2	Chennai	Chennai Open	2010-01-04	1st Round	Tipsarevic J.	Moya C.
33	2	Chennai	Chennai Open	2010-01-04	1st Round	Giraldo S.	Phau B.
32	2	Chennai	Chennai Open	2010-01-04	1st Round	Hajek J.	Greul S.
31	2	Chennai	Chennai Open	2010-01-04	1st Round	Cilic M.	Kunitsyn I.

	tourney_id	tourney_name	tourney_date	round	winner_name	loser_name
1411	2010-339	Brisbane	2010-01-03	R32	Andy Roddick	Peter Luczak
1440	2010-339	Brisbane	2010-01-03	SF	Radek Stepanek	Gael Monfils
1439	2010-339	Brisbane	2010-01-03	SF	Andy Roddick	Tomas Berdych
1438	2010-339	Brisbane	2010-01-03	QF	Radek Stepanek	Wayne Odesnik
1437	2010-339	Brisbane	2010-01-03	QF	Gael Monfils	James Blake

	m_name	o_name
17	casablanca	casablanca
18	chengdu	chengdu
19	chennai	chennai
20	cincinnati masters	cincinnati
21	costa do sauipe	costa do sauipe
22	delray beach	delray beach
23	doha	doha
24	dubai	dubai
25	dusseldorf	dusseldorf
58	power horse cup	dusseldorf

	ATP	fname	ID	size	round	Location	match_num	Tournament	Date	Series	...	LBL	PSW	PSL	SJW	SJL	MaxW	MaxL	AvgW	AvgL	key
0	1	2010	2010-339	31	R32	brisbane	1	Brisbane International	2010-01-04	ATP250	...	2.375	1.526	2.740	1.500	2.500	NaN	NaN	NaN	NaN	0
1	1	2010	2010-339	31	R32	brisbane	2	Brisbane International	2010-01-04	ATP250	...	1.571	2.140	1.813	2.250	1.571	NaN	NaN	NaN	NaN	1
2	1	2010	2010-339	31	R32	brisbane	3	Brisbane International	2010-01-04	ATP250	...	2.100	1.676	2.360	1.615	2.200	NaN	NaN	NaN	NaN	2
3	1	2010	2010-339	31	R32	brisbane	4	Brisbane International	2010-01-04	ATP250	...	1.500	2.580	1.581	2.500	1.500	NaN	NaN	NaN	NaN	3

	tourney_id	size	tourney_name	surface	draw_size	tourney_level	tourney_date	match_num	winner_id	winner_seed	...	l_df	l_svpt	l_1stIn	l_1stWon	l_2ndWon	l_SvGms	l_bpSaved	l_bpFaced	match_type	key
0	2010-1536	55	madrid masters	Clay	56	M	2010-05-09	1	103794	NaN	...	2.0	46.0	25.0	11.0	8.0	7.0	4.0	9.0	atp_matches	0
1	2010-1536	55	madrid masters	Clay	56	M	2010-05-09	2	104919	NaN	...	7.0	110.0	71.0	52.0	19.0	17.0	1.0	4.0	atp_matches	1
2	2010-1536	55	madrid masters	Clay	56	M	2010-05-09	3	104527	15.0	...	1.0	62.0	33.0	17.0	11.0	10.0	3.0	9.0	atp_matches	2
3	2010-1536	55	madrid masters	Clay	56	M	2010-05-09	4	104022	10.0	...	1.0	73.0	45.0	29.0	12.0	12.0	5.0	10.0	atp_matches	3

	tourney_id	match_num	round	key	0_W	1_W	2_W	3_W	4_W	0_L	1_L	2_L	3_L	4_L
54	2010-1536	55	F	54	rafael	nadal	None	None	None	roger	federer	None	None	None
50	2010-1536	51	QF	50	nicolas	almagro	None	None	None	jurgen	melzer	None	None	None
48	2010-1536	49	QF	48	roger	federer	None	None	None	ernests	gulbis	None	None	None
49	2010-1536	50	QF	49	david	ferrer	None	None	None	andy	murray	None	None	None
51	2010-1536	52	QF	51	rafael	nadal	None	None	None	gael	monfils	None	None	None

	ID	round	match_num	key	0_W	1_W	2_W	3_W	4_W	0_L	1_L	2_L	3_L	4_L
7722	2010-1536	F	55	7722	nadal	r.	None	None	None	federer	r.	None	None	None
7716	2010-1536	QF	49	7716	almagro	n.	None	None	None	melzer	j.	None	None	None
7718	2010-1536	QF	51	7718	federer	r.	None	None	None	gulbis	e.	None	None	None
7719	2010-1536	QF	52	7719	ferrer	d.	None	None	None	murray	a.	None	None	None
7717	2010-1536	QF	50	7717	nadal	r.	None	None	None	monfils	g.	None	None	None

	tourney_id	match_num	round	key	0_W	1_W	2_W	3_W	4_W	0_L	1_L	2_L	3_L	4_L
36	2010-1536	37	R32	36	guillermo	garcia	lopez	None	None	jo	wilfried	tsonga	None	None
28	2010-1536	29	R32	28	andy	murray	None	None	None	juan	ignacio	chela	None	None
6	2010-1536	7	R64	6	juan	ignacio	chela	None	None	pablo	cuevas	None	None	None
11	2010-1536	12	R64	11	eduardo	schwank	None	None	None	paul	henri	mathieu	None	None
57	2010-1720	3	R32	57	olivier	rochus	None	None	None	juan	martin	del	potro	None
101	2010-2276	20	R16	101	illya	marchenko	None	None	None	ivan	dodig	None	None	None
100	2010-2276	19	R16	100	jurgen	melzer	None	None	None	olivier	rochus	None	None	None
82	2010-2276	1	R32	82	marin	cilic	None	None	None	jan	hajek	None	None	None
85	2010-2276	4	R32	85	ivo	karlovic	None	None	None	antonio	veic	None	None	None
127	2010-301	15	R16	127	michael	lammer	None	None	None	juan	carlos	ferrero	None	None

	ID	round	match_num	key	0_W	1_W	2_W	3_W	4_W	0_L	1_L	2_L	3_L	4_L
7707	2010-1536	R32	40	7707	garcia	lopez	g.	None	None	tsonga	j.w.	None	None	None
7705	2010-1536	R32	38	7705	murray	a.	None	None	None	chela	j.i.	None	None	None
7680	2010-1536	R64	13	7680	chela	j.i.	None	None	None	cuevas	p.	None	None	None
7679	2010-1536	R64	12	7679	schwank	e.	None	None	None	mathieu	p.h.	None	None	None
15205	2010-1720	R32	11	15205	rochus	o.	None	None	None	del	potro	j.m.	None	None
2293	2010-2276	R16	18	2293	cilic	m.	None	None	None	hajek	j.	None	None	None
2294	2010-2276	R16	19	2294	karlovic	i.	None	None	None	veic	a.	None	None	None
2289	2010-2276	R32	14	2289	marchenko	i.	None	None	None	dodig	i.	None	None	None
2291	2010-2276	R32	16	2291	melzer	j.	None	None	None	rochus	o.	None	None	None
627	2010-301	R16	13	627	lammer	m.	None	None	None	ferrero	j.c.	None	None	None

	tourney_id	match_num	round	key	0_W	1_W	2_W	3_W	4_W	0_L	1_L	2_L	3_L	4_L
1257	2010-424	1	R32	1257	andy	roddick	None	None	None	ryler	deheart	None	None	None
1585	2010-499	19	R16	1585	benjamin	becker	None	None	None	wayne	odesnik	None	None	None
1861	2010-520	27	R128	1861	carsten	ball	None	None	None	philipp	petzschner	None	None	None
1902	2010-520	68	R64	1902	fabio	fognini	None	None	None	gael	monfils	None	None	None
7129	2012-506	15	R32	7129	federico	delbonis	None	None	None	frederico	gil	None	None	None
7069	2012-505	9	R32	7069	federico	delbonis	None	None	None	nicolas	massu	None	None	None
7078	2012-505	18	R16	7078	federico	delbonis	None	None	None	thomaz	bellucci	None	None	None
3070	2011-321	4	R32	3070	federico	delbonis	None	None	None	florian	mayer	None	None	None
3084	2011-321	18	R16	3084	federico	delbonis	None	None	None	sergiy	stakhovsky	None	None	None
3091	2011-321	25	QF	3091	federico	delbonis	None	None	None	pavol	cervenak	None	None	None

	tourney_id	match_num	round	key	0_W	1_W	2_W	3_W	4_W	0_L	1_L	2_L	3_L	4_L
1272	2010-424	16	R32	1272	fernando	verdasco	None	None	None	yen	hsun	lu	None	None
1585	2010-499	19	R16	1585	benjamin	becker	None	None	None	wayne	odesnik	None	None	None
1861	2010-520	27	R128	1861	carsten	ball	None	None	None	philipp	petzschner	None	None	None
1902	2010-520	68	R64	1902	fabio	fognini	None	None	None	gael	monfils	None	None	None

	ID	round	match_num	key	0_W	1_W	2_W	3_W	4_W	0_L	1_L	2_L	3_L	4_L
2897	2010-424	R16	19	2897	verdasco	f.	None	None	None	lu	y.h.	None	None	None
3943	2010-499	R16	20	3943	odesnik	w.	None	None	None	becker	b.	None	None	None
8349	2010-520	R128	24	8349	petzschner	p.	None	None	None	ball	c.	None	None	None
8407	2010-520	R64	82	8407	monfils	g.	None	None	None	fognini	f.	None	None	None