Capstone 1 Data Wrangling Project

Data Acquisition Summary

A set of .csv files provided for the Kaggle March Machine Learning Mania contest (hereafter referred to as Kaggle data) was downloaded from the Kaggle website (www.kaggle.com). From a college basketball team ratings website (www.kenpom.com) I downloaded a set of .csv files with the final pre-tournament scores on team-level efficiency metrics (Kenpom data). Using the Python package BeautifulSoup, I wrote a script to scrape over 3,000 tables of player-level data from two separate sports statistics websites, www.espn.com and www.sports-reference.com. Scraping was also used to obtain lists of annual player-level All-American award winners and annual team-level data (strength of schedule, etc.) from www.sports-reference.com.
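The scraping itself is ordinary table extraction. As a minimal self-contained illustration of the idea (using the stdlib html.parser in place of BeautifulSoup, and a made-up table, so this is a sketch rather than the actual scraping script), the core task is to walk the markup and collect each row's cell text:

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the cell text of every <tr>/<td>/<th> row in an HTML table."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

# hypothetical table content, standing in for a scraped stats page
html = """<table>
<tr><th>Player</th><th>PPG</th></tr>
<tr><td>Anthony Davis</td><td>14.2</td></tr>
</table>"""
parser = TableParser()
parser.feed(html)
# parser.rows -> [['Player', 'PPG'], ['Anthony Davis', '14.2']]
```

In the real script, BeautifulSoup's `find_all('table')` plays the role of this parser, and the collected rows are written to the .csv files loaded below.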

Data Wrangling

The variety of data required a sequence of data wrangling processes to produce the final set of features and labels for the machine learning phase. Several of these data wrangling processes are described below.


In [1]:
# importing packages for wrangling tasks
import pandas as pd
import numpy as np
import re
from fuzzywuzzy import process
from fuzzywuzzy import fuzz
from geopy.distance import great_circle

# create a function to quickly tabulate a dataframe column
def tab(dfcol):
    t = pd.crosstab(index=dfcol, columns="count")
    print(t)
    print(t / t.sum())
    
pd.set_option('display.max_columns', None)

Fuzzy matching player names

Different player-level data were available in the two sets of roster files (sportsreference and espn), which required linking the files player-to-player before computing team features.


In [2]:
# load the individual .csv files as pandas dataframes
dpath = r"C:\Users\mworley\Dropbox\capstone\data"
pstats = pd.read_csv(dpath + r'\external\player_data\player_stats.csv')
ros = pd.read_csv(dpath + r'\external\rosters\rosters.csv')
espn = pd.read_csv(dpath + r'\external\espn\espn_tabs.csv')

# merge the two sportsreference files, player stats (pstats) and positions (ros)
sr_players = pd.merge(pstats, ros, how='inner',
                      on=['Player', 'Team', 'Season'])

After loading the separate sportsreference and espn files, I add some common team identifiers to each dataframe (created in a separate process) to make it easier to match the players.


In [3]:
# load the team identifying data
teams_seasons = pd.read_csv(dpath + r'\interim\teams_seasons.csv')
team_match = pd.read_csv(dpath + r'\interim\team_match.csv')
team_info = pd.merge(teams_seasons, team_match,
                     how='inner', on='team_id')

# merge team identifier data with player dataframes
sr_players = pd.merge(sr_players, team_info, how='inner',
						left_on=['Team', 'Season'],
						right_on=['srname', 'season'])
espn_players = pd.merge(espn, team_info, how='inner',
						left_on=['espn_id', 'Season'],
						right_on=['espn_id', 'season'])

# keep only columns I need to match players
sr_players = sr_players.loc[:, ['Player', 'srname', 'Season', 'espn_id']]
sr_players.drop_duplicates(inplace=True)
espn_players = espn_players.loc[:, ['Player', 'srname', 'Season', 'espn_id']]
espn_players.drop_duplicates(inplace=True)

# keep only years after 2001 in sportsreference file to match with espn player data
sr_players = sr_players[sr_players['Season'] > 2001]

To match the players in the two different files, I first used a merge with an outer join and an indicator to examine the matches and non-matches after the merge.


In [4]:
mrg_players = pd.merge(sr_players, espn_players,
						how='outer',
					   	on=['Player', 'srname' , 'Season', 'espn_id'],
						indicator=True)

Here I use the tab function created earlier to inspect the proportion of names that were perfect matches and linked by this merge.


In [5]:
tab(mrg_players['_merge'])


col_0       count
_merge           
left_only     547
right_only    593
both        12950
col_0          count
_merge              
left_only   0.038822
right_only  0.042087
both        0.919092

This merge left around 8% of names unmatched. I needed to separate the nonmatched names, determine why they didn't match, and figure out a way to link them together.

To get a snapshot of the causes of the nonmatches, let's look at a snippet of 10 rows.


In [6]:
nomatch = mrg_players[mrg_players['_merge'] != "both"].copy()
nomatch.sort_values(['srname', 'Season'], inplace=True)
nomatch.head(10)


Out[6]:
Player srname Season espn_id _merge
31 Brett Mcknight akron 2009 2006.0 left_only
13497 Brett McKnight akron 2009 2006.0 right_only
46 Brett Mcknight akron 2011 2006.0 left_only
13498 Brett McKnight akron 2011 2006.0 right_only
68 Adedeji Ibitayo akron 2013 2006.0 left_only
13499 Deji Ibitayo akron 2013 2006.0 right_only
76 Mo Williams alabama 2002 333.0 left_only
13500 Maurice Williams alabama 2002 333.0 right_only
11537 Mo Williams alabama 2003 333.0 left_only
13979 Maurice Williams alabama 2003 333.0 right_only

The same players are present in both files, but their names vary due to punctuation, spelling, or the use of nicknames.
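Some of these variants could be resolved by simple string normalization before any fuzzy matching. A quick sketch (not part of the original pipeline, which goes straight to fuzzy matching):

```python
import re

def normalize(name):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(cleaned.split())

# case and punctuation variants collapse to the same key
assert normalize("Brett Mcknight") == normalize("Brett McKnight")
assert normalize("Baye Moussa Keita") == normalize("Baye-Moussa Keita")
# but nickname and spelling variants do not, so fuzzy matching is still needed
assert normalize("Mo Williams") != normalize("Maurice Williams")
```

This handles the McKnight-style cases, but the Williams-style nickname cases motivate the fuzzy approach below.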

To match the remaining names, I used the Python package fuzzywuzzy. I used its extractOne function (process.extractOne), which compares a given string to a supplied list of strings and extracts the one "best match" from the list. The function returns a tuple of the best match and a score that reflects the accuracy of the match.

Here I test the function using names from the table above. I pass it a string to be matched ('Mo Williams'), and a list of 2 strings, the right match ('Maurice Williams') and a different player ('Deji Ibetayo').


In [7]:
process.extractOne('Mo Williams', ['Maurice Williams', 'Deji Ibetayo'])


Out[7]:
('Maurice Williams', 80)

To match all the names, my first strategy was to use the function out-of-the-box and simply pass it all the nonmatched names, but this didn't work well. Here's an example of why:


In [8]:
process.extractOne('Mo Williams', ['Maurice Williams', 'John Williams'])


Out[8]:
('John Williams', 83)

If the function can find a name that is more syntactically similar, it will extract it and miss the correct match. For my purposes, the function might work better if it had fewer options, perhaps a more restrictive list specific to that player.
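This failure mode is easy to reproduce. Here is a sketch using the stdlib difflib as a stand-in for fuzzywuzzy (the scores differ, but the behavior is the same), showing how restricting the candidate list fixes the problem:

```python
import difflib

# all non-matched espn names pooled together: the decoy wins on raw similarity
pooled = ["Maurice Williams", "John Williams", "Deji Ibitayo"]
best_pooled = difflib.get_close_matches("Mo Williams", pooled, n=1)[0]
# best_pooled == "John Williams", not the true match

# candidates keyed by (team, season): only teammates from the same year compete
by_team_season = {("alabama", 2002): ["Maurice Williams"]}
best_restricted = difflib.get_close_matches(
    "Mo Williams", by_team_season[("alabama", 2002)], n=1, cutoff=0.5)[0]
# best_restricted == "Maurice Williams"
```

Keying the candidate lists by (team, season) is exactly the restriction used in the function below.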

To solve this problem, I wrote a function that identifies each player's team and season and extracts a match only from espn names on that same team and season.


In [9]:
# create dataframe of non-matched player names, separately for each source
nomatch_sr = nomatch[nomatch._merge == "left_only"].copy()
nomatch_sr.drop('_merge', inplace=True, axis=1)
nomatch_espn = nomatch[nomatch._merge == "right_only"].copy()
nomatch_espn.drop('_merge', inplace=True, axis=1)

# group by team and season, create dictionary of non-matched espn names to use in the function
e = nomatch_espn.groupby(['srname','Season'])['Player']
espn_dict = dict(list(e))

# write the function to selectively match using the player's team and season
plist = []
def match_name_team(row):
    p, t, s = row['Player'], row['srname'], row['Season']
    candidates = espn_dict.get((t, s))
    # skip players with no non-matched espn names for that team and season
    if candidates is None:
        return
    n, scr = process.extractOne(p, candidates)
    plist.append((p, t, s, n, scr))
    

# apply the function to the nonmatched sportsreference player dataframe
nomatch_sr.apply(match_name_team, axis=1)
df = pd.DataFrame(plist, columns=('Player', 'srname', 'Season',
                 'Player_espn', 'score'))

After running the function, I can inspect the matched names to assess the quality of the matches. Here are the 10 matches with the highest scores.


In [10]:
df.sort_values('score', ascending=False).head(10)


Out[10]:
Player srname Season Player_espn score
0 Brett Mcknight akron 2009 Brett McKnight 100
462 Malcolm Riley texas-southern 2015 Malcolm Riley 100
72 Jaquon Parker cincinnati 2012 JaQuon Parker 100
71 Jaquon Parker cincinnati 2011 JaQuon Parker 100
424 Baye Moussa Keita syracuse 2013 Baye-Moussa Keita 100
426 Baye Moussa Keita syracuse 2014 Baye-Moussa Keita 100
373 Joey Lepak saint-louis 2014 Joey LePak 100
429 Rafael Deleon temple 2008 Rafael DeLeon 100
370 Joey Lepak saint-louis 2013 Joey LePak 100
62 Roger Moute a Bidias california 2016 Roger Moute A Bidias 100

These are names with minor differences in spelling or punctuation that are all matched well by the function.

Here are the names of players with the lowest scores, all those below 65.


In [11]:
# inspect low-scoring matches
df[df['score'] < 65].sort_values('score', ascending=False)


Out[11]:
Player srname Season Player_espn score
22 Darius Gardner american 2014 Pee Wee Gardner 64
255 Trantell Knight middle-tennessee 2013 Tweety Knight 64
256 J.R. Lyle milwaukee 2014 JeVon Lyle 64
363 Stephan Bennett robert-morris 2015 Stephan Hawkins 61
438 Levan Shawn Alston temple 2016 Shizz Alston Jr. 58
514 Phillip Perry wisconsin 2007 Mickey Perry 56
35 Daylon Guy arkansas-little-rock 2011 Chuck Guy 48
265 Brent Arrington mississippi-valley-state 2012 Roger Hubbard Jr. 39
233 Josh Heath miami-fl 2016 James Palmer Jr. 36
456 Randall Jackson texas-southern 2003 Michael Sneed 36
431 Scootie Randall temple 2009 Rafael DeLeon 36
374 Royce Simpson saint-louis 2014 Joey LePak 35
107 Walter Pitchford florida 2012 Erik Jay Murphy 32
461 Eric Washington texas-southern 2014 Ray Penn Jr. 31
395 Beau Brown south-dakota-state 2016 Adam Dykman 29
372 Austin Eagleton saint-louis 2014 Mike McCall Jr. 28
444 John Johnston texas-am 2006 Kenneth 'Red' White 26
459 Edward Chung texas-southern 2014 Ray Penn Jr. 26
460 Bobby Griffin texas-southern 2014 Ray Penn Jr. 25
396 Skyler Flatten south-dakota-state 2016 Adam Dykman 24
458 Davonte Chaney texas-southern 2014 Ray Penn Jr. 24
170 Mark Johnson indiana 2002 Scott May 19
90 Terrell White davidson 2002 Puff Summïss 8

In [12]:
# everything above 50 looks right, how many names are below 50?
len(df[df.score < 50])


Out[12]:
17

Below 50 looks like a reasonable cutoff, as pairs scoring lower may not actually be the same players. Inspecting a handful of these revealed several players who existed in only one file, typically because they were on the official roster but never logged any playing time. So they mostly appear to be true nonmatches.


In [13]:
# keep only matches scoring above 50
df_c50 = df.loc[df.score > 50].copy()

In [14]:
# combine the perfect matches and fuzzy matches into one dataframe
df_c50.drop('score', inplace=True, axis=1)
match = mrg_players[mrg_players['_merge'] == "both"].copy()
match.drop(['espn_id', '_merge'], inplace=True, axis=1)
match['Player_espn'] = match.Player
player_match = pd.concat([match, df_c50], ignore_index=True)

The result is a table containing each player's name in the sportsreference source and the espn source, even when different names are used. I can test the effectiveness of the matched names by attempting to merge the two sources of data:


In [16]:
# re-create sportsreference and espn player dataframes
sr_players = pd.merge(pstats, ros, how='inner',
                      on=['Player', 'Team', 'Season'])
espn = pd.read_csv(dpath + r'\external\espn\espn_tabs.csv')

# merge team identifier data with player dataframes
sr_players = pd.merge(sr_players, team_info, how='inner',
						left_on=['Team', 'Season'],
						right_on=['srname', 'season'])
espn_players = pd.merge(espn, team_info, how='inner',
						left_on=['espn_id', 'Season'],
						right_on=['espn_id', 'season'])

# keep only years after 2001 in sportsreference file to match with espn player data
sr_players = sr_players[sr_players['Season'] > 2001]

# merge sportsreference file with the player name matches
mrg1 = pd.merge(sr_players, player_match, how='inner',
				on=['Player', 'srname', 'Season'])
mrg2 = pd.merge(mrg1, espn_players, how='outer',
               left_on=['Player_espn', 'srname', 'Season'],
               right_on=['Player', 'srname', 'season'],
               indicator=True)

In [17]:
tab(mrg2._merge)


col_0       count
_merge           
left_only       0
right_only     75
both        13477
col_0          count
_merge              
left_only   0.000000
right_only  0.005534
both        0.994466

Above I can see that 99.4% of the player data is now linked after this merge.

Processing Roster data

After matching on player names, the roster data file contains each player’s descriptive information (class, position, height) and in-game performance over the course of a season (minutes played, points scored, rebounds, assists, etc.).


In [18]:
player_match = pd.read_csv(dpath + r'\interim\player_match.csv')
players = pd.merge(pstats, ros, how='inner',
                   on=['Player', 'Team', 'Season'])
players = pd.merge(players, player_match, how='outer',
                    left_on=['Player', 'Season', 'Team'],
                    right_on=['Player', 'Season', 'srname'])
players.drop('srname', inplace=True, axis=1)
ncols = ['Player_espn', 'GPes', 'MINes', 'PPGes', 'RPGes', 'APGes', 'SPGes',
        'BPGes', 'TPGes', 'FGPCTes', 'FTPCTes', '3PTPCTes', 'Teames',
        'espn_id', 'Season']
espn.columns = ncols
players = pd.merge(players, espn, how='outer',
                    left_on=['Player_espn', 'Season'],
                    right_on=['Player_espn', 'Season'])
players = players.dropna(subset = ['Team'])
players = pd.merge(players, team_info, how='inner',
                    left_on=['Team', 'Season'],
                    right_on=['srname', 'season'],
                    indicator=True)

Here are a few sample rows of these data, taken from the 2011-2012 Kentucky Wildcats.


In [77]:
mask = (players.team_id == 1246) & (players.Season == 2012)
players[mask]


Out[77]:
# 2P 2P% 2PA 3P 3P% 3PA 3PTPCTes APGes AST BLK BPGes Class FG FG% FGA FGPCTes FT FT% FTA FTPCTes G GPes Height Heightnum MINes MP PF PPGes PTS Player Player_espn Pos RPGes SPGes STL Season TOV TPGes TRB Team Teames _merge espn_id_x espn_id_y match phgtmins pminpct season srname starter team_id tmins exp
771 18524 NaN 3.0 0.438 6.8 0.7 0.325 2.0 0.325 4.8 4.8 0.3 0.3 FR 3.6 0.412 8.8 0.412 2.1 0.714 3.0 0.714 40.0 40.0 6-2 74.0 32.6 32.6 2.3 10.0 10.0 Marquis Teague Marquis Teague G 2.5 0.9 0.9 2012 2.7 2.7 2.5 kentucky Kentucky Wildcats both 96.0 96.0 match 11.537064 0.155906 2012 kentucky yes 1246 209.1 0.0
18520 NaN 5.2 0.653 7.9 0.1 0.150 0.5 0.150 1.3 1.3 4.7 4.7 FR 5.3 0.623 8.4 0.623 3.6 0.709 5.1 0.709 40.0 40.0 6-10 82.0 32.0 32.0 2.0 14.2 14.2 Anthony Davis Anthony Davis F 10.4 1.4 1.4 2012 1.0 1.0 10.4 kentucky Kentucky Wildcats both 96.0 96.0 match 12.549020 0.153037 2012 kentucky yes 1246 209.1 0.0
18521 NaN 2.5 0.481 5.2 1.9 0.466 4.1 0.466 1.5 1.5 0.1 0.1 SO 4.4 0.474 9.2 0.474 3.1 0.826 3.7 0.826 40.0 40.0 6-4 76.0 31.2 31.2 1.8 13.7 13.7 Doron Lamb Doron Lamb G 2.7 0.5 0.5 2012 1.1 1.1 2.7 kentucky Kentucky Wildcats both 96.0 96.0 match 11.340029 0.149211 2012 kentucky yes 1246 209.1 1.0
18523 NaN 3.7 0.535 6.9 0.3 0.255 1.3 0.255 1.9 1.9 0.9 0.9 FR 4.0 0.491 8.2 0.491 3.6 0.745 4.8 0.745 40.0 40.0 6-7 79.0 31.1 31.1 2.4 11.9 11.9 Michael Kidd-Gilchrist Michael Kidd-Gilchrist F 7.4 1.0 1.0 2012 2.2 2.2 7.4 kentucky Kentucky Wildcats both 96.0 96.0 match 11.749880 0.148733 2012 kentucky yes 1246 209.1 0.0
18522 NaN 4.2 0.528 8.0 0.4 0.327 1.3 0.327 1.3 1.3 1.8 1.8 SO 4.7 0.500 9.3 0.500 2.6 0.627 4.2 0.627 38.0 38.0 6-9 81.0 29.3 29.3 2.4 12.3 12.3 Terrence Jones Terrence Jones F 7.2 1.3 1.3 2012 1.6 1.6 7.2 kentucky Kentucky Wildcats both 96.0 96.0 match 11.350072 0.140124 2012 kentucky yes 1246 209.1 1.0
18525 NaN 2.2 0.569 3.8 1.4 0.376 3.7 0.376 2.1 2.1 0.3 0.3 SR 3.6 0.474 7.6 0.474 1.4 0.797 1.7 0.797 40.0 40.0 6-8 80.0 26.1 26.1 2.2 9.9 9.9 Darius Miller Darius Miller G 2.8 0.8 0.8 2012 1.5 1.5 2.8 kentucky Kentucky Wildcats both 96.0 96.0 match 9.985653 0.124821 2012 kentucky no 1246 209.1 3.0
18526 NaN 0.9 0.444 2.0 0.9 0.432 2.0 0.432 0.4 0.4 0.4 0.4 FR 1.8 0.438 4.1 0.438 0.6 0.815 0.7 0.815 40.0 40.0 6-10 82.0 11.6 11.6 1.1 5.0 5.0 Kyle Wiltjer Kyle Wiltjer F 1.8 0.1 0.1 2012 0.7 0.7 1.8 kentucky Kentucky Wildcats both 96.0 96.0 match 4.549020 0.055476 2012 kentucky no 1246 209.1 0.0
18528 NaN 0.4 0.343 1.1 0.0 0.000 0.0 0.000 0.1 0.1 0.3 0.3 SR 0.4 0.333 1.1 0.333 0.1 0.308 0.4 0.308 33.0 33.0 6-11 83.0 6.0 6.0 0.8 0.8 0.8 Eloy Vargas Eloy Vargas F 1.7 0.1 0.1 2012 0.2 0.2 1.7 kentucky Kentucky Wildcats both 96.0 96.0 match 2.381636 0.028694 2012 kentucky no 1246 209.1 3.0
18530 NaN 0.0 0.000 0.1 0.0 0.000 0.1 0.000 0.1 0.1 0.0 0.0 SO 0.0 0.000 0.2 0.000 0.1 0.250 0.4 0.250 11.0 11.0 6-2 74.0 2.8 2.8 0.4 0.1 0.1 Jarrod Polson Jarrod Polson G 0.4 0.2 0.2 2012 0.5 0.5 0.4 kentucky Kentucky Wildcats both 96.0 96.0 match 0.990913 0.013391 2012 kentucky no 1246 209.1 1.0
18531 NaN 0.0 0.000 0.1 0.0 NaN 0.0 0.000 0.1 0.1 0.0 0.0 JR 0.0 0.000 0.1 0.000 0.0 NaN 0.0 0.000 16.0 16.0 6-5 77.0 2.8 2.8 0.0 0.0 0.0 Twany Beckham Twany Beckham G 0.5 0.1 0.1 2012 0.1 0.1 0.5 kentucky Kentucky Wildcats both 96.0 96.0 match 1.031086 0.013391 2012 kentucky no 1246 209.1 2.0
18527 NaN 0.5 0.500 1.0 0.0 NaN 0.0 0.000 0.2 0.2 0.0 0.0 FR 0.5 0.500 1.0 0.500 0.0 NaN 0.0 0.000 6.0 6.0 5-11 71.0 2.2 2.2 0.0 1.0 1.0 Sam Malone Sam Malone G 0.3 0.0 0.0 2012 0.7 0.7 0.3 kentucky Kentucky Wildcats both 96.0 96.0 match 0.747011 0.010521 2012 kentucky no 1246 209.1 0.0
18529 NaN 0.0 0.000 0.1 0.0 NaN 0.0 0.000 0.0 0.0 0.0 0.0 FR 0.0 0.000 0.1 0.000 0.2 0.500 0.3 0.500 12.0 12.0 5-9 69.0 1.4 1.4 0.0 0.2 0.2 Brian Long Brian Long G 0.2 0.0 0.0 2012 0.0 0.0 0.2 kentucky Kentucky Wildcats both 96.0 96.0 match 0.461980 0.006695 2012 kentucky no 1246 209.1 0.0

These data represent season-long averages for individual players in several game-play statistics tracked in basketball, such as minutes played, points, rebounds, and assists. Because my goal is to use this data to predict outcomes of NCAA tournament games, I need to transform these data into team-level features. Below are some of the obstacles in this task and the solutions I created.

Player height is stored as a string, and I need to convert it to numeric


In [80]:
players.loc[mask, ['Player', 'Height']]


Out[80]:
Player Height
771 18524 Marquis Teague 6-2
18520 Anthony Davis 6-10
18521 Doron Lamb 6-4
18523 Michael Kidd-Gilchrist 6-7
18522 Terrence Jones 6-9
18525 Darius Miller 6-8
18526 Kyle Wiltjer 6-10
18528 Eloy Vargas 6-11
18530 Jarrod Polson 6-2
18531 Twany Beckham 6-5
18527 Sam Malone 5-11
18529 Brian Long 5-9

In [21]:
# change series from object data type to string
players['Height'] = players['Height'].astype(str)

# create a function that converts string height (e.g. '6-2') to total inches
def inches(height):
    try:
        f, i = height.split('-')
        return int(f) * 12 + int(i)
    except ValueError:
        return np.nan

players['Heightnum'] = players['Height'].apply(inches)
players['Heightnum'].head()


Out[21]:
0    80.0
1    75.0
2    77.0
3    76.0
4    79.0
Name: Heightnum, dtype: float64

In [81]:
players.loc[mask, ['Player', 'Height', 'Heightnum']]


Out[81]:
Player Height Heightnum
771 18524 Marquis Teague 6-2 74.0
18520 Anthony Davis 6-10 82.0
18521 Doron Lamb 6-4 76.0
18523 Michael Kidd-Gilchrist 6-7 79.0
18522 Terrence Jones 6-9 81.0
18525 Darius Miller 6-8 80.0
18526 Kyle Wiltjer 6-10 82.0
18528 Eloy Vargas 6-11 83.0
18530 Jarrod Polson 6-2 74.0
18531 Twany Beckham 6-5 77.0
18527 Sam Malone 5-11 71.0
18529 Brian Long 5-9 69.0

Now height is numeric and ready for conversion to a team-level feature. In basketball, greater height is usually an advantage, so I'm interested in a feature that describes the overall height of the team. However, the simple team average is not a good solution: backups who rarely play would be weighted the same as players who play most of the game, and my goal is a feature that describes the height of the team during actual game play.
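A toy example (made-up numbers, not from the real data) shows the difference between the two approaches:

```python
# three hypothetical players: height in inches and minutes per game
heights = [84, 76, 72]
minutes = [35.0, 30.0, 5.0]

simple_avg = sum(heights) / len(heights)
# ~77.3: the rarely-used 72-inch reserve drags the average down

total_mins = sum(minutes)
weighted_avg = sum(h * m / total_mins for h, m in zip(heights, minutes))
# ~79.7: closer to the height actually on the floor during games
```

The minutes-weighted figure better reflects the lineup opponents actually face.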

To address this problem, I created a function that calculates each team's total minutes, then computed each player's percentage of those minutes, and created a column representing each player's height weighted by his minute percentage. Summing this column gives the team-level feature: total minutes-weighted height.


In [ ]:
def team_minutes(group):
    # minutes played data only available after 2001
    if group.name[1] > 2001:
        group['tmins'] = group['MINes'].sum()
    return group

players = players.groupby(['team_id', 'Season'], as_index=False).apply(team_minutes)
players['pminpct'] = players.MINes / players.tmins
players['phgtmins'] = players.pminpct * players.Heightnum

In [82]:
players.loc[mask, ['Player', 'Heightnum', 'pminpct', 'phgtmins']]


Out[82]:
Player Heightnum pminpct phgtmins
771 18524 Marquis Teague 74.0 0.155906 11.537064
18520 Anthony Davis 82.0 0.153037 12.549020
18521 Doron Lamb 76.0 0.149211 11.340029
18523 Michael Kidd-Gilchrist 79.0 0.148733 11.749880
18522 Terrence Jones 81.0 0.140124 11.350072
18525 Darius Miller 80.0 0.124821 9.985653
18526 Kyle Wiltjer 82.0 0.055476 4.549020
18528 Eloy Vargas 83.0 0.028694 2.381636
18530 Jarrod Polson 74.0 0.013391 0.990913
18531 Twany Beckham 77.0 0.013391 1.031086
18527 Sam Malone 71.0 0.010521 0.747011
18529 Brian Long 69.0 0.006695 0.461980

In [22]:
flrmins = players.groupby(['team_id', 'Season'])['phgtmins'].sum().reset_index()

In [23]:
flrmins.sort_values(['Season', 'phgtmins'], ascending=False).head(5)


Out[23]:
team_id Season phgtmins
892 1268 2016 78.900609
405 1181 2016 78.849138
563 1211 2016 78.806509
212 1143 2016 78.638099
1818 1425 2016 78.568829

I'm also interested in how scoring is distributed across different groups of players, such as starters vs. bench players, or guards vs. forwards.

The player data doesn't indicate whether a player is a "starter" (one of the 5 players on the floor when the game begins), but typically the starters are the 5 players who play the most minutes. I use minutes played to determine whether a player is a starter.


In [24]:
def get_starters(group):
    # minutes played data only available after 2001
    if group.name[1] > 2001:
        group.sort_values('MINes', ascending=False, inplace=True)
        group['starter'] = 'no'
        i = group.columns.get_loc('starter')
        group.iloc[0:5, i] = 'yes'
    return group

players = players.groupby(['team_id', 'Season'], as_index=False).apply(get_starters)

In [75]:
mask = (players.team_id == 1246) & (players.Season == 2012)
players.loc[mask, ['Player', 'MINes', 'starter', 'PTS']]


Out[75]:
Player MINes starter PTS
771 18524 Marquis Teague 32.6 yes 10.0
18520 Anthony Davis 32.0 yes 14.2
18521 Doron Lamb 31.2 yes 13.7
18523 Michael Kidd-Gilchrist 31.1 yes 11.9
18522 Terrence Jones 29.3 yes 12.3
18525 Darius Miller 26.1 no 9.9
18526 Kyle Wiltjer 11.6 no 5.0
18528 Eloy Vargas 6.0 no 0.8
18530 Jarrod Polson 2.8 no 0.1
18531 Twany Beckham 2.8 no 0.0
18527 Sam Malone 2.2 no 1.0
18529 Brian Long 1.4 no 0.2

Now I can use the starter column to compute some interesting team-level features, such as the percentage of points on the team scored by the bench and the starters.


In [27]:
benscr = players.groupby(['team_id', 'Season', 'starter'])['PTS']
benscr = benscr.sum().unstack('starter').reset_index()
benscr['ptspct_bn'] = benscr.no / (benscr.no + benscr.yes)
benscr['ptspct_st'] = 1 - benscr.ptspct_bn
benscr.drop(['no', 'yes'] , inplace=True, axis=1)

In [76]:
benscr[(benscr.team_id==1246) & (benscr.Season == 2012)]


Out[76]:
starter team_id Season ptspct_bn ptspct_st
362 1246 2012 0.214918 0.785082

Another valuable piece of information in the player data is "Class", which describes the player's year in school, from first-year players ("freshmen") to fourth-year players ("seniors"). Teams composed of more experienced players may have an advantage.


In [83]:
players.loc[mask, ['Player', 'Class']]


Out[83]:
Player Class
771 18524 Marquis Teague FR
18520 Anthony Davis FR
18521 Doron Lamb SO
18523 Michael Kidd-Gilchrist FR
18522 Terrence Jones SO
18525 Darius Miller SR
18526 Kyle Wiltjer FR
18528 Eloy Vargas SR
18530 Jarrod Polson SO
18531 Twany Beckham JR
18527 Sam Malone FR
18529 Brian Long FR

Similar to the height data, class is encoded as a string, and I need to convert it to numeric before using it in calculations. There's also some inconsistency in the labels that needs to be cleaned up.


In [30]:
tab(players.Class)


col_0  count
Class       
FR      5294
Fr         1
GR        11
JR      5618
SO      4951
SR      5233
Sr         1
sr         2
col_0     count
Class          
FR     0.250770
Fr     0.000047
GR     0.000521
JR     0.266117
SO     0.234522
SR     0.247880
Sr     0.000047
sr     0.000095

In [31]:
players.Class.fillna('', inplace=True)
players.Class = players.Class.str.upper()
expdict = {'SR': '3', 'JR': '2', 'GR': '3',
           'SO': '1', 'FR': '0', 'MISSING': ""}
players["exp"] = players.Class.map(expdict)
players["exp"] = pd.to_numeric(players.exp, errors='coerce')

The steps above create a numeric experience column ('exp') that describes each player's number of years of college basketball experience.


In [85]:
players.loc[mask, ['Player', 'Class', 'exp']]


Out[85]:
Player Class exp
771 18524 Marquis Teague FR 0.0
18520 Anthony Davis FR 0.0
18521 Doron Lamb SO 1.0
18523 Michael Kidd-Gilchrist FR 0.0
18522 Terrence Jones SO 1.0
18525 Darius Miller SR 3.0
18526 Kyle Wiltjer FR 0.0
18528 Eloy Vargas SR 3.0
18530 Jarrod Polson SO 1.0
18531 Twany Beckham JR 2.0
18527 Sam Malone FR 0.0
18529 Brian Long FR 0.0

Now I can compute some interesting team-level features, such as the average experience level of the starting players.


In [33]:
strtexp = players.groupby(['team_id', 'Season', 'starter'])
strtexp = strtexp['exp'].mean().unstack('starter').reset_index()
strtexp.drop('no' , inplace=True, axis=1)
strtexp.rename(columns={"yes": "strtexp"}, inplace=True)

In [34]:
strtexp.head()


Out[34]:
starter team_id Season strtexp
0 1102 2004 2.0
1 1102 2006 2.2
2 1103 2009 1.6
3 1103 2011 2.2
4 1103 2013 2.0

After computing all of the features I'm interested in, I merge them together to create the processed team-level file with features computed from the roster-level data.


In [35]:
roster_feat = pd.merge(benscr, strtexp, how='outer',
                        on=['team_id', 'Season'])

In [36]:
roster_feat.head()


Out[36]:
starter team_id Season ptspct_bn ptspct_st strtexp
0 1102 2004 0.240683 0.759317 2.0
1 1102 2006 0.195369 0.804631 2.2
2 1103 2009 0.430872 0.569128 1.6
3 1103 2011 0.359946 0.640054 2.2
4 1103 2013 0.398487 0.601513 2.0

Matching team names

For the data files obtained from Kaggle, consistent use of a unique numeric identifier for all 364 teams allows seamless joins of Kaggle files by team.


In [37]:
# read in kaggle team id file
dpath = r'C:\Users\mworley\Dropbox\capstone\data'
teams = pd.read_csv(dpath + r'\raw\Teams.csv')

In [38]:
print(len(teams))
teams.head(5)


364
Out[38]:
Team_Id Team_Name
0 1101 Abilene Chr
1 1102 Air Force
2 1103 Akron
3 1104 Alabama
4 1105 Alabama A&M

The Kaggle team file has 364 teams, each with a unique numeric id code and name. To assist in matching with external data, Kaggle also provides a file with several common team name alternatives.


In [39]:
tm_names = pd.read_csv(dpath + r'\raw\TeamSpellings.csv')

In [40]:
tm_names[tm_names.team_id == 1453]


Out[40]:
name_spelling team_id
286 green bay 1453
287 green-bay 1453
961 wi green bay 1453
973 wis.-green bay 1453
976 wisconsin-green bay 1453
977 wisconsin-green-bay 1453

In many cases the external data I obtained used unique names that did not match these alternative spellings. For example, UC-Berkeley, California, and Cal Golden Bears were different source identifiers for the same team. To resolve this problem, I needed to create a table linking the Kaggle numeric identifier to the unique identifier in each additional data source.

The list of names from the sportsreference source was scraped from a page that lists every team in their database; I pull each team name from its URL text. The scraping code is shown below but commented out; instead I load the data from a .csv file saved after the original scrape.


In [41]:
'''
# get names of teams from sports reference
url = "http://www.sports-reference.com/cbb/schools/"
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')

links = [str(link.get('href')) for link in soup.find_all('a')]
tlinks = links[31:508]
srteams = pd.DataFrame([x.split('/')[-2] for x in tlinks],
                       columns=['srname'])
#srteams.to_csv(dpath + r'\interim\srnames.csv', index=False)
'''
srteams = pd.read_csv(dpath + r'\interim\srteams.csv')

In [42]:
print(len(srteams))
print(srteams.head())


477
              srname
0  abilene-christian
1          air-force
2              akron
3         alabama-am
4            alabama

The sportsreference source has 477 team names. The Kaggle file is smaller (364) because it only includes teams that have played in at least one NCAA tournament game since 1985. I only need to match the 364 Kaggle team names, as this data will be used to predict outcomes of NCAA tournament games.

First I used regular expressions to apply consistent formatting to the Kaggle alternate team names and dropped duplicates.


In [43]:
# adjust kaggle mixed team names to optimally match sports reference tables
srnames = tm_names.copy()
filldash = lambda x: re.sub(r' ', '-', x)
srnames['name_spelling'] = srnames['name_spelling'].apply(filldash)
srnames.rename(columns={"name_spelling": "srname"}, inplace=True)
srnames.drop_duplicates(inplace=True)

In [44]:
srnames.head(6)


Out[44]:
srname team_id
0 a&m-corpus-chris 1394
1 abilene-chr 1101
2 abilene-christian 1101
4 air-force 1102
6 akron 1103
7 alabama 1104

The result is a file I can use to attempt a merge with the sportsreference source team names, as the school names are formatted similarly with all lower-case characters and hyphenated word gaps.


In [45]:
merge_sr = pd.merge(srnames, srteams, how='outer', 
					on='srname',
					indicator=True)

In [46]:
tab(merge_sr._merge)
print(len(merge_sr[merge_sr._merge == 'both']) / 364)


col_0       count
_merge           
left_only     421
right_only    116
both          361
col_0          count
_merge              
left_only   0.468820
right_only  0.129176
both        0.402004
0.991758241758

The merge matches nearly all of the teams (361/364). Before continuing I save the matches as a dataframe.


In [47]:
match = merge_sr[merge_sr._merge == 'both'].copy()
match.drop('_merge', axis=1, inplace=True)

To isolate the kaggle team ids that did not match, I re-merge the team ids with the match file, and then keep the 'left only' rows from the join. I also save the 'right-only' team names for fuzzy string matching later.


In [86]:
# get a dataframe of the mixed names limited to non-matched teams
nomatch = pd.merge(srnames, match, how='outer', 
					on=['team_id'], indicator=True)
nomatch = nomatch[nomatch._merge=='left_only']
teams = merge_sr.loc[merge_sr._merge == 'right_only', 'srname']
nomatch.head(len(nomatch))


Out[86]:
srname_x team_id srname_y _merge
120 centenary 1147 NaN left_only
121 centenary-(la) 1147 NaN left_only
750 w-salem-st 1445 NaN left_only
751 winston-salem 1445 NaN left_only
752 winston-salem-state 1445 NaN left_only
753 w-texas-a&m 1446 NaN left_only
754 west-texas-a&m 1446 NaN left_only

To match the remaining names, I use the series of sportsreference names in a fuzzy string matching function.


In [49]:
# create a function to fuzzy match the nonmatched names
def match_srname(name):
    new_name, score = process.extractOne(name, teams)
    return new_name, score


# run function on kaggle srnames
names, scores = zip(*nomatch['srname_x'].apply(match_srname))
nomatch['name'], nomatch['score'] = names, scores

In [50]:
nomatch.sort_values(['team_id', 'score'], ascending=False, inplace=True)

In [51]:
nomatch.head(len(nomatch))


Out[51]:
                srname_x  team_id srname_y     _merge                 name  score
754       west-texas-a&m     1446      NaN  left_only        west-texas-am     96
753          w-texas-a&m     1446      NaN  left_only        west-texas-am     83
750           w-salem-st     1445      NaN  left_only  st-johns-college-oh     86
751        winston-salem     1445      NaN  left_only       washington-lee     67
752  winston-salem-state     1445      NaN  left_only        western-state     63
121       centenary-(la)     1147      NaN  left_only         centenary-la     96
120            centenary     1147      NaN  left_only         centenary-la     95

The only remaining non-match (winston-salem-state) turned out to be a data error, as that school has never competed in Division I basketball. I remove it, keep the best match for each of the other two team ids, and append those rows to the teams matched in the merge.


In [52]:
nomatch.drop_duplicates(['team_id'], inplace=True)
nomatch = nomatch[nomatch.team_id != 1445]
nomatch.drop(['srname_x', 'srname_y', '_merge', 'score'], axis=1, inplace=True)
nomatch.rename(columns={'name': 'srname'}, inplace=True)
team_match = pd.concat([match, nomatch])

In [53]:
len(team_match)


Out[53]:
363

Computation of team travel distance

NCAA tournament games occur at neutral sites, with each team required to travel some distance to the game. The Kaggle data includes two separate files with the locations (longitude & latitude) of each tournament site and the location of each team’s campus, along with a file for each tournament game.


In [54]:
# import the 3 data files
tgames = pd.read_csv(dpath + r'\interim\tourney_games.csv')
gameloc = pd.read_csv(dpath + r'\raw\TourneyGeog.csv')
teamloc = pd.read_csv(dpath + r'\raw\TeamGeog.csv')

After importing, the first objective is to integrate the game locations with the game results.


In [55]:
# some operations on the dataframes to enable the merge
tgames.columns = map(str.lower, tgames.columns)
gameloc.drop('daynum', axis=1, inplace=True)
# replace baton rouge longitude which was discovered to be an error
gameloc.loc[gameloc.host == 'baton_rouge', ['lng']] = -91.19
gameloc.rename(columns={'wteam': 'w_team_id', 'lteam': 'l_team_id'},
               inplace=True)
tgames = pd.merge(tgames, gameloc, how='inner',
                  on=['season', 'w_team_id', 'l_team_id'])

This merge produces a file with the two teams, host city, and venue location for all tournament games.


In [56]:
tgames[['w_team_id', 'l_team_id', 'host', 'lat', 'lng']].head(5)


Out[56]:
   w_team_id  l_team_id         host      lat       lng
0       1116       1234        tulsa  36.1314  -95.9372
1       1120       1345  albuquerque  35.1107 -106.6100
2       1207       1250     hartford  41.7627  -72.6743
3       1229       1425       dayton  39.7594  -84.1917
4       1242       1325  albuquerque  35.1107 -106.6100

To compute the distance for each team, I merge in the location for each team. I use two separate merges, one for the winning team and one for the losing team.


In [57]:
tgames.rename(columns={'lat': 'glat', 'lng': 'glng'}, inplace=True)
tgames = pd.merge(tgames, teamloc, how='inner',
                  left_on='w_team_id', right_on='team_id')
tgames.rename(columns={'lat': 'wlat', 'lng': 'wlng'}, inplace=True)
tgames = pd.merge(tgames, teamloc, how='inner',
                  left_on='l_team_id', right_on='team_id')
tgames.rename(columns={'lat': 'llat', 'lng': 'llng'}, inplace=True)

In [58]:
tgames.iloc[0:5, -9:]


Out[58]:
             host     glat      glng  team_id_x       wlat        wlng  team_id_y       llat       llng
0           tulsa  36.1314  -95.9372       1116  36.062580  -94.157426       1234  41.661128 -91.530168
1  salt_lake_city  40.7500 -111.8833       1246  37.988689  -84.477715       1234  41.661128 -91.530168
2       uniondale  40.7031  -73.5911       1246  37.988689  -84.477715       1234  41.661128 -91.530168
3         seattle  47.6097 -122.3331       1424  36.114646 -115.172816       1234  41.661128 -91.530168
4      louisville  38.2500  -85.7667       1181  36.005168  -78.940276       1234  41.661128 -91.530168

To compute the distances I need each latitude-longitude pair combined into a single tuple.


In [59]:
tgames['gloc'] = list(zip(tgames.glat, tgames.glng))
tgames['wloc'] = list(zip(tgames.wlat, tgames.wlng))
tgames['lloc'] = list(zip(tgames.llat, tgames.llng))

In [60]:
tgames.iloc[0:5, -3:]


Out[60]:
                   gloc                      wloc                     lloc
0   (36.1314, -95.9372)    (36.06258, -94.157426)  (41.661128, -91.530168)
1    (40.75, -111.8833)   (37.988689, -84.477715)  (41.661128, -91.530168)
2   (40.7031, -73.5911)   (37.988689, -84.477715)  (41.661128, -91.530168)
3  (47.6097, -122.3331)  (36.114646, -115.172816)  (41.661128, -91.530168)
4     (38.25, -85.7667)   (36.005168, -78.940276)  (41.661128, -91.530168)

Then I use the great_circle function from the geopy.distance package to compute the distance between game location and team location for winning team and losing team.


In [61]:
# distance in miles from the game site to each team's campus
xl = []
yl = []

for i in range(len(tgames)):
    x = int(great_circle(tgames['gloc'][i], tgames['wloc'][i]).miles)
    y = int(great_circle(tgames['gloc'][i], tgames['lloc'][i]).miles)
    xl.append(x)
    yl.append(y)

tgames['w_dist'] = pd.Series(xl).values
tgames['l_dist'] = pd.Series(yl).values
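For reference, the distance `great_circle` returns can be approximated with a self-contained haversine calculation (a sketch for intuition, not geopy's actual implementation; Earth radius assumed to be about 3958.8 miles):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(a, b):
    # great-circle distance between two (lat, lng) tuples in degrees
    lat1, lng1 = radians(a[0]), radians(a[1])
    lat2, lng2 = radians(b[0]), radians(b[1])
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2)
    return 2 * 3958.8 * asin(sqrt(h))
```

For the first row above (tulsa as game site, the winning team's campus as origin) this lands near the 99 miles computed by geopy.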

In [62]:
tgames.loc[0:5, ['season', 'w_team_id', 'l_team_id', 'w_dist', 'l_dist']]


Out[62]:
   season  w_team_id  l_team_id  w_dist  l_dist
0    1985       1116       1234      99     449
1    1997       1246       1234    1470    1057
2    2001       1246       1234     610     933
3    1987       1424       1234     874    1558
4    1991       1181       1234     406     385
5    1992       1181       1234      49     737

Identifying and labeling upsets

A perceived “upset” in the NCAA tournament is defined relative to the “seed” system used to rank the teams and set the matchups. The 64 teams are divided into four “regions”; within each region teams receive a ranking or “seed” from 1 (best) to 16 (worst). An “upset” occurs when the team with the numerically higher (i.e., worse) seed defeats the lower-seeded opponent (a 12-seed beats a 5-seed, an 11-seed beats a 6-seed). The “upset” label is typically reserved for victories by teams whose seeds are much larger than their opponents'.
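Stated as a small predicate (a sketch of the definition only, using the seed-gap threshold of 3 applied later in this section):

```python
def is_upset(w_seed, l_seed):
    # upset: the winner carries the numerically higher (worse) seed
    # and the seed gap exceeds 3
    return (w_seed > l_seed) and (abs(w_seed - l_seed) > 3)

is_upset(12, 5)  # True: a 12-seed beating a 5-seed
is_upset(9, 8)   # False: gap of 1 is too small
```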

Kaggle provides data on the final scores of NCAA tournament games, and the seed for each team. To identify upsets, I first merged Kaggle seed data with tournament game data to link the teams in each match-up with their respective tournament seed.


In [63]:
# read in data files
dpath = "C:\Users\mworley\Dropbox\capstone\data"
tgames = pd.read_csv(dpath + r'\raw\TourneyCompactResults.csv')
seeds = pd.read_csv(dpath + r'\raw\TourneySeeds.csv')

# add team seeds to tourney games data frame
seeds['Season/Team'] = [(seas, team) for seas,team in zip(seeds.Season, seeds.Team)]
seeds = seeds.set_index('Season/Team').drop(['Season', 'Team'],axis=1).squeeze().to_dict()
tgames['Wteam_seed'] = [seeds[(year,team)] for year,team in zip(tgames.Season,tgames.Wteam)]
tgames['Lteam_seed'] = [seeds[(year,team)] for year,team in zip(tgames.Season,tgames.Lteam)]
tgames['Wteam_seed'] = tgames['Wteam_seed'].str.replace(r'\D+', '').astype('int')
tgames['Lteam_seed'] = tgames['Lteam_seed'].str.replace(r'\D+', '').astype('int')
tgames.columns = map(str.lower, tgames.columns)
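The `str.replace(r'\D+', '')` calls strip the region characters from the seed strings before the integer cast. A standalone illustration (the example seed strings assume the Kaggle region-letter format, e.g. 'W01', with a trailing letter marking play-in teams):

```python
import re

def seed_to_int(seed):
    # drop every non-digit character (region letter, play-in suffix)
    # and cast what remains to an integer seed
    return int(re.sub(r'\D+', '', seed))

seed_to_int('W01')   # -> 1
seed_to_int('X16a')  # -> 16
```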

Two conditions need to be met for a game to be coded as an upset. Condition 1 is that the game involves opponents with an absolute seed difference greater than 3. Condition 2 is that the team with the higher numeric seed wins the game.

To label upsets, I calculated the absolute seed difference, then wrote two functions to test these conditions and used them to create indicator columns in the tournament games dataset.


In [64]:
tgames['seedif'] = abs(tgames.wteam_seed - tgames.lteam_seed)

# label each matchup as a potential upset (1) or not (0)
def upset_pot(data):
    if data.seedif > 3:
        return 1
    else:
        return 0

# label each matchup as an upset (1) or not (0)
def upset_label(data):
    x = data.seedif
    if (data.wteam_seed > data.lteam_seed) & (x > 3):
        return 1
    else:
        return 0

# identify potential upsets
# defined as games with seed difference greater than 3
tgames['upsetpot'] = tgames.apply(upset_pot, axis=1)
tgames['upset'] = tgames.apply(upset_label, axis=1)

In [65]:
tgames[['wteam_seed', 'lteam_seed', 'seedif', 'upsetpot', 'upset']].head()


Out[65]:
   wteam_seed  lteam_seed  seedif  upsetpot  upset
0           9           8       1         0      0
1          11           6       5         1      1
2           1          16      15         1      0
3           9           8       1         0      0
4           3          14      11         1      0

In [66]:
tab(tgames.upsetpot)


col_0     count
upsetpot       
0           708
1          1342
col_0        count
upsetpot          
0         0.345366
1         0.654634

65% of NCAA tournament games qualified as having "upset potential".


In [87]:
tab(tgames.upset)


col_0  count
upset       
0       1764
1        286
col_0     count
upset          
0      0.860488
1      0.139512

Only 14% of all tournament games resulted in an upset.


In [88]:
tab(tgames[tgames.upsetpot==1].upset)


col_0  count
upset       
0       1056
1        286
col_0     count
upset          
0      0.786885
1      0.213115

Of the games that had potential for an upset, 21% resulted in an upset.


In [ ]: