Capstone 1 Data Wrangling Project

Data Acquisition Summary

A set of .csv files provided for the Kaggle March Machine Learning Mania contest (hereafter referred to as Kaggle data) was downloaded from the Kaggle website (www.kaggle.com). From a college basketball team ratings website (www.kenpom.com) I downloaded a set of .csv files with the final pre-tournament scores on team-level efficiency metrics (Kenpom data). Using the Python package BeautifulSoup, I wrote a script to scrape over 3,000 tables of player-level data from two separate sports statistics websites, www.espn.com and www.sports-reference.com. Scraping was also used to obtain lists of annual player-level All-American award winners and annual team-level data (strength of schedule, etc.) from www.sports-reference.com.
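The scraping itself is ordinary table extraction. As a minimal self-contained illustration of the idea (using the stdlib html.parser in place of BeautifulSoup, and a made-up table, so this is a sketch rather than the actual scraping script), the core task is to walk the markup and collect each row's cell text:

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the cell text of every <tr>/<td>/<th> row in an HTML table."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

# hypothetical table content, standing in for a scraped stats page
html = """<table>
<tr><th>Player</th><th>PPG</th></tr>
<tr><td>Anthony Davis</td><td>14.2</td></tr>
</table>"""
parser = TableParser()
parser.feed(html)
# parser.rows -> [['Player', 'PPG'], ['Anthony Davis', '14.2']]
```

In the real script, BeautifulSoup's `find_all('table')` plays the role of this parser, and the collected rows are written to the .csv files loaded below.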

Data Wrangling

The variety of data required a sequence of data wrangling processes to produce the final set of features and labels for the machine learning phase. Several of these data wrangling processes are described below.


In [1]:
# importing packages for wrangling tasks
import pandas as pd
import numpy as np
import re
from fuzzywuzzy import process
from fuzzywuzzy import fuzz
from geopy.distance import great_circle

# create a function to quickly tabulate a dataframe column
def tab(dfcol):
    t = pd.crosstab(index=dfcol, columns="count")
    print(t)
    print(t / t.sum())
    
pd.set_option('display.max_columns', None)

Fuzzy matching player names

Different player-level data were available in the two sets of roster files (sportsreference and espn), which required linking the files player-to-player before computing team features.


In [2]:
# load the individual .csv files as pandas dataframes
dpath = r"C:\Users\mworley\Dropbox\capstone\data"
pstats = pd.read_csv(dpath + r'\external\player_data\player_stats.csv')
ros = pd.read_csv(dpath + r'\external\rosters\rosters.csv')
espn = pd.read_csv(dpath + r'\external\espn\espn_tabs.csv')

# merge the two sportsreference files, player stats (pstats) and positions (ros)
sr_players = pd.merge(pstats, ros, how='inner',
                      on=['Player', 'Team', 'Season'])

After loading the separate sportsreference and espn files, I add some common team identifiers to each dataframe (created in a separate process) to make it easier to match the players.


In [3]:
# load the team identifying data
teams_seasons = pd.read_csv(dpath + r'\interim\teams_seasons.csv')
team_match = pd.read_csv(dpath + r'\interim\team_match.csv')
team_info = pd.merge(teams_seasons, team_match,
                     how='inner', on='team_id')

# merge team identifier data with player dataframes
sr_players = pd.merge(sr_players, team_info, how='inner',
						left_on=['Team', 'Season'],
						right_on=['srname', 'season'])
espn_players = pd.merge(espn, team_info, how='inner',
						left_on=['espn_id', 'Season'],
						right_on=['espn_id', 'season'])

# keep only columns I need to match players
sr_players = sr_players.loc[:, ['Player', 'srname', 'Season', 'espn_id']]
sr_players.drop_duplicates(inplace=True)
espn_players = espn_players.loc[:, ['Player', 'srname', 'Season', 'espn_id']]
espn_players.drop_duplicates(inplace=True)

# keep only years after 2001 in sportsreference file to match with espn player data
sr_players = sr_players[sr_players['Season'] > 2001]

To match the players in the two different files, I first used a merge with an outer join and an indicator to examine the matches and non-matches after the merge.


In [4]:
mrg_players = pd.merge(sr_players, espn_players,
						how='outer',
					   	on=['Player', 'srname' , 'Season', 'espn_id'],
						indicator=True)

Here I use the tab function created earlier to inspect the proportion of names that were perfect matches and linked by this merge.


In [5]:
tab(mrg_players['_merge'])


col_0       count
_merge           
left_only     547
right_only    593
both        12950
col_0          count
_merge              
left_only   0.038822
right_only  0.042087
both        0.919092

This merge left around 8% of names unmatched. I needed to separate the nonmatched names, determine why they didn't match, and figure out a way to link them together.

To get a snapshot of the causes of the nonmatches, let's look at a snippet of 10 rows.


In [6]:
nomatch = mrg_players[mrg_players['_merge'] != "both"].copy()
nomatch.sort_values(['srname', 'Season'], inplace=True)
nomatch.head(10)


Out[6]:
Player srname Season espn_id _merge
31 Brett Mcknight akron 2009 2006.0 left_only
13497 Brett McKnight akron 2009 2006.0 right_only
46 Brett Mcknight akron 2011 2006.0 left_only
13498 Brett McKnight akron 2011 2006.0 right_only
68 Adedeji Ibitayo akron 2013 2006.0 left_only
13499 Deji Ibitayo akron 2013 2006.0 right_only
76 Mo Williams alabama 2002 333.0 left_only
13500 Maurice Williams alabama 2002 333.0 right_only
11537 Mo Williams alabama 2003 333.0 left_only
13979 Maurice Williams alabama 2003 333.0 right_only

The same players are present in both files, but their names vary due to punctuation, spelling, or the use of nicknames.
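Some of these variants could be resolved by simple string normalization before any fuzzy matching. A quick sketch (not part of the original pipeline, which goes straight to fuzzy matching):

```python
import re

def normalize(name):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(cleaned.split())

# case and punctuation variants collapse to the same key
assert normalize("Brett Mcknight") == normalize("Brett McKnight")
assert normalize("Baye Moussa Keita") == normalize("Baye-Moussa Keita")
# but nickname and spelling variants do not, so fuzzy matching is still needed
assert normalize("Mo Williams") != normalize("Maurice Williams")
```

This handles the McKnight-style cases, but the Williams-style nickname cases motivate the fuzzy approach below.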

To match the remaining names, I used the Python package fuzzywuzzy. I used its extractOne function (process.extractOne), which compares a given string to a supplied list of strings and extracts the one "best match" from the list. The function returns a tuple of the best match and a score that reflects the accuracy of the match.

Here I test the function using names from the table above. I pass it a string to be matched ('Mo Williams'), and a list of 2 strings, the right match ('Maurice Williams') and a different player ('Deji Ibetayo').


In [7]:
process.extractOne('Mo Williams', ['Maurice Williams', 'Deji Ibetayo'])


Out[7]:
('Maurice Williams', 80)

To match all the names, my first strategy was to use the function out-of-the-box and simply pass it all the nonmatched names, but this didn't work well. Here's an example of why:


In [8]:
process.extractOne('Mo Williams', ['Maurice Williams', 'John Williams'])


Out[8]:
('John Williams', 83)

If the function can find a name that is more syntactically similar, it will extract it and miss the correct match. For my purposes, the function might work better if it had fewer options, perhaps a more restrictive list specific to that player.
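This failure mode is easy to reproduce. Here is a sketch using the stdlib difflib as a stand-in for fuzzywuzzy (the scores differ, but the behavior is the same), showing how restricting the candidate list fixes the problem:

```python
import difflib

# all non-matched espn names pooled together: the decoy wins on raw similarity
pooled = ["Maurice Williams", "John Williams", "Deji Ibitayo"]
best_pooled = difflib.get_close_matches("Mo Williams", pooled, n=1)[0]
# best_pooled == "John Williams", not the true match

# candidates keyed by (team, season): only teammates from the same year compete
by_team_season = {("alabama", 2002): ["Maurice Williams"]}
best_restricted = difflib.get_close_matches(
    "Mo Williams", by_team_season[("alabama", 2002)], n=1, cutoff=0.5)[0]
# best_restricted == "Maurice Williams"
```

Keying the candidate lists by (team, season) is exactly the restriction used in the function below.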

To solve this problem, I wrote a function that identifies each player's team and season and extracts a match only from espn names on that same team and season.


In [9]:
# create dataframe of non-matched player names, separately for each source
nomatch_sr = nomatch[nomatch._merge == "left_only"].copy()
nomatch_sr.drop('_merge', inplace=True, axis=1)
nomatch_espn = nomatch[nomatch._merge == "right_only"].copy()
nomatch_espn.drop('_merge', inplace=True, axis=1)

# group by team and season, create dictionary of non-matched espn names to use in the function
e = nomatch_espn.groupby(['srname','Season'])['Player']
espn_dict = dict(list(e))

# write the function to selectively match using the player's team and season
plist = []
def match_name_team(row):
    p, t, s = row['Player'], row['srname'], row['Season']
    candidates = espn_dict.get((t, s))
    # skip players with no non-matched espn names for that team and season
    if candidates is None:
        return
    n, scr = process.extractOne(p, candidates)
    plist.append((p, t, s, n, scr))
    

# apply the function to the nonmatched sportsreference player dataframe
nomatch_sr.apply(match_name_team, axis=1)
df = pd.DataFrame(plist, columns=('Player', 'srname', 'Season',
                 'Player_espn', 'score'))

After running the function, I can inspect the matched names to assess the quality of the matches. Here are the 10 matches with the highest scores.


In [10]:
df.sort_values('score', ascending=False).head(10)


Out[10]:
Player srname Season Player_espn score
0 Brett Mcknight akron 2009 Brett McKnight 100
462 Malcolm Riley texas-southern 2015 Malcolm Riley 100
72 Jaquon Parker cincinnati 2012 JaQuon Parker 100
71 Jaquon Parker cincinnati 2011 JaQuon Parker 100
424 Baye Moussa Keita syracuse 2013 Baye-Moussa Keita 100
426 Baye Moussa Keita syracuse 2014 Baye-Moussa Keita 100
373 Joey Lepak saint-louis 2014 Joey LePak 100
429 Rafael Deleon temple 2008 Rafael DeLeon 100
370 Joey Lepak saint-louis 2013 Joey LePak 100
62 Roger Moute a Bidias california 2016 Roger Moute A Bidias 100

These are names with minor differences in spelling or punctuation that are all matched well by the function.

Here are the names of players with the lowest scores, all those below 65.


In [11]:
# inspect low-scoring matches
df[df['score'] < 65].sort_values('score', ascending=False)


Out[11]:
Player srname Season Player_espn score
22 Darius Gardner american 2014 Pee Wee Gardner 64
255 Trantell Knight middle-tennessee 2013 Tweety Knight 64
256 J.R. Lyle milwaukee 2014 JeVon Lyle 64
363 Stephan Bennett robert-morris 2015 Stephan Hawkins 61
438 Levan Shawn Alston temple 2016 Shizz Alston Jr. 58
514 Phillip Perry wisconsin 2007 Mickey Perry 56
35 Daylon Guy arkansas-little-rock 2011 Chuck Guy 48
265 Brent Arrington mississippi-valley-state 2012 Roger Hubbard Jr. 39
233 Josh Heath miami-fl 2016 James Palmer Jr. 36
456 Randall Jackson texas-southern 2003 Michael Sneed 36
431 Scootie Randall temple 2009 Rafael DeLeon 36
374 Royce Simpson saint-louis 2014 Joey LePak 35
107 Walter Pitchford florida 2012 Erik Jay Murphy 32
461 Eric Washington texas-southern 2014 Ray Penn Jr. 31
395 Beau Brown south-dakota-state 2016 Adam Dykman 29
372 Austin Eagleton saint-louis 2014 Mike McCall Jr. 28
444 John Johnston texas-am 2006 Kenneth 'Red' White 26
459 Edward Chung texas-southern 2014 Ray Penn Jr. 26
460 Bobby Griffin texas-southern 2014 Ray Penn Jr. 25
396 Skyler Flatten south-dakota-state 2016 Adam Dykman 24
458 Davonte Chaney texas-southern 2014 Ray Penn Jr. 24
170 Mark Johnson indiana 2002 Scott May 19
90 Terrell White davidson 2002 Puff Summïss 8

In [12]:
# everything above 50 looks right, how many names are below 50?
len(df[df.score < 50])


Out[12]:
17

Below 50 looks like a reasonable cutoff, as pairs scoring lower may not actually be the same players. Inspecting a handful of these revealed several players who existed in only one file, typically because they were on the official roster but never logged any playing time. So they mostly appear to be true nonmatches.


In [13]:
# keep only matches scoring above 50
df_c50 = df.loc[df.score > 50].copy()

In [14]:
# combine the perfect matches and fuzzy matches into one dataframe
df_c50.drop('score', inplace=True, axis=1)
match = mrg_players[mrg_players['_merge'] == "both"].copy()
match.drop(['espn_id', '_merge'], inplace=True, axis=1)
match['Player_espn'] = match.Player
player_match = pd.concat([match, df_c50], ignore_index=True)

The result is a table containing each player's name in the sportsreference source and the espn source, even when different names are used. I can test the effectiveness of the matched names by attempting to merge the two sources of data:


In [16]:
# re-create sportsreference and espn player dataframes
sr_players = pd.merge(pstats, ros, how='inner',
                      on=['Player', 'Team', 'Season'])
espn = pd.read_csv(dpath + r'\external\espn\espn_tabs.csv')

# merge team identifier data with player dataframes
sr_players = pd.merge(sr_players, team_info, how='inner',
						left_on=['Team', 'Season'],
						right_on=['srname', 'season'])
espn_players = pd.merge(espn, team_info, how='inner',
						left_on=['espn_id', 'Season'],
						right_on=['espn_id', 'season'])

# keep only years after 2001 in sportsreference file to match with espn player data
sr_players = sr_players[sr_players['Season'] > 2001]

# merge sportsreference file with the player name matches
mrg1 = pd.merge(sr_players, player_match, how='inner',
				on=['Player', 'srname', 'Season'])
mrg2 = pd.merge(mrg1, espn_players, how='outer',
               left_on=['Player_espn', 'srname', 'Season'],
               right_on=['Player', 'srname', 'season'],
               indicator=True)

In [17]:
tab(mrg2._merge)


col_0       count
_merge           
left_only       0
right_only     75
both        13477
col_0          count
_merge              
left_only   0.000000
right_only  0.005534
both        0.994466

Above I can see that 99.4% of the player data is now linked after this merge.

Processing Roster data

After matching on player names, the roster data file contains each player’s descriptive information (class, position, height) and in-game performance over the course of a season (minutes played, points scored, rebounds, assists, etc.).


In [18]:
player_match = pd.read_csv(dpath + r'\interim\player_match.csv')
players = pd.merge(pstats, ros, how='inner',
                   on=['Player', 'Team', 'Season'])
players = pd.merge(players, player_match, how='outer',
                    left_on=['Player', 'Season', 'Team'],
                    right_on=['Player', 'Season', 'srname'])
players.drop('srname', inplace=True, axis=1)
ncols = ['Player_espn', 'GPes', 'MINes', 'PPGes', 'RPGes', 'APGes', 'SPGes',
        'BPGes', 'TPGes', 'FGPCTes', 'FTPCTes', '3PTPCTes', 'Teames',
        'espn_id', 'Season']
espn.columns = ncols
players = pd.merge(players, espn, how='outer',
                    left_on=['Player_espn', 'Season'],
                    right_on=['Player_espn', 'Season'])
players = players.dropna(subset = ['Team'])
players = pd.merge(players, team_info, how='inner',
                    left_on=['Team', 'Season'],
                    right_on=['srname', 'season'],
                    indicator=True)

Here are a few sample rows of these data, taken from the 2011-2012 Kentucky Wildcats.


In [77]:
mask = (players.team_id == 1246) & (players.Season == 2012)
players[mask]


Out[77]:
# 2P 2P% 2PA 3P 3P% 3PA 3PTPCTes APGes AST BLK BPGes Class FG FG% FGA FGPCTes FT FT% FTA FTPCTes G GPes Height Heightnum MINes MP PF PPGes PTS Player Player_espn Pos RPGes SPGes STL Season TOV TPGes TRB Team Teames _merge espn_id_x espn_id_y match phgtmins pminpct season srname starter team_id tmins exp
771 18524 NaN 3.0 0.438 6.8 0.7 0.325 2.0 0.325 4.8 4.8 0.3 0.3 FR 3.6 0.412 8.8 0.412 2.1 0.714 3.0 0.714 40.0 40.0 6-2 74.0 32.6 32.6 2.3 10.0 10.0 Marquis Teague Marquis Teague G 2.5 0.9 0.9 2012 2.7 2.7 2.5 kentucky Kentucky Wildcats both 96.0 96.0 match 11.537064 0.155906 2012 kentucky yes 1246 209.1 0.0
18520 NaN 5.2 0.653 7.9 0.1 0.150 0.5 0.150 1.3 1.3 4.7 4.7 FR 5.3 0.623 8.4 0.623 3.6 0.709 5.1 0.709 40.0 40.0 6-10 82.0 32.0 32.0 2.0 14.2 14.2 Anthony Davis Anthony Davis F 10.4 1.4 1.4 2012 1.0 1.0 10.4 kentucky Kentucky Wildcats both 96.0 96.0 match 12.549020 0.153037 2012 kentucky yes 1246 209.1 0.0
18521 NaN 2.5 0.481 5.2 1.9 0.466 4.1 0.466 1.5 1.5 0.1 0.1 SO 4.4 0.474 9.2 0.474 3.1 0.826 3.7 0.826 40.0 40.0 6-4 76.0 31.2 31.2 1.8 13.7 13.7 Doron Lamb Doron Lamb G 2.7 0.5 0.5 2012 1.1 1.1 2.7 kentucky Kentucky Wildcats both 96.0 96.0 match 11.340029 0.149211 2012 kentucky yes 1246 209.1 1.0
18523 NaN 3.7 0.535 6.9 0.3 0.255 1.3 0.255 1.9 1.9 0.9 0.9 FR 4.0 0.491 8.2 0.491 3.6 0.745 4.8 0.745 40.0 40.0 6-7 79.0 31.1 31.1 2.4 11.9 11.9 Michael Kidd-Gilchrist Michael Kidd-Gilchrist F 7.4 1.0 1.0 2012 2.2 2.2 7.4 kentucky Kentucky Wildcats both 96.0 96.0 match 11.749880 0.148733 2012 kentucky yes 1246 209.1 0.0
18522 NaN 4.2 0.528 8.0 0.4 0.327 1.3 0.327 1.3 1.3 1.8 1.8 SO 4.7 0.500 9.3 0.500 2.6 0.627 4.2 0.627 38.0 38.0 6-9 81.0 29.3 29.3 2.4 12.3 12.3 Terrence Jones Terrence Jones F 7.2 1.3 1.3 2012 1.6 1.6 7.2 kentucky Kentucky Wildcats both 96.0 96.0 match 11.350072 0.140124 2012 kentucky yes 1246 209.1 1.0
18525 NaN 2.2 0.569 3.8 1.4 0.376 3.7 0.376 2.1 2.1 0.3 0.3 SR 3.6 0.474 7.6 0.474 1.4 0.797 1.7 0.797 40.0 40.0 6-8 80.0 26.1 26.1 2.2 9.9 9.9 Darius Miller Darius Miller G 2.8 0.8 0.8 2012 1.5 1.5 2.8 kentucky Kentucky Wildcats both 96.0 96.0 match 9.985653 0.124821 2012 kentucky no 1246 209.1 3.0
18526 NaN 0.9 0.444 2.0 0.9 0.432 2.0 0.432 0.4 0.4 0.4 0.4 FR 1.8 0.438 4.1 0.438 0.6 0.815 0.7 0.815 40.0 40.0 6-10 82.0 11.6 11.6 1.1 5.0 5.0 Kyle Wiltjer Kyle Wiltjer F 1.8 0.1 0.1 2012 0.7 0.7 1.8 kentucky Kentucky Wildcats both 96.0 96.0 match 4.549020 0.055476 2012 kentucky no 1246 209.1 0.0
18528 NaN 0.4 0.343 1.1 0.0 0.000 0.0 0.000 0.1 0.1 0.3 0.3 SR 0.4 0.333 1.1 0.333 0.1 0.308 0.4 0.308 33.0 33.0 6-11 83.0 6.0 6.0 0.8 0.8 0.8 Eloy Vargas Eloy Vargas F 1.7 0.1 0.1 2012 0.2 0.2 1.7 kentucky Kentucky Wildcats both 96.0 96.0 match 2.381636 0.028694 2012 kentucky no 1246 209.1 3.0
18530 NaN 0.0 0.000 0.1 0.0 0.000 0.1 0.000 0.1 0.1 0.0 0.0 SO 0.0 0.000 0.2 0.000 0.1 0.250 0.4 0.250 11.0 11.0 6-2 74.0 2.8 2.8 0.4 0.1 0.1 Jarrod Polson Jarrod Polson G 0.4 0.2 0.2 2012 0.5 0.5 0.4 kentucky Kentucky Wildcats both 96.0 96.0 match 0.990913 0.013391 2012 kentucky no 1246 209.1 1.0
18531 NaN 0.0 0.000 0.1 0.0 NaN 0.0 0.000 0.1 0.1 0.0 0.0 JR 0.0 0.000 0.1 0.000 0.0 NaN 0.0 0.000 16.0 16.0 6-5 77.0 2.8 2.8 0.0 0.0 0.0 Twany Beckham Twany Beckham G 0.5 0.1 0.1 2012 0.1 0.1 0.5 kentucky Kentucky Wildcats both 96.0 96.0 match 1.031086 0.013391 2012 kentucky no 1246 209.1 2.0
18527 NaN 0.5 0.500 1.0 0.0 NaN 0.0 0.000 0.2 0.2 0.0 0.0 FR 0.5 0.500 1.0 0.500 0.0 NaN 0.0 0.000 6.0 6.0 5-11 71.0 2.2 2.2 0.0 1.0 1.0 Sam Malone Sam Malone G 0.3 0.0 0.0 2012 0.7 0.7 0.3 kentucky Kentucky Wildcats both 96.0 96.0 match 0.747011 0.010521 2012 kentucky no 1246 209.1 0.0
18529 NaN 0.0 0.000 0.1 0.0 NaN 0.0 0.000 0.0 0.0 0.0 0.0 FR 0.0 0.000 0.1 0.000 0.2 0.500 0.3 0.500 12.0 12.0 5-9 69.0 1.4 1.4 0.0 0.2 0.2 Brian Long Brian Long G 0.2 0.0 0.0 2012 0.0 0.0 0.2 kentucky Kentucky Wildcats both 96.0 96.0 match 0.461980 0.006695 2012 kentucky no 1246 209.1 0.0

These data represent season-long averages for individual players in several game-play statistics tracked in basketball, such as minutes played, points, rebounds, and assists. Because my goal is to use this data to predict outcomes of NCAA tournament games, I need to transform these data into team-level features. Below are some of the obstacles in this task and the solutions I created.

Player height is stored as a string, and I need to convert it to numeric


In [80]:
players.loc[mask, ['Player', 'Height']]


Out[80]:
Player Height
771 18524 Marquis Teague 6-2
18520 Anthony Davis 6-10
18521 Doron Lamb 6-4
18523 Michael Kidd-Gilchrist 6-7
18522 Terrence Jones 6-9
18525 Darius Miller 6-8
18526 Kyle Wiltjer 6-10
18528 Eloy Vargas 6-11
18530 Jarrod Polson 6-2
18531 Twany Beckham 6-5
18527 Sam Malone 5-11
18529 Brian Long 5-9

In [21]:
# change series from object data type to string
players['Height'] = players['Height'].astype(str)

# create a function that converts string height (e.g. '6-2') to total inches
def inches(height):
    try:
        f, i = height.split('-')
        return int(f) * 12 + int(i)
    except ValueError:
        return np.nan

players['Heightnum'] = players['Height'].apply(inches)
players['Heightnum'].head()


Out[21]:
0    80.0
1    75.0
2    77.0
3    76.0
4    79.0
Name: Heightnum, dtype: float64

In [81]:
players.loc[mask, ['Player', 'Height', 'Heightnum']]


Out[81]:
Player Height Heightnum
771 18524 Marquis Teague 6-2 74.0
18520 Anthony Davis 6-10 82.0
18521 Doron Lamb 6-4 76.0
18523 Michael Kidd-Gilchrist 6-7 79.0
18522 Terrence Jones 6-9 81.0
18525 Darius Miller 6-8 80.0
18526 Kyle Wiltjer 6-10 82.0
18528 Eloy Vargas 6-11 83.0
18530 Jarrod Polson 6-2 74.0
18531 Twany Beckham 6-5 77.0
18527 Sam Malone 5-11 71.0
18529 Brian Long 5-9 69.0

Now height is numeric and ready for conversion to a team-level feature. In basketball, greater height is usually an advantage, so I'm interested in a feature that describes the overall height of the team. However, the simple team average is not a good solution: backups who rarely play would be weighted the same as players who play most of the game, and my goal is a feature that describes the height of the team during actual game play.
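A toy example (made-up numbers, not from the real data) shows the difference between the two approaches:

```python
# three hypothetical players: height in inches and minutes per game
heights = [84, 76, 72]
minutes = [35.0, 30.0, 5.0]

simple_avg = sum(heights) / len(heights)
# ~77.3: the rarely-used 72-inch reserve drags the average down

total_mins = sum(minutes)
weighted_avg = sum(h * m / total_mins for h, m in zip(heights, minutes))
# ~79.7: closer to the height actually on the floor during games
```

The minutes-weighted figure better reflects the lineup opponents actually face.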

To address this problem, I created a function that calculates each team's total minutes, then computed each player's percentage of those minutes, and created a column representing each player's height weighted by his minute percentage. Summing this column gives the team-level feature: total minutes-weighted height.


In [ ]:
def team_minutes(group):
    # minutes played data only available after 2001
    if group.name[1] > 2001:
        group['tmins'] = group['MINes'].sum()
    return group

players = players.groupby(['team_id', 'Season'], as_index=False).apply(team_minutes)
players['pminpct'] = players.MINes / players.tmins
players['phgtmins'] = players.pminpct * players.Heightnum

In [82]:
players.loc[mask, ['Player', 'Heightnum', 'pminpct', 'phgtmins']]


Out[82]:
Player Heightnum pminpct phgtmins
771 18524 Marquis Teague 74.0 0.155906 11.537064
18520 Anthony Davis 82.0 0.153037 12.549020
18521 Doron Lamb 76.0 0.149211 11.340029
18523 Michael Kidd-Gilchrist 79.0 0.148733 11.749880
18522 Terrence Jones 81.0 0.140124 11.350072
18525 Darius Miller 80.0 0.124821 9.985653
18526 Kyle Wiltjer 82.0 0.055476 4.549020
18528 Eloy Vargas 83.0 0.028694 2.381636
18530 Jarrod Polson 74.0 0.013391 0.990913
18531 Twany Beckham 77.0 0.013391 1.031086
18527 Sam Malone 71.0 0.010521 0.747011
18529 Brian Long 69.0 0.006695 0.461980

In [22]:
flrmins = players.groupby(['team_id', 'Season'])['phgtmins'].sum().reset_index()

In [23]:
flrmins.sort_values(['Season', 'phgtmins'], ascending=False).head(5)


Out[23]:
team_id Season phgtmins
892 1268 2016 78.900609
405 1181 2016 78.849138
563 1211 2016 78.806509
212 1143 2016 78.638099
1818 1425 2016 78.568829

I'm also interested in how scoring is distributed across different groups of players, such as starters vs. bench players, or guards vs. forwards.

The player data doesn't indicate whether a player is a "starter" (one of the 5 players on the floor when the game begins), but typically the starters are the 5 players who play the most minutes. I use minutes played to determine whether a player is a starter.


In [24]:
def get_starters(group):
    # minutes played data only available after 2001
    if group.name[1] > 2001:
        group.sort_values('MINes', ascending=False, inplace=True)
        group['starter'] = 'no'
        i = group.columns.get_loc('starter')
        group.iloc[0:5, i] = 'yes'
    return group

players = players.groupby(['team_id', 'Season'], as_index=False).apply(get_starters)

In [75]:
mask = (players.team_id == 1246) & (players.Season == 2012)
players.loc[mask, ['Player', 'MINes', 'starter', 'PTS']]


Out[75]:
Player MINes starter PTS
771 18524 Marquis Teague 32.6 yes 10.0
18520 Anthony Davis 32.0 yes 14.2
18521 Doron Lamb 31.2 yes 13.7
18523 Michael Kidd-Gilchrist 31.1 yes 11.9
18522 Terrence Jones 29.3 yes 12.3
18525 Darius Miller 26.1 no 9.9
18526 Kyle Wiltjer 11.6 no 5.0
18528 Eloy Vargas 6.0 no 0.8
18530 Jarrod Polson 2.8 no 0.1
18531 Twany Beckham 2.8 no 0.0
18527 Sam Malone 2.2 no 1.0
18529 Brian Long 1.4 no 0.2

Now I can use the starter column to compute some interesting team-level features, such as the percentage of points on the team scored by the bench and the starters.


In [27]:
benscr = players.groupby(['team_id', 'Season', 'starter'])['PTS']
benscr = benscr.sum().unstack('starter').reset_index()
benscr['ptspct_bn'] = benscr.no / (benscr.no + benscr.yes)
benscr['ptspct_st'] = 1 - benscr.ptspct_bn
benscr.drop(['no', 'yes'] , inplace=True, axis=1)

In [76]:
benscr[(benscr.team_id==1246) & (benscr.Season == 2012)]


Out[76]:
starter team_id Season ptspct_bn ptspct_st
362 1246 2012 0.214918 0.785082

Another valuable piece of information in the player data is "Class", which describes the player's year in school, from first-year players ("freshmen") to fourth-year players ("seniors"). Teams composed of more experienced players may have an advantage.


In [83]:
players.loc[mask, ['Player', 'Class']]


Out[83]:
Player Class
771 18524 Marquis Teague FR
18520 Anthony Davis FR
18521 Doron Lamb SO
18523 Michael Kidd-Gilchrist FR
18522 Terrence Jones SO
18525 Darius Miller SR
18526 Kyle Wiltjer FR
18528 Eloy Vargas SR
18530 Jarrod Polson SO
18531 Twany Beckham JR
18527 Sam Malone FR
18529 Brian Long FR

Similar to the height data, class is encoded as a string, and I need to convert it to numeric before using it in calculations. There's also some inconsistency in the labels that needs to be cleaned up.


In [30]:
tab(players.Class)


col_0  count
Class       
FR      5294
Fr         1
GR        11
JR      5618
SO      4951
SR      5233
Sr         1
sr         2
col_0     count
Class          
FR     0.250770
Fr     0.000047
GR     0.000521
JR     0.266117
SO     0.234522
SR     0.247880
Sr     0.000047
sr     0.000095

In [31]:
players.Class.fillna('', inplace=True)
players.Class = players.Class.str.upper()
expdict = {'SR': '3', 'JR': '2', 'GR': '3',
           'SO': '1', 'FR': '0', 'MISSING': ""}
players["exp"] = players.Class.map(expdict)
players["exp"] = pd.to_numeric(players.exp, errors='coerce')

The steps above create a numeric experience column ('exp') that describes each player's number of years of college basketball experience.


In [85]:
players.loc[mask, ['Player', 'Class', 'exp']]


Out[85]:
Player Class exp
771 18524 Marquis Teague FR 0.0
18520 Anthony Davis FR 0.0
18521 Doron Lamb SO 1.0
18523 Michael Kidd-Gilchrist FR 0.0
18522 Terrence Jones SO 1.0
18525 Darius Miller SR 3.0
18526 Kyle Wiltjer FR 0.0
18528 Eloy Vargas SR 3.0
18530 Jarrod Polson SO 1.0
18531 Twany Beckham JR 2.0
18527 Sam Malone FR 0.0
18529 Brian Long FR 0.0

Now I can compute some interesting team-level features, such as the average experience level of the starting players.


In [33]:
strtexp = players.groupby(['team_id', 'Season', 'starter'])
strtexp = strtexp['exp'].mean().unstack('starter').reset_index()
strtexp.drop('no' , inplace=True, axis=1)
strtexp.rename(columns={"yes": "strtexp"}, inplace=True)

In [34]:
strtexp.head()


Out[34]:
starter team_id Season strtexp
0 1102 2004 2.0
1 1102 2006 2.2
2 1103 2009 1.6
3 1103 2011 2.2
4 1103 2013 2.0

After computing all of the features I'm interested in, I merge them together to create the processed team-level file with features computed from the roster-level data.


In [35]:
roster_feat = pd.merge(benscr, strtexp, how='outer',
                        on=['team_id', 'Season'])

In [36]:
roster_feat.head()


Out[36]:
starter team_id Season ptspct_bn ptspct_st strtexp
0 1102 2004 0.240683 0.759317 2.0
1 1102 2006 0.195369 0.804631 2.2
2 1103 2009 0.430872 0.569128 1.6
3 1103 2011 0.359946 0.640054 2.2
4 1103 2013 0.398487 0.601513 2.0

Matching team names

For the data files obtained from Kaggle, consistent use of a unique numeric identifier for all 364 teams allows seamless joins of Kaggle files by team.


In [37]:
# read in kaggle team id file
dpath = r'C:\Users\mworley\Dropbox\capstone\data'
teams = pd.read_csv(dpath + r'\raw\Teams.csv')

In [38]:
print(len(teams))
teams.head(5)


364
Out[38]:
Team_Id Team_Name
0 1101 Abilene Chr
1 1102 Air Force
2 1103 Akron
3 1104 Alabama
4 1105 Alabama A&M

The Kaggle team file has 364 teams, each with a unique numeric id code and name. To assist in matching with external data, Kaggle also provides a file with several common team name alternatives.


In [39]:
tm_names = pd.read_csv(dpath + r'\raw\TeamSpellings.csv')

In [40]:
tm_names[tm_names.team_id == 1453]


Out[40]:
name_spelling team_id
286 green bay 1453
287 green-bay 1453
961 wi green bay 1453
973 wis.-green bay 1453
976 wisconsin-green bay 1453
977 wisconsin-green-bay 1453

In many cases the external data I obtained used unique names that did not match these alternative spellings. For example, UC-Berkeley, California, and Cal Golden Bears were different source identifiers for the same team. To resolve this problem, I needed to create a table linking the Kaggle numeric identifier to the unique identifier in each additional data source.

The list of names from the sportsreference source was scraped from a page that lists every team in their database; I pull each team name from its URL text. The scraping code is shown below but commented out; instead I load the data from a .csv file saved after the original scrape.


In [41]:
'''
# get names of teams from sports reference
url = "http://www.sports-reference.com/cbb/schools/"
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')

links = [str(link.get('href')) for link in soup.find_all('a')]
tlinks = links[31:508]
srteams = pd.DataFrame([x.split('/')[-2] for x in tlinks],
                       columns=['srname'])
#srteams.to_csv(dpath + r'\interim\srnames.csv', index=False)
'''
srteams = pd.read_csv(dpath + r'\interim\srteams.csv')

In [42]:
print(len(srteams))
print(srteams.head())


477
              srname
0  abilene-christian
1          air-force
2              akron
3         alabama-am
4            alabama

The sportsreference source has 477 team names. The Kaggle file is smaller (364) because it only includes teams that have played in at least one NCAA tournament game since 1985. I only need to match the 364 Kaggle team names, as this data will be used to predict outcomes of NCAA tournament games.

First I used regular expressions to apply consistent formatting to the Kaggle alternate team names and dropped duplicates.


In [43]:
# adjust kaggle mixed team names to optimally match sports reference tables
srnames = tm_names.copy()
filldash = lambda x: re.sub(r' ', '-', x)
srnames['name_spelling'] = srnames['name_spelling'].apply(filldash)
srnames.rename(columns={"name_spelling": "srname"}, inplace=True)
srnames.drop_duplicates(inplace=True)

In [44]:
srnames.head(6)


Out[44]:
srname team_id
0 a&m-corpus-chris 1394
1 abilene-chr 1101
2 abilene-christian 1101
4 air-force 1102
6 akron 1103
7 alabama 1104

The result is a file I can use to attempt a merge with the sportsreference source team names, as the school names are formatted similarly with all lower-case characters and hyphenated word gaps.


In [45]:
merge_sr = pd.merge(srnames, srteams, how='outer', 
					on='srname',
					indicator=True)

In [46]:
tab(merge_sr._merge)
print(len(merge_sr[merge_sr._merge == 'both']) / 364)


col_0       count
_merge           
left_only     421
right_only    116
both          361
col_0          count
_merge              
left_only   0.468820
right_only  0.129176
both        0.402004
0.991758241758

The merge matches nearly all of the teams (361/364). Before continuing I save the matches as a dataframe.


In [47]:
match = merge_sr[merge_sr._merge == 'both'].copy()
match.drop('_merge', axis=1, inplace=True)

To isolate the kaggle team ids that did not match, I re-merge the team ids with the match file, and then keep the 'left only' rows from the join. I also save the 'right-only' team names for fuzzy string matching later.


In [86]:
# get a dataframe of the mixed names limited to non-matched teams
nomatch = pd.merge(srnames, match, how='outer', 
					on=['team_id'], indicator=True)
nomatch = nomatch[nomatch._merge=='left_only']
teams = merge_sr.loc[merge_sr._merge == 'right_only', 'srname']
nomatch.head(len(nomatch))


Out[86]:
srname_x team_id srname_y _merge
120 centenary 1147 NaN left_only
121 centenary-(la) 1147 NaN left_only
750 w-salem-st 1445 NaN left_only
751 winston-salem 1445 NaN left_only
752 winston-salem-state 1445 NaN left_only
753 w-texas-a&m 1446 NaN left_only
754 west-texas-a&m 1446 NaN left_only

To match the remaining names, I use the series of sportsreference names in a fuzzy string matching function.


In [49]:
# create a function to fuzzy match the nonmatched names
def match_srname(name):
    new_name, score = process.extractOne(name, teams)
    return new_name, score


# run function on kaggle srnames
names, scores = zip(*nomatch['srname_x'].apply(match_srname))
nomatch['name'], nomatch['score'] = names, scores

In [50]:
nomatch.sort_values(['team_id', 'score'], ascending=False, inplace=True)

In [51]:
nomatch.head(len(nomatch))


Out[51]:
                srname_x  team_id srname_y     _merge                 name  score
754       west-texas-a&m     1446      NaN  left_only        west-texas-am     96
753          w-texas-a&m     1446      NaN  left_only        west-texas-am     83
750           w-salem-st     1445      NaN  left_only  st-johns-college-oh     86
751        winston-salem     1445      NaN  left_only       washington-lee     67
752  winston-salem-state     1445      NaN  left_only        western-state     63
121       centenary-(la)     1147      NaN  left_only         centenary-la     96
120            centenary     1147      NaN  left_only         centenary-la     95

The only remaining non-match (winston-salem-state) turned out to be a data error, as that school has never competed in Division I basketball. I remove it, keep the best match for each of the other two team ids, and append those rows to the teams matched in the merge.


In [52]:
nomatch.drop_duplicates(['team_id'], inplace=True)
nomatch = nomatch[nomatch.team_id != 1445]
nomatch.drop(['srname_x', 'srname_y', '_merge', 'score'], axis=1, inplace=True)
nomatch.rename(columns={'name': 'srname'}, inplace=True)
team_match = pd.concat([match, nomatch])

In [53]:
len(team_match)


Out[53]:
363

Computation of team travel distance

NCAA tournament games occur at neutral sites, with each team required to travel some distance to the game. The Kaggle data includes two separate files with the locations (longitude & latitude) of each tournament site and the location of each team’s campus, along with a file for each tournament game.


In [54]:
# import the 3 data files
tgames = pd.read_csv(dpath + r'\interim\tourney_games.csv')
gameloc = pd.read_csv(dpath + r'\raw\TourneyGeog.csv')
teamloc = pd.read_csv(dpath + r'\raw\TeamGeog.csv')

After importing, the first objective is to integrate the game locations with the game results.


In [55]:
# some operations on the dataframes to enable the merge
tgames.columns = map(str.lower, tgames.columns)
gameloc.drop('daynum', axis=1, inplace=True)
# replace baton rouge longitude which was discovered to be an error
gameloc.loc[gameloc.host == 'baton_rouge', ['lng']] = -91.19
gameloc.rename(columns={'wteam': 'w_team_id', 'lteam': 'l_team_id'},
               inplace=True)
tgames = pd.merge(tgames, gameloc, how='inner',
                  on=['season', 'w_team_id', 'l_team_id'])

This merge produces a file with the two teams, host city, and venue location for all tournament games.


In [56]:
tgames[['w_team_id', 'l_team_id', 'host', 'lat', 'lng']].head(5)


Out[56]:
   w_team_id  l_team_id         host      lat       lng
0       1116       1234        tulsa  36.1314  -95.9372
1       1120       1345  albuquerque  35.1107 -106.6100
2       1207       1250     hartford  41.7627  -72.6743
3       1229       1425       dayton  39.7594  -84.1917
4       1242       1325  albuquerque  35.1107 -106.6100

To compute the distance for each team, I merge in the location for each team. I use two separate merges, one for the winning team and one for the losing team.


In [57]:
tgames.rename(columns={'lat': 'glat', 'lng': 'glng'}, inplace=True)
tgames = pd.merge(tgames, teamloc, how='inner',
                  left_on='w_team_id', right_on='team_id')
tgames.rename(columns={'lat': 'wlat', 'lng': 'wlng'}, inplace=True)
tgames = pd.merge(tgames, teamloc, how='inner',
                  left_on='l_team_id', right_on='team_id')
tgames.rename(columns={'lat': 'llat', 'lng': 'llng'}, inplace=True)

In [58]:
tgames.iloc[0:5, -9:]


Out[58]:
             host     glat      glng  team_id_x       wlat        wlng  team_id_y       llat       llng
0           tulsa  36.1314  -95.9372       1116  36.062580  -94.157426       1234  41.661128 -91.530168
1  salt_lake_city  40.7500 -111.8833       1246  37.988689  -84.477715       1234  41.661128 -91.530168
2       uniondale  40.7031  -73.5911       1246  37.988689  -84.477715       1234  41.661128 -91.530168
3         seattle  47.6097 -122.3331       1424  36.114646 -115.172816       1234  41.661128 -91.530168
4      louisville  38.2500  -85.7667       1181  36.005168  -78.940276       1234  41.661128 -91.530168

To compute the distances I need each latitude-longitude pair combined into a single tuple.


In [59]:
tgames['gloc'] = list(zip(tgames.glat, tgames.glng))
tgames['wloc'] = list(zip(tgames.wlat, tgames.wlng))
tgames['lloc'] = list(zip(tgames.llat, tgames.llng))

In [60]:
tgames.iloc[0:5, -3:]


Out[60]:
                   gloc                      wloc                     lloc
0   (36.1314, -95.9372)    (36.06258, -94.157426)  (41.661128, -91.530168)
1    (40.75, -111.8833)   (37.988689, -84.477715)  (41.661128, -91.530168)
2   (40.7031, -73.5911)   (37.988689, -84.477715)  (41.661128, -91.530168)
3  (47.6097, -122.3331)  (36.114646, -115.172816)  (41.661128, -91.530168)
4     (38.25, -85.7667)   (36.005168, -78.940276)  (41.661128, -91.530168)

Then I use the great_circle function from the geopy.distance package to compute the distance between game location and team location for winning team and losing team.


In [61]:
# distance in miles from the game site to each team's campus
xl = []
yl = []

for i in range(len(tgames)):
    x = int(great_circle(tgames['gloc'][i], tgames['wloc'][i]).miles)
    y = int(great_circle(tgames['gloc'][i], tgames['lloc'][i]).miles)
    xl.append(x)
    yl.append(y)

tgames['w_dist'] = pd.Series(xl).values
tgames['l_dist'] = pd.Series(yl).values
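For reference, the distance `great_circle` returns can be approximated with a self-contained haversine calculation (a sketch for intuition, not geopy's actual implementation; Earth radius assumed to be about 3958.8 miles):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(a, b):
    # great-circle distance between two (lat, lng) tuples in degrees
    lat1, lng1 = radians(a[0]), radians(a[1])
    lat2, lng2 = radians(b[0]), radians(b[1])
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2)
    return 2 * 3958.8 * asin(sqrt(h))
```

For the first row above (tulsa as game site, the winning team's campus as origin) this lands near the 99 miles computed by geopy.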

In [62]:
tgames.loc[0:5, ['season', 'w_team_id', 'l_team_id', 'w_dist', 'l_dist']]


Out[62]:
   season  w_team_id  l_team_id  w_dist  l_dist
0    1985       1116       1234      99     449
1    1997       1246       1234    1470    1057
2    2001       1246       1234     610     933
3    1987       1424       1234     874    1558
4    1991       1181       1234     406     385
5    1992       1181       1234      49     737

Identifying and labeling upsets

A perceived “upset” in the NCAA tournament is defined relative to the “seed” system used to rank the teams and set the matchups. The 64 teams are divided into four “regions”; within each region teams receive a ranking or “seed” from 1 (best) to 16 (worst). An “upset” occurs when the team with the numerically higher (i.e., worse) seed defeats the lower-seeded opponent (a 12-seed beats a 5-seed, an 11-seed beats a 6-seed). The “upset” label is typically reserved for victories by teams whose seeds are much larger than their opponents'.
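Stated as a small predicate (a sketch of the definition only, using the seed-gap threshold of 3 applied later in this section):

```python
def is_upset(w_seed, l_seed):
    # upset: the winner carries the numerically higher (worse) seed
    # and the seed gap exceeds 3
    return (w_seed > l_seed) and (abs(w_seed - l_seed) > 3)

is_upset(12, 5)  # True: a 12-seed beating a 5-seed
is_upset(9, 8)   # False: gap of 1 is too small
```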

Kaggle provides data on the final scores of NCAA tournament games, and the seed for each team. To identify upsets, I first merged Kaggle seed data with tournament game data to link the teams in each match-up with their respective tournament seed.


In [63]:
# read in data files
dpath = "C:\Users\mworley\Dropbox\capstone\data"
tgames = pd.read_csv(dpath + r'\raw\TourneyCompactResults.csv')
seeds = pd.read_csv(dpath + r'\raw\TourneySeeds.csv')

# add team seeds to tourney games data frame
seeds['Season/Team'] = [(seas, team) for seas,team in zip(seeds.Season, seeds.Team)]
seeds = seeds.set_index('Season/Team').drop(['Season', 'Team'],axis=1).squeeze().to_dict()
tgames['Wteam_seed'] = [seeds[(year,team)] for year,team in zip(tgames.Season,tgames.Wteam)]
tgames['Lteam_seed'] = [seeds[(year,team)] for year,team in zip(tgames.Season,tgames.Lteam)]
tgames['Wteam_seed'] = tgames['Wteam_seed'].str.replace(r'\D+', '').astype('int')
tgames['Lteam_seed'] = tgames['Lteam_seed'].str.replace(r'\D+', '').astype('int')
tgames.columns = map(str.lower, tgames.columns)
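The `str.replace(r'\D+', '')` calls strip the region characters from the seed strings before the integer cast. A standalone illustration (the example seed strings assume the Kaggle region-letter format, e.g. 'W01', with a trailing letter marking play-in teams):

```python
import re

def seed_to_int(seed):
    # drop every non-digit character (region letter, play-in suffix)
    # and cast what remains to an integer seed
    return int(re.sub(r'\D+', '', seed))

seed_to_int('W01')   # -> 1
seed_to_int('X16a')  # -> 16
```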

Two conditions need to be met for a game to be coded as an upset. Condition 1 is that the game involves opponents with an absolute seed difference greater than 3. Condition 2 is that the team with the higher numeric seed wins the game.

To label upsets, I calculated the absolute seed difference, then wrote two functions to test these conditions and used them to create indicator columns in the tournament games dataset.


In [64]:
tgames['seedif'] = abs(tgames.wteam_seed - tgames.lteam_seed)

# label each matchup as a potential upset (1) or not (0)
def upset_pot(data):
    if data.seedif > 3:
        return 1
    else:
        return 0

# label each matchup as an upset (1) or not (0)
def upset_label(data):
    x = data.seedif
    if (data.wteam_seed > data.lteam_seed) & (x > 3):
        return 1
    else:
        return 0

# identify potential upsets
# defined as games with seed difference greater than 3
tgames['upsetpot'] = tgames.apply(upset_pot, axis=1)
tgames['upset'] = tgames.apply(upset_label, axis=1)

In [65]:
tgames[['wteam_seed', 'lteam_seed', 'seedif', 'upsetpot', 'upset']].head()


Out[65]:
   wteam_seed  lteam_seed  seedif  upsetpot  upset
0           9           8       1         0      0
1          11           6       5         1      1
2           1          16      15         1      0
3           9           8       1         0      0
4           3          14      11         1      0

In [66]:
tab(tgames.upsetpot)


col_0     count
upsetpot       
0           708
1          1342
col_0        count
upsetpot          
0         0.345366
1         0.654634

65% of NCAA tournament games qualified as having "upset potential".


In [87]:
tab(tgames.upset)


col_0  count
upset       
0       1764
1        286
col_0     count
upset          
0      0.860488
1      0.139512

Only 14% of all tournament games resulted in an upset.


In [88]:
tab(tgames[tgames.upsetpot==1].upset)


col_0  count
upset       
0       1056
1        286
col_0     count
upset          
0      0.786885
1      0.213115

Of the games that had potential for an upset, 21% resulted in an upset.


In [ ]: