A set of .csv files provided for the Kaggle March Machine Learning Mania contest (hereafter referred to as Kaggle data) was downloaded from the Kaggle website (www.kaggle.com). From a college basketball team ratings website (www.kenpom.com) I downloaded a set of .csv files with the final pre-tournament scores on team-level efficiency metrics (Kenpom data). Using the Python package BeautifulSoup, I wrote a script to scrape over 3,000 tables of player-level data from two separate sports statistics websites, www.espn.com and www.sports-reference.com. Scraping was also used to obtain lists of annual player-level All-American award winners and annual team-level data (strength of schedule, etc.) from www.sports-reference.com.
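The scraping scripts themselves are not reproduced in this notebook, but the general pattern is sketched below. This is a minimal sketch only: the URL and table layout are hypothetical, and the actual page structures vary by site and season.
In [ ]:
# sketch of the scraping pattern (hypothetical URL and table layout)
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://www.sports-reference.com/cbb/schools/example-team/2012.html'  # hypothetical
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# collect the cell text from every table row on the page
rows = []
for tr in soup.find_all('tr'):
    cells = [td.get_text() for td in tr.find_all('td')]
    if cells:
        rows.append(cells)
player_table = pd.DataFrame(rows)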
In [1]:
# importing packages for wrangling tasks
import pandas as pd
import numpy as np
import re
from fuzzywuzzy import process
from fuzzywuzzy import fuzz
from geopy.distance import great_circle
# create a function to quickly tabulate a dataframe column
def tab(dfcol):
    t = pd.crosstab(index=dfcol, columns="count")
    print t
    print t/t.sum()
pd.set_option('display.max_columns', None)
In [2]:
# load the individual .csv files as pandas dataframes
dpath = r"C:\Users\mworley\Dropbox\capstone\data"
pstats = pd.read_csv(dpath + r'\external\player_data\player_stats.csv')
ros = pd.read_csv(dpath + r'\external\rosters\rosters.csv')
espn = pd.read_csv(dpath + r'\external\espn\espn_tabs.csv')
# merge the two sportsreference files, player stats (pstats) and positions (ros)
sr_players = pd.merge(pstats, ros, how='inner',
on=['Player', 'Team', 'Season'])
After loading the separate sportsreference and espn files, I add some common team identifiers to each dataframe (created in a separate process) to make it easier to match the players.
In [3]:
# load the team identifying data
teams_seasons = pd.read_csv(dpath + r'\interim\teams_seasons.csv')
team_match = pd.read_csv(dpath + r'\interim\team_match.csv')
team_info = pd.merge(teams_seasons, team_match,
how='inner', on='team_id')
# merge team identifier data with player dataframes
sr_players = pd.merge(sr_players, team_info, how='inner',
left_on=['Team', 'Season'],
right_on=['srname', 'season'])
espn_players = pd.merge(espn, team_info, how='inner',
left_on=['espn_id', 'Season'],
right_on=['espn_id', 'season'])
# keep only columns I need to match players
sr_players = sr_players.loc[:, ['Player', 'srname', 'Season', 'espn_id']]
sr_players.drop_duplicates(inplace=True)
espn_players = espn_players.loc[:, ['Player', 'srname', 'Season', 'espn_id']]
espn_players.drop_duplicates(inplace=True)
# keep only years after 2001 in sportsreference file to match with espn player data
sr_players = sr_players[sr_players['Season'] > 2001]
To match the players in the two different files, I first used a merge with an outer join and an indicator to examine the matches and non-matches after the merge.
In [4]:
mrg_players = pd.merge(sr_players, espn_players,
how='outer',
on=['Player', 'srname' , 'Season', 'espn_id'],
indicator=True)
Here I used the tab function created earlier to inspect the proportion of names that matched perfectly and were linked by this merge.
In [5]:
tab(mrg_players['_merge'])
This merge left around 8% of names unmatched. I needed to separate the nonmatched names, determine why they didn't match, and figure out a way to link them together.
To get a snapshot of the causes of the nonmatches, let's look at a snippet of 10 rows.
In [6]:
nomatch = mrg_players[mrg_players['_merge'] != "both"].copy()
nomatch.sort_values(['srname', 'Season'], inplace=True)
nomatch.head(10)
Out[6]:
The same players are present in both files, but their names vary due to punctuation, spelling, or use of nicknames.
To match the remaining names, I used the Python package fuzzywuzzy. Its extractOne function (process.extractOne) compares a given string to a supplied list of strings and extracts the one "best match" from the list. The function returns a tuple of the best match and a score from 0 to 100 that reflects the accuracy of the match.
Here I test the function using names from the table above. I pass it a string to be matched ('Mo Williams') and a list of two strings: the correct match ('Maurice Williams') and a different player ('Deji Ibetayo').
In [7]:
process.extractOne('Mo Williams', ['Maurice Williams', 'Deji Ibetayo'])
Out[7]:
To match all the names, my first strategy was to use the function out-of-the-box and simply pass it all the nonmatched names, but this didn't work well. Here's an example of why:
In [8]:
process.extractOne('Mo Williams', ['Maurice Williams', 'John Williams'])
Out[8]:
If the function can find a name that is more syntactically similar, it will extract it and miss the correct match. For my purposes, the function might work better with fewer options: a restricted list of candidates specific to each player.
To solve this problem, I wrote a function that identifies each player's team and season and extracts a match only from the espn names for that same team and season.
In [9]:
# create dataframe of non-matched player names, separately for each source
nomatch_sr = nomatch[nomatch._merge == "left_only"].copy()
nomatch_sr.drop('_merge', inplace=True, axis=1)
nomatch_espn = nomatch[nomatch._merge == "right_only"].copy()
nomatch_espn.drop('_merge', inplace=True, axis=1)
# group by team and season, create dictionary of non-matched espn names to use in the function
e = nomatch_espn.groupby(['srname','Season'])['Player']
espn_dict = dict(list(e))
# write the function to selectively match using the player's team and season
plist = []
def match_name_team(row):
    try:
        p = row['Player']
        t = row['srname']
        s = row['Season']
        names = espn_dict.get((t, s))
        n, scr = process.extractOne(p, names)
        rec = (p, t, s, n, scr)
        plist.append(rec)
    except:
        # no espn candidates exist for this team and season
        pass
# apply the function to the nonmatched sportsreference player dataframe
nomatch_sr.apply(match_name_team, axis=1)
df = pd.DataFrame(plist, columns=('Player', 'srname', 'Season',
'Player_espn', 'score'))
After running the function, I can inspect the matched names to assess the quality of the matches. Here are the ten highest-scoring matches.
In [10]:
df.sort_values('score', ascending=False).head(10)
Out[10]:
These are names with minor differences in spelling or punctuation that are all matched well by the function.
Here are the names of players with the lowest scores, all those below 65.
In [11]:
# inspect low-scoring matches
df[df['score'] < 65].sort_values('score', ascending=False)
Out[11]:
In [12]:
# everything above 50 looks right, how many names are below 50?
len(df[df.score < 50])
Out[12]:
A score of 50 looks like a reasonable cutoff: below it, the paired names may not actually be the same player. Inspecting a handful of these pairs revealed several players who existed in one file but not the other because they were on the official roster but never received playing time. So the low-scoring pairs mostly appear to be true nonmatches.
In [13]:
# keep only matches scoring above 50
df_c50 = df.loc[df.score > 50].copy()
In [14]:
# combine the perfect matches and fuzzy matches into one dataframe
df_c50.drop('score', inplace=True, axis=1)
match = mrg_players[mrg_players['_merge'] == "both"].copy()
match.drop(['espn_id', '_merge'], inplace=True, axis=1)
match['Player_espn'] = match.Player
player_match = match.append(df_c50, ignore_index=True)
The result is a table containing each player's name in the sportsreference source and the espn source, even when different names are used. I can test the effectiveness of the matched names by attempting to merge the two sources of data:
In [16]:
# re-create sportsreference and espn player dataframes
sr_players = pd.merge(pstats, ros, how='inner',
on=['Player', 'Team', 'Season'])
espn = pd.read_csv(dpath + r'\external\espn\espn_tabs.csv')
# merge team identifier data with player dataframes
sr_players = pd.merge(sr_players, team_info, how='inner',
left_on=['Team', 'Season'],
right_on=['srname', 'season'])
espn_players = pd.merge(espn, team_info, how='inner',
left_on=['espn_id', 'Season'],
right_on=['espn_id', 'season'])
# keep only years after 2001 in sportsreference file to match with espn player data
sr_players = sr_players[sr_players['Season'] > 2001]
# merge sportsreference file with the player name matches
mrg1 = pd.merge(sr_players, player_match, how='inner',
on=['Player', 'srname', 'Season'])
mrg2 = pd.merge(mrg1, espn_players, how='outer',
left_on=['Player_espn', 'srname', 'Season'],
right_on=['Player', 'srname', 'season'],
indicator=True)
In [17]:
tab(mrg2._merge)
Above I can see that 99.4% of the player data is now linked after this merge.
In [18]:
player_match = pd.read_csv(dpath + r'\interim\player_match.csv')
players = pd.merge(pstats, ros, how='inner',
on=['Player', 'Team', 'Season'])
players = pd.merge(players, player_match, how='outer',
left_on=['Player', 'Season', 'Team'],
right_on=['Player', 'Season', 'srname'])
players.drop('srname', inplace=True, axis=1)
ncols = ['Player_espn', 'GPes', 'MINes', 'PPGes', 'RPGes', 'APGes', 'SPGes',
'BPGes', 'TPGes', 'FGPCTes', 'FTPCTes', '3PTPCTes', 'Teames',
'espn_id', 'Season']
espn.columns = ncols
players = pd.merge(players, espn, how='outer',
left_on=['Player_espn', 'Season'],
right_on=['Player_espn', 'Season'])
players = players.dropna(subset = ['Team'])
players = pd.merge(players, team_info, how='inner',
left_on=['Team', 'Season'],
right_on=['srname', 'season'],
indicator=True)
Here are a few sample rows of these data, taken from the 2011-2012 Kentucky Wildcats.
In [77]:
mask = (players.team_id == 1246) & (players.Season == 2012)
players[mask]
Out[77]:
These data represent season-long averages for individual players in several game-play statistics tracked in basketball, such as minutes played, points, rebounds, and assists. Because my goal is to use this data to predict outcomes of NCAA tournament games, I need to transform these data into team-level features. Below are some of the obstacles in this task and the solutions I created.
Player height is stored as a string, and I need to convert it to a numeric value.
In [80]:
players.loc[mask, ['Player', 'Height']]
Out[80]:
In [21]:
# change series from object data type to string
players['Height'] = players['Height'].astype(str)
# create a function that converts string height to total inches
def inches(height):
    try:
        f = int(height.split('-')[0])
        i = int(height.split('-')[1])
        return f * 12 + i
    except:
        return np.nan
players['Heightnum'] = players['Height'].apply(inches)
In [81]:
players.loc[mask, ['Player', 'Height', 'Heightnum']]
Out[81]:
Now height is numeric and ready for conversion to a team-level feature. In basketball, greater height is usually an advantage, so I'm interested in a feature that describes the overall height of the team. However, the simple team average is not a good solution: backups who rarely play would be weighted the same as players who play most of the game, and my goal is a feature that describes the height of the team during actual game play.
To address this problem, I created a function that calculates each team's total minutes, then computed each player's percentage of those minutes, and created a column with each player's height weighted by his minutes percentage. The team-level feature is then the team total of minutes-weighted height.
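For example (with hypothetical numbers): a 78-inch player who logs 30 of his team's 200 total player-minutes has a minutes percentage of 30 / 200 = 0.15 and contributes 0.15 × 78 = 11.7 inches to the team's minutes-weighted height total.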
In [ ]:
def team_minutes(group):
    s = group.name
    # minutes played data only available after 2001
    if s[1] > 2001:
        group['tmins'] = group['MINes'].sum()
        return group
    else:
        return group
players = players.groupby(['team_id', 'Season'], as_index=False).apply(team_minutes)
players['pminpct'] = players.MINes / players.tmins
players['phgtmins'] = players.pminpct * players.Heightnum
In [82]:
players.loc[mask, ['Player', 'Heightnum', 'pminpct', 'phgtmins']]
Out[82]:
In [22]:
flrmins = players.groupby(['team_id', 'Season'])['phgtmins'].sum().reset_index()
In [23]:
flrmins.sort_values(['Season', 'phgtmins'], ascending=False).head(5)
Out[23]:
I'm also interested in how scoring is distributed across different groups of players, such as starters vs. bench players or guards vs. forwards.
The player data doesn't indicate whether a player is a "starter" (one of the 5 players on the floor when the game begins), but typically the starters are also the 5 players who play the most minutes. I use minutes played to determine whether each player is a starter.
In [24]:
def get_starters(group):
    s = group.name
    if s[1] > 2001:
        group.sort_values('MINes', ascending=False, inplace=True)
        group['starter'] = 'no'
        i = group.columns.get_loc('starter')
        group.iloc[0:5, i] = 'yes'
        return group
    else:
        return group
players = players.groupby(['team_id', 'Season'], as_index=False).apply(get_starters)
In [75]:
mask = (players.team_id == 1246) & (players.Season == 2012)
players.loc[mask, ['Player', 'MINes', 'starter', 'PTS']]
Out[75]:
Now I can use the starter column to compute some interesting team-level features, such as the percentage of points on the team scored by the bench and the starters.
In [27]:
benscr = players.groupby(['team_id', 'Season', 'starter'])['PTS']
benscr = benscr.sum().unstack('starter').reset_index()
benscr['ptspct_bn'] = benscr.no / (benscr.no + benscr.yes)
benscr['ptspct_st'] = 1 - benscr.ptspct_bn
benscr.drop(['no', 'yes'] , inplace=True, axis=1)
In [76]:
benscr[(benscr.team_id==1246) & (benscr.Season == 2012)]
Out[76]:
Another valuable piece of information in the player data is "Class", which describes the player's year in school, from first-year players ("freshmen") to fourth-year players ("seniors"). Teams composed of more experienced players may have an advantage.
In [83]:
players.loc[mask, ['Player', 'Class']]
Out[83]:
Similar to the height data, class is encoded as a string, and I need to convert it to numeric before using it in calculations. There's also some inconsistency in the labels that needs to be cleaned up.
In [30]:
tab(players.Class)
In [31]:
players.Class.fillna('', inplace=True)
players.Class = map(str.upper, players.Class)
expdict = {'SR': '3', 'JR': '2', 'GR': '3',
'SO': '1', 'FR': '0', 'MISSING': ""}
players["exp"] = players.Class.map(expdict)
players["exp"] = pd.to_numeric(players.exp, errors='coerce')
The steps above create a numeric experience column ('exp') that describes each player's number of years of college basketball experience.
In [85]:
players.loc[mask, ['Player', 'Class', 'exp']]
Out[85]:
Now I can compute some interesting team-level features, such as the average experience level of the starting players.
In [33]:
strtexp = players.groupby(['team_id', 'Season', 'starter'])
strtexp = strtexp['exp'].mean().unstack('starter').reset_index()
strtexp.drop('no' , inplace=True, axis=1)
strtexp.rename(columns={"yes": "strtexp"}, inplace=True)
In [34]:
strtexp.head()
Out[34]:
After computing all of the features I'm interested in, I merge them together to create the processed team-level file with features computed from the roster-level data.
In [35]:
roster_feat = pd.merge(benscr, strtexp, how='outer',
on=['team_id', 'Season'])
In [36]:
roster_feat.head()
Out[36]:
In [37]:
# read in kaggle team id file
dpath = r'C:\Users\mworley\Dropbox\capstone\data'
teams = pd.read_csv(dpath + r'\raw\Teams.csv')
In [38]:
print(len(teams))
teams.head(5)
Out[38]:
The Kaggle team file has 364 teams, each with a unique numeric id code and name. To assist in matching with external data, Kaggle also provides a file with several common alternative spellings of each team name.
In [39]:
tm_names = pd.read_csv(dpath + r'\raw\TeamSpellings.csv')
In [40]:
tm_names[tm_names.team_id == 1453]
Out[40]:
In many cases the external data I obtained used unique names that did not match these alternative spellings. For example, UC-Berkeley, California, and Cal Golden Bears were different source identifiers for the same team. To resolve this problem, I needed to create a table linking the Kaggle numeric identifier to the unique identifier in each additional data source.
The list of names from the sportsreference source was scraped from a page that lists every team in their database; I pull each team name from its URL text. The code is shown below but commented out; instead I load the data from a .csv file saved after the original scrape.
In [41]:
'''
# get names of teams from sports reference
import requests
from bs4 import BeautifulSoup
url = "http://www.sports-reference.com/cbb/schools/"
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')
links = []
for link in soup.find_all('a'):
    links.append(str(link.get('href')))
tlinks = links[31:508]
srteams = map(lambda x: x.split('/')[-2], tlinks)
srteams = pd.DataFrame(srteams)
srteams.columns = ['srname']
#srteams.to_csv(dpath + r'\interim\srnames.csv', index=False)
'''
srteams = pd.read_csv(dpath + r'\interim\srteams.csv')
In [42]:
print len(srteams)
print srteams.head()
The sportsreference source has 477 team names. The Kaggle file is smaller (364) because it only includes teams that have played in at least one NCAA tournament game since 1985. I only need to match the 364 Kaggle team names, since these data will be used to predict outcomes of NCAA tournament games.
First I used regular expressions to apply consistent formatting to the Kaggle alternate team-name spellings and drop duplicates.
In [43]:
# adjust kaggle mixed team names to optimally match sports reference tables
srnames = tm_names.copy()
filldash = lambda x: re.sub(r' ', '-', x)
srnames['name_spelling'] = srnames['name_spelling'].apply(filldash)
srnames.rename(columns={"name_spelling": "srname"}, inplace=True)
srnames.drop_duplicates(inplace=True)
In [44]:
srnames.head(6)
Out[44]:
The result is a file I can use to attempt a merge with the sportsreference team names, since the school names are now formatted similarly: all lower-case characters with hyphens in place of spaces.
In [45]:
merge_sr = pd.merge(srnames, srteams, how='outer',
on='srname',
indicator=True)
In [46]:
tab(merge_sr._merge)
print float(len(merge_sr[merge_sr._merge=='both'])) / 364
The merge matches nearly all of the teams (361/364). Before continuing I save the matches as a dataframe.
In [47]:
match = merge_sr[merge_sr._merge == 'both'].copy()
match.drop('_merge', axis=1, inplace=True)
To isolate the kaggle team ids that did not match, I re-merge the team ids with the match file, and then keep the 'left only' rows from the join. I also save the 'right-only' team names for fuzzy string matching later.
In [86]:
# get a dataframe of the mixed names limited to non-matched teams
nomatch = pd.merge(srnames, match, how='outer',
on=['team_id'], indicator=True)
nomatch = nomatch[nomatch._merge=='left_only']
teams = merge_sr.loc[merge_sr._merge == 'right_only', 'srname']
nomatch.head(len(nomatch))
Out[86]:
To match the remaining names, I use the series of sportsreference names in a fuzzy string matching function.
In [49]:
# create a function to fuzzy match the nonmatched names
def match_srname(name):
    new_name, score = process.extractOne(name, teams)
    return new_name, score
# run function on kaggle srnames
names, scores = zip(*nomatch['srname_x'].apply(match_srname))
nomatch['name'], nomatch['score'] = names, scores
In [50]:
nomatch.sort_values(['team_id', 'score'], ascending=False, inplace=True)
In [51]:
nomatch.head(len(nomatch))
Out[51]:
The only remaining non-match (winston-salem-state) turned out to be an error, as the school has never competed in Division I basketball. I remove it, keep the best match for each of the other 2 unique team ids, and add them to the teams matched in the merge.
In [52]:
nomatch.drop_duplicates(['team_id'], inplace=True)
nomatch = nomatch[nomatch.team_id != 1445]
nomatch.drop(['srname_x', 'srname_y', '_merge', 'score'], axis=1, inplace=True)
nomatch.rename(columns={'name': 'srname'}, inplace=True)
team_match = pd.concat([match, nomatch])
In [53]:
len(team_match)
Out[53]:
NCAA tournament games occur at neutral sites, with each team required to travel some distance to the game. The Kaggle data includes two separate files with the locations (latitude & longitude) of each tournament site and of each team's campus, along with a file for each tournament game.
In [54]:
# import the 3 data files
tgames = pd.read_csv(dpath + r'\interim\tourney_games.csv')
gameloc = pd.read_csv(dpath + r'\raw\TourneyGeog.csv')
teamloc = pd.read_csv(dpath + r'\raw\TeamGeog.csv')
After importing, the first objective is to integrate the game locations with the game results.
In [55]:
# some operations on the dataframes to enable the merge
tgames.columns = map(str.lower, tgames.columns)
gameloc.drop('daynum', axis=1, inplace=True)
# replace baton rouge longitude which was discovered to be an error
gameloc.loc[gameloc.host == 'baton_rouge', ['lng']] = -91.19
gameloc.rename(columns={'wteam': 'w_team_id', 'lteam': 'l_team_id'},
inplace=True)
tgames = pd.merge(tgames, gameloc, how='inner',
on=['season', 'w_team_id', 'l_team_id'])
This merge produces a file with the two teams, host city, and venue location for all tournament games.
In [56]:
tgames[['w_team_id', 'l_team_id', 'host', 'lat', 'lng']].head(5)
Out[56]:
To compute the distance for each team, I merge in the location for each team. I use two separate merges, one for the winning team and one for the losing team.
In [57]:
tgames.rename(columns={'lat': 'glat', 'lng': 'glng'}, inplace=True)
tgames = pd.merge(tgames, teamloc, how='inner',
left_on='w_team_id', right_on='team_id')
tgames.rename(columns={'lat': 'wlat', 'lng': 'wlng'}, inplace=True)
tgames = pd.merge(tgames, teamloc, how='inner',
left_on='l_team_id', right_on='team_id')
tgames.rename(columns={'lat': 'llat', 'lng': 'llng'}, inplace=True)
In [58]:
tgames.iloc[0:5, -9:]
Out[58]:
To compute the distance I need each latitude and longitude together in a single tuple.
In [59]:
tgames['gloc'] = list(zip(tgames.glat, tgames.glng))
tgames['wloc'] = list(zip(tgames.wlat, tgames.wlng))
tgames['lloc'] = list(zip(tgames.llat, tgames.llng))
In [60]:
tgames.iloc[0:5, -3:]
Out[60]:
Then I use the great_circle function from the geopy.distance package to compute the distance between game location and team location for winning team and losing team.
In [61]:
xl = []
yl = []
for i in range(len(tgames)):
    x = int(great_circle(tgames['gloc'][i], tgames['wloc'][i]).miles)
    y = int(great_circle(tgames['gloc'][i], tgames['lloc'][i]).miles)
    xl.append(x)
    yl.append(y)
tgames['w_dist'] = pd.Series(xl).values
tgames['l_dist'] = pd.Series(yl).values
In [62]:
tgames.loc[0:5, ['season', 'w_team_id', 'l_team_id', 'w_dist', 'l_dist']]
Out[62]:
A perceived “upset” in the NCAA tournament usually refers to the “seed” system used to rank the teams and set the matchups. The 64 teams are divided into four “regions”; in each region teams receive a ranking or “seed” from 1 (best) to 16 (worst). An “upset” occurs when a team with a comparatively higher numerical seed defeats one with a lower seed (a 12 beats a 5, an 11 beats a 6). The “upset” label is typically reserved for victories by teams with seeds much larger than their opponent’s.
Kaggle provides data on the final scores of NCAA tournament games, and the seed for each team. To identify upsets, I first merged Kaggle seed data with tournament game data to link the teams in each match-up with their respective tournament seed.
In [63]:
# read in data files
dpath = r"C:\Users\mworley\Dropbox\capstone\data"
tgames = pd.read_csv(dpath + r'\raw\TourneyCompactResults.csv')
seeds = pd.read_csv(dpath + r'\raw\TourneySeeds.csv')
# add team seeds to tourney games data frame
seeds['Season/Team'] = [(seas, team) for seas,team in zip(seeds.Season, seeds.Team)]
seeds = seeds.set_index('Season/Team').drop(['Season', 'Team'],axis=1).squeeze().to_dict()
tgames['Wteam_seed'] = [seeds[(year,team)] for year,team in zip(tgames.Season,tgames.Wteam)]
tgames['Lteam_seed'] = [seeds[(year,team)] for year,team in zip(tgames.Season,tgames.Lteam)]
tgames['Wteam_seed'] = tgames['Wteam_seed'].str.replace(r'\D+', '').astype('int')
tgames['Lteam_seed'] = tgames['Lteam_seed'].str.replace(r'\D+', '').astype('int')
tgames.columns = map(str.lower, tgames.columns)
Two conditions need to be met for a game to be coded as an upset. Condition 1 is that the game involves opponents with an absolute seed difference greater than 3. Condition 2 is that the team with the higher numeric seed wins the game. For example, a 12-seed beating a 5-seed (seed difference of 7) is coded as an upset, while a 6-vs-3 matchup (difference of only 3) does not even qualify as a potential upset.
To label upsets, I calculated the absolute seed difference, created two functions to test these conditions, and used them to create indicator columns in the tournament games dataset.
In [64]:
tgames['seedif'] = abs(tgames.wteam_seed - tgames.lteam_seed)
# label each matchup as a potential upset (1) or not (0)
def upset_pot(data):
    if data.seedif > 3:
        return 1
    else:
        return 0
# label each matchup as an upset (1) or not (0)
def upset_label(data):
    x = data.seedif
    if (data.wteam_seed > data.lteam_seed) & (x > 3):
        return 1
    else:
        return 0
# identify potential upsets
# defined as games with seed difference greater than 3
tgames['upsetpot'] = tgames.apply(upset_pot, axis=1)
tgames['upset'] = tgames.apply(upset_label, axis=1)
In [65]:
tgames[['wteam_seed', 'lteam_seed', 'seedif', 'upsetpot', 'upset']].head()
Out[65]:
In [66]:
tab(tgames.upsetpot)
65% of NCAA tournament games qualified as having "upset potential".
In [87]:
tab(tgames.upset)
Only 14% of all tournament games resulted in an upset.
In [88]:
tab(tgames[tgames.upsetpot==1].upset)
Of the games that had potential for an upset, 21% resulted in an upset.