Ok let's get down to business! So the overall goal is to build a mathematical model that predicts with good accuracy who is likely to make it to the sweet 16 in the NCAA tournament. This project is going to have two parts:
-Performing an optimization/regression in order to write equations that predict a team's performance in the NCAA tournament based on their regular season statistics.
-Putting together a Markov Chain that will use these probabilities to predict the teams most likely to advance in the tourament. I'll explain more about a Markov Chain when we get there.
Let's start with Part 1!
In [ ]:
import pandas as pd
import numexpr
import bottleneck
import numpy as np
import numpy.linalg as linalg
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.stats as ss
In [ ]:
reg_14_15 = pd.read_csv('2014_2015 Regular Season Stats.csv')
In [84]:
#Testing out our system
reg_14_15
Out[84]:
Excellent, now we have datasets. The first thing to do is to rank teams based on their performance in each year's NCAA march-madness tournament. This part of the calculation is rather subjective- I'm going to individually rank teams by how well they did in the tournament. I need to do this because, if you think about it, there were two teams that lost in the final four, four that lost in the elite 8, etc. How do we rank these teams? We could put them at relatively the same ranking, which I will. But I'm also going to differentiate between a bad loss and a close game. So this isn't an exact science but that's ok because the results will show how good my ranking system was.
In the following cells I select only the 64 teams in the NCAA tournament from the above list of every single team in Division 1 College basketball, and I assign each team a ranking (my assigned ranking is in column 34).
In [80]:
reg_14_15 = reg_14_15.rename(columns={'Unnamed: 0': 'Number'})
#renaming the columns with integers so they can be more easily manipulated
d=[]
for i in range(0,34,1):
d.append(i)
d
reg_14_15.columns=[d]
#creating a new dataframe with only the teams in the tournamment
bracket_14_15=reg_14_15.iloc[[7,8,12,14,22,23,35,36,51,55,66,67,75,82,99,100,102,104,108,110,126,129,130,135,139,141,149,153,162,173,177,198,203,206,211,214,218,222,225,226,227,230,242,243,250,263,283,288,290,299,303,316,319,321,325,328,329,330,337,342,345,346,348,349],:]
In [81]:
newCol = [27,56,6,24,33,58,48,19,25,61,44,22,1,54,25,42,23,8,62,51,43,27,33,20,3,64,38,7,27,4,46,59,10,13,57,55,27,5,26,11,40,21,37,39,63,27,35,41,49,45,60,53,15,9,52,17,18,35,16,14,2,47,50,12]
newName = '34'
values = np.insert(bracket_14_15.values,bracket_14_15.shape[1],newCol,axis=1)
header = bracket_14_15.columns.values.tolist()
header.append(newName)
df = pd.DataFrame(values,columns=header)
df
Out[81]:
Now, for easier manipulation, we're going to convert the dataframe into a numpy array. Then we'll divide each value in the array by the total number of games that team played, ensuring we have 'per game' statistics.
In [ ]:
mat = np.zeros((64,32))
for j in range (0,64,1):
for i in range(3,34,1):
val = float(df.iat[j,i])/float(df.iat[j,2])
mat[j,i-3]=val
Next we're going to begin the regression. First, we define a matrix y for our regression such that
$$ \textbf{Y} =\textbf{X} * \textbf{b}$$where Y is our ratings, X is a matrix of our data points (each row represents the statistics for a single team), and b is our coefficients. I'm going to assume a linear relationship for now- I can play around with non-linear regressions later, but we really want to just get values for now and later we can figure out whether our regression is good.
In [ ]:
#creating our y matrix
ratings = np.zeros((64,1))
for j in range(0,64,1):
val = 64 - float(df.iat[j,34])
ratings[j] = val
Since we only want to use the statistics that are correlated with the ratings, we run a spearman correlation test on every statistic and select only the ones below our alpha level of $0.05$. These statistics then form our $\textbf{X}$ matrix. Next we use the "linalg.lstsq" regression function to perform a least squares regression of our data. Finally, I'll compute our predicted rankings by multiplying the $\textbf{X}$ and $b$ matrices.
In [72]:
coeffs = []
for i in range(0,32,1):
results = ss.spearmanr(mat[:,i],ratings)
if results[1] < .05:
coeffs.append(i)
xmat = []
for i in coeffs:
xmat.append(mat[:,i])
result = linalg.lstsq(np.transpose(xmat),ratings)
x_mat = np.asarray(xmat)
x_matT = np.transpose(np.asarray(xmat))
rating = np.transpose(np.asarray(ratings))
npresult = np.asarray(result[0])
dot = np.dot(np.transpose(npresult),x_mat)
dot
dotadjusted = np.zeros((1,64))
for i in range(0,64,1):
if dot[0,i] < 0:
dotadjusted[0,i] = 1
else:
dotadjusted[0,i] = dot[0,i]
Notice above that I had to make a cheeky and dubious adjustment- some of the predicted rankings came out negative, so to ensure that all rankings are positive (we'll need them positive to create our Markov chain), I change all negative rankings to a rank of 0. A higher ranking means a better team.
Alright, we now have an equation with 15 coefficients that predicts the ranking of a team based on its regular season stats. Now we are going to create a Markov Chain using these data!
Let's play a game called the jumping particle.
Consider a particle that can jump between multiple different states. On each turn of the game, the particle has a probability of jumping to another state or remaining in the current state. This group of states represents a Markov chain. The probability that a particle jumps to any particular state is written in the form of a "transition probability matrix." For example, consider a 2-state Markov Chain with states 0 and 1:
$$P = \left[ \begin{array}{cc} 0.4 & 0.6\\ 0.7 & 0.3\end{array}\right] $$In this case, the probability that a particle in state 0 on turn 1 jumps to state 1 on turn 2 is 0.6, and the probability it stays in state 0 is 0.4. Likewise, the probability that a particle in state 1 on turn 1 jumps to state 0 on turn 2 is 0.7 while the probability that it stays in state 1 is 0.3. Notice that each row sums to 1. This makes intuitive sense; the probability that the particle either jumps or stays must add to 1. It turns out that Markov Chains have lots of nice properties that we can exploit. First, however, we have to construct our transition probability matrix for our bracket.
Let's use our ranking system. Adopting a method suggested in Kvam et. al, we can define
$$p_{i,j}= \frac{r_j}{r_i+r_j}$$and $$p_{i,i} = \sum_{j = 1, j \neq i}^{64}\frac{r_i}{r_i+r_j}$$ where $r_i$represents the ranking of team i, $r_j$ the ranking of team j. Notice, however, that there is an issue; this does not necessarily sum to 1 for all the values in a row. In fact,
$$ p_{i,1} + p_{i,2} + ... + p_{i,i-1} + p_{i,i+1} + ... + p_{i,64} + p_{i,i} = $$$$ \frac{r_1}{r_i+r_1} + \frac{r_2}{r_i+r_2} + ... + \frac{r_{i-1}}{r_i+r_{i-1}} + \frac{r_{i+1}}{r_i+r_{i+1}} + ... + \frac{r_{64}}{r_i+r_{64}} + (\frac{r_i}{r_i+r_1} + ... + \frac{r_i}{r_i+r_{i-1}} + \frac{r_i}{r_i+r_{i+1}} + ... + \frac{r_i}{r_i+r_{64}}) = $$$$ \frac{r_i + r_1}{r_i+r_1} + ...\frac{r_i + r_{i-1}}{r_i+r_{i-1}} + \frac{r_i+r_{i+1}}{r_i+r_{i+1}} + ... + \frac{r_i+r_{64}}{r_i+r_{64}} = 63(1) = 63 $$So if we normalize by $\frac{1}{63}$ we should get rows that sum to 1. Now let's write the matrix.
In [ ]:
brac2015 = np.zeros((64,64))
def brac(i):
a=0
for j in range(0,64,1):
a = a + dotadjusted[0,i]/(dotadjusted[0,i]+dotadjusted[0,j])
return 1/(64*.9921875)*a
for i in range(0,64,1):
for j in range(0,64,1):
if i != j:
brac2015[i,j] = 1/(64*.9921875) * dotadjusted[0,j]/(dotadjusted[0,i] + dotadjusted[0,j])
if i == j:
brac2015[i,i] = brac(i)
brac2015transpose = np.transpose(brac2015)
Inexplicably the rows don't add to one unless we use the normalization factor $\frac{1}{64 * 0.9921875}$. No biggie.
This is a special type of Markov chain- because none of the values in the transition matrix are 1 or 0, it's possible to go from any state in the matrix to any other state. We call this a regular Markov chain. In fact, this Markov Chain is regular, aperiodic, and irreducible. The special property of such a Markov chain is that there's a limiting probability distribution. This means that if we evolve the Markov process over infinite iterations (i.e. you randomly go from state 0 to state 1 to state 7 to state 32 etc. etc. infinite times) there is a set probability that the particle will be in any given state at time infinity. The limiting distribution follows this equation:
$$ \pi* \textbf{P} = \pi$$where $\pi$ is the limiting distribution and $\textbf{P}$ is the transition probability matrix we constructed. Notice that $\pi$ is a 64-dimensional vector in our case.
We can use these limiting distributions! If we rank teams by their limiting distribution probabilities, we should be able to see which teams will be the most likely to win the tournament.
The other equation of importance is $$ \pi_1 + ... + \pi_{64} = 1$$ where $\pi = <\pi_1,\pi_2,...,\pi_{64}>$ which makes sense, since the particle must be in $\textit{some}$ state at time infinity (Note: $\pi_i$ is the probability that the particle will be in state i at time infinity).
So now we have 64 equations to solve 64 variables (the $\pi_i$).
In [ ]:
#replace last equation of P with the second boundary condition.
for i in range(0,63,1):
for j in range(0,63,1):
if i == j:
brac2015eq[i,j] = brac2015transpose[i,i] - 1
if i != j:
brac2015eq[i,j] = brac2015transpose[i,j]
for i in range(0,64,1):
brac2015eq[63,i] = 1
b = np.zeros((64,1))
b[63,0] = 1
a = np.zeros((64,1))
c = []
d = []
for i in range(0,64,1):
cat = np.linalg.solve(brac2015eq,b)[i,0]
c.append(cat)
d.append(df.iat[i,1])
e = pd.Series(d)
f = pd.Series(c)
predictions = pd.DataFrame({ 'Team Name' : e,
'Steady State Probability' : f})
finalpredictions = predictions.sort_values(by = 'Steady State Probability')
print(finalpredictions.tail())
Now the interesting part!!! We get to apply this to new data sets. Because I'm still salty about how poorly my bracket did this year (my beloved MSU Spartans fell in the first round...) let's take a look and see whether this rating scheme is good for the 2016 March Madness bracket. First, we need to call a new data set.
In [65]:
reg_15_16 = pd.read_csv('2015_2016 Regular Season Stats.csv')
reg_15_16.head()
Out[65]:
Luckily for me I don't have to create rankings for this set; I can just plug in the regular season stats of the 64 teams in the bracket and see what the program predicts. So let's do that!!!
In [93]:
#teams = the 64 teams in the bracket that year. bracket = the associated data.
def predictor(regseasonstats,teams,vars,coefficients):
'''This function takes in multiple different constraints and outputs the teams most likely to win the NCAA tournament and
their probabilities of winning. Inputs:
regseasonstats = uploaded CSV file containing statistics for all teams as a Pandas Dataframe
teams = a list of the numerical indices associated with the 64 teams in the NCAA bracket that year
vars = the numerical values of the column headers of the variables desired to use in the regression
coefficients = the associated coefficients for each variable.'''
d=[]
for i in range(0,34,1):
d.append(i)
regseasonstats.columns=[d]
bracket = regseasonstats.iloc[teams,:]
mat = np.zeros((64,32))
for j in range (0,64,1):
for i in range(3,34,1):
val = float(bracket.iat[j,i])/float(bracket.iat[j,2])
mat[j,i-3]=val
xmat = []
for i in vars:
xmat.append(mat[:,i])
x_mat = np.asarray(xmat)
np.result = np.asarray(coefficients)
dot = np.dot(np.transpose(npresult),x_mat)
dotadjusted = np.zeros((1,64))
for i in range(0,64,1):
if dot[0,i] < 0:
dotadjusted[0,i] = 1
else:
dotadjusted[0,i] = dot[0,i]
#Making the Markov transition matrix
brac2015 = np.zeros((64,64))
def brac(i):
a=0
for j in range(0,64,1):
a = a + dotadjusted[0,i]/(dotadjusted[0,i]+dotadjusted[0,j])
return 1/(64*.9921875)*a
for i in range(0,64,1):
for j in range(0,64,1):
if i != j:
brac2015[i,j] = 1/(64*.9921875) * dotadjusted[0,j]/(dotadjusted[0,i] + dotadjusted[0,j])
if i == j:
brac2015[i,i] = brac(i)
brac2015transpose = np.transpose(brac2015)
for i in range(0,63,1):
for j in range(0,63,1):
if i == j:
brac2015eq[i,j] = brac2015transpose[i,i] - 1
if i != j:
brac2015eq[i,j] = brac2015transpose[i,j]
for i in range(0,64,1):
brac2015eq[63,i] = 1
b = np.zeros((64,1))
b[63,0] = 1
a = np.zeros((64,1))
mat1 = []
mat2 = []
for i in range(0,64,1):
cat = np.linalg.solve(brac2015eq,b)[i,0]
mat1.append(cat)
mat2.append(bracket.iat[i,1])
teamname = pd.Series(mat2)
probability = pd.Series(mat1)
predictions = pd.DataFrame({ 'Team Name' : teamname,
'Steady State Probability' : probability})
finalpredictions = predictions.sort_values(by = 'Steady State Probability')
return(finalpredictions[48:64])
#Here we define
teams2016 = [12,16,20,22,35,36,38,49,51,58,61,67,75,90,94,104,107,108,111,114,126,128,129,130,135,139,162,170,172
,173,174,203,207,209,218,222,226,230,231,236,242,243,256,269,276,281,290,292,293,294,299,300,305,320,
321,328,329,330,336,337,342,345,349,350]
#so
predictor(reg_15_16,teams2016,coeffs,result[0])
Out[93]:
This predictor correctly predicted:
The winner of the tournament
The runner up of the tournament
$\frac{2}{4}$ final four
$\frac{5}{8}$ elite 8
$\frac{10}{16}$ sweet 16.
I tried to find expert brackets on Sports Illustrated to compare with.
Seth Davis:
$\frac{0}{4}$ final four
$\frac{4}{8}$ elite 8
$\frac{8}{16}$ sweet 16
Pete Thamel:
$\frac{1}{4}$ final four
$\frac{4}{8}$ elite 8
$\frac{13}{16}$ sweet 16
Lindsay Schnell
$\frac{0}{4}$ final four
$\frac{4}{8}$ elite 8
$\frac{8}{16}$ sweet 16
Those were just the first 3 I found, then I got lazy. My bracket is right there in the weeds with them; not better really, but definitely not worse. So this function here took my predicting skills from those of your average basketball fan to those of an SI "expert" ;). So all in all, my bracket predictions for the 2016 tournament are pretty good! But I suspect there may be some inappropriate advantages I gained by just using the 2015 data, since the 2015 and 2016 data are probably related since there are still players from the 2015 teams on the 2016 teams. Thus I need to test this for other years as well.
Before now and next Thursday I'll see if I should rework the regression by analyzing the data better, I'll test the bracket for another year or two, and I'll add more data to make my regression coefficients more accurate!
All datasets came from: http://www.sports-reference.com/cbb/ Sports Reference has kindly and meticulously compiled regular-season statistics on all college basketball teams.
The past brackets came from: http://www.printyourbrackets.com/
Alphabetized NCAA tournament team lists came from: http://www.cbssports.com/collegebasketball/eye-on-college-basketball/25515927/ncaa-bracket-tournament-committees-official-1-68-seed-list
Sports Illustrated expert brackets came from: http://www.si.com/college-basketball/2016/03/14/2016-ncaa-tournament-bracket-expert-picks#_
In [ ]:
In [ ]: