Markov Madness

OK, let's get down to business! The overall goal is to build a mathematical model that predicts, with good accuracy, which teams are likely to make the Sweet 16 in the NCAA tournament. This project has two parts:

Part 1:

-Performing an optimization/regression in order to write equations that predict a team's performance in the NCAA tournament based on their regular season statistics.

Part 2:

-Putting together a Markov Chain that will use these predictions to identify the teams most likely to advance in the tournament. I'll explain more about a Markov Chain when we get there.

Let's start with Part 1!

Part 1: The Regression

In order to perform this regression we need to set up a system to pull data from CSV files.


In [ ]:
import pandas as pd
import numexpr
import bottleneck
import numpy as np
import numpy.linalg as linalg
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.stats as ss

In [ ]:
reg_14_15 = pd.read_csv('2014_2015 Regular Season Stats.csv')

In [84]:
#Testing out our system
reg_14_15


Out[84]:
0 1 2 3 4 5 6 7 8 9 ... 24 25 26 27 28 29 30 31 32 33
0 Rk School ▴ G W L W-L% SRS SOS W L ... FT FTA FT% ORB TRB AST STL BLK TOV PF
1 1 Abilene Christian 31 10 21 0.323 -17.2 -6.34 4 14 ... 354 487 0.727 274 897 394 203 38 399 661
2 2 Air Force 31 14 17 0.452 -1.85 -0.71 6 12 ... 321 487 0.659 265 930 483 195 65 366 552
3 3 Akron 35 21 14 0.6 3.65 -0.47 9 9 ... 406 617 0.658 420 1249 435 230 155 422 654
4 4 Alabama 34 19 15 0.559 10.52 8.09 8 10 ... 539 752 0.717 325 1122 350 219 125 407 654
5 5 Alabama A&M 29 9 20 0.31 -17.15 -9.62 8 10 ... 415 643 0.645 367 998 354 165 68 413 573
6 6 Alabama State 29 19 10 0.655 -8.97 -12.71 14 4 ... 396 609 0.65 389 1151 393 184 122 397 520
7 7 Alabama-Birmingham 36 20 16 0.556 1.94 1.62 12 6 ... 528 715 0.738 434 1322 516 237 184 482 630
8 8 Albany (NY) 33 24 9 0.727 -0.34 -5.18 15 1 ... 493 650 0.758 334 1111 349 187 53 381 544
9 9 Alcorn State 32 6 26 0.188 -20.75 -9.95 4 14 ... 463 704 0.658 376 1038 285 191 86 367 571
10 10 American 33 17 16 0.515 -3.56 -3.71 8 10 ... 359 509 0.705 187 827 408 196 84 370 484
11 11 Appalachian State 29 12 17 0.414 -9.81 -3.18 9 11 ... 381 537 0.709 329 1010 318 122 61 397 551
12 12 Arizona 38 34 4 0.895 24.33 7.41 16 2 ... 700 974 0.719 411 1399 528 275 131 419 681
13 13 Arizona State 34 18 16 0.529 9.52 6.66 9 9 ... 511 766 0.667 365 1184 431 199 79 474 638
14 14 Arkansas 36 27 9 0.75 14.07 6.79 13 5 ... 593 819 0.724 470 1290 579 277 168 430 684
15 15 Arkansas State 29 11 18 0.379 -10.1 -3.81 6 14 ... 372 530 0.702 318 1038 321 154 89 390 496
16 16 Arkansas-Little Rock 31 13 18 0.419 -5.38 -3.81 8 12 ... 445 632 0.704 329 1063 380 174 86 326 588
17 17 Arkansas-Pine Bluff 32 12 20 0.375 -15.65 -8.61 9 9 ... 397 584 0.68 326 927 323 283 108 489 623
18 18 Army 30 15 15 0.5 -6.5 -4.98 6 12 ... 379 557 0.68 325 1033 437 192 106 396 612
19 19 Auburn 35 15 20 0.429 3.27 8.15 4 14 ... 606 910 0.666 356 1104 376 238 89 452 708
20 20 Austin Peay 30 8 22 0.267 -12.45 -2.81 3 13 ... 456 685 0.666 373 1043 360 191 124 428 514
21 21 Ball State 30 7 23 0.233 -5.06 0.04 2 16 ... 389 561 0.693 303 980 383 159 71 424 576
22 22 Baylor 34 24 10 0.706 17.85 9.33 11 7 ... 487 723 0.674 497 1333 497 263 132 434 565
23 23 Belmont 33 22 11 0.667 0.35 -2.77 11 5 ... 405 585 0.692 307 1093 504 209 66 451 549
24 24 Bethune-Cookman 32 11 21 0.344 -15.52 -10.45 7 9 ... 482 672 0.717 372 1056 264 174 84 449 623
25 25 Binghamton 32 6 26 0.188 -13.45 -3.49 5 11 ... 408 618 0.66 288 992 278 186 80 433 577
26 26 Boise State 34 25 9 0.735 10.01 1.38 14 4 ... 431 591 0.729 298 1119 409 206 79 348 582
27 27 Boston College 32 13 19 0.406 6.32 7.94 4 14 ... 412 584 0.705 263 983 421 183 86 389 607
28 28 Boston University 30 13 17 0.433 -5.63 -3.12 9 9 ... 469 674 0.696 263 953 368 170 56 369 522
29 29 Bowling Green State 33 21 12 0.636 2.93 -1.25 11 7 ... 420 588 0.714 355 1156 406 219 123 386 550
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
322 322 Utah State 31 18 13 0.581 2.92 0.28 11 7 ... 389 579 0.672 215 978 445 149 142 323 505
323 323 Utah Valley 30 11 19 0.367 -12.72 -4.87 5 9 ... 357 514 0.695 297 989 387 134 69 385 519
324 324 VMI 30 11 19 0.367 -9.85 -3.92 7 11 ... 415 607 0.684 434 1178 408 255 177 449 646
325 325 Valparaiso 34 28 6 0.824 6.29 -2 13 3 ... 456 676 0.675 397 1276 443 203 157 422 563
326 326 Vanderbilt 35 21 14 0.6 11.89 6.22 9 9 ... 541 764 0.708 339 1184 541 168 186 452 603
327 327 Vermont 34 20 14 0.588 1.21 -5.52 12 4 ... 495 715 0.692 274 1129 446 216 173 442 648
328 328 Villanova 36 33 3 0.917 23.13 7.55 16 2 ... 596 815 0.731 371 1235 570 278 146 382 596
329 329 Virginia 34 30 4 0.882 21.73 7.81 16 2 ... 446 614 0.726 358 1232 430 186 143 316 484
330 330 Virginia Commonwealth 36 26 10 0.722 12.79 6.1 12 6 ... 524 797 0.657 444 1277 456 342 153 384 690
331 331 Virginia Tech 33 11 22 0.333 1.29 5.32 2 16 ... 357 558 0.64 274 978 395 144 97 386 598
332 332 Wagner 30 10 20 0.333 -11.47 -4.99 8 10 ... 429 616 0.696 445 1112 324 195 79 420 681
333 333 Wake Forest 32 13 19 0.406 4.62 7.68 5 13 ... 479 751 0.638 375 1200 451 208 123 428 579
334 334 Washington 31 16 15 0.516 5.57 4.83 5 13 ... 405 593 0.683 329 1082 412 152 153 361 535
335 335 Washington State 31 13 18 0.419 0.14 6.08 7 11 ... 489 679 0.72 282 1026 453 120 102 367 570
336 336 Weber State 30 13 17 0.433 -7.29 -3.5 8 10 ... 511 683 0.748 283 997 358 158 120 402 539
337 337 West Virginia 35 25 10 0.714 15.12 9.29 11 7 ... 571 863 0.662 575 1272 494 376 96 453 821
338 338 Western Carolina 32 15 17 0.469 -5.83 -3.56 9 9 ... 456 651 0.7 401 1130 375 260 141 435 671
339 339 Western Illinois 28 8 20 0.286 -13.81 -2.93 3 13 ... 323 463 0.698 207 907 335 148 67 352 483
340 340 Western Kentucky 32 20 12 0.625 1.12 -0.44 12 6 ... 451 690 0.654 363 1190 456 175 106 413 580
341 341 Western Michigan 34 20 14 0.588 0.25 0.19 10 8 ... 527 750 0.703 338 1108 424 220 77 406 615
342 342 Wichita State 35 30 5 0.857 15.56 3.68 17 1 ... 501 715 0.701 415 1245 476 250 134 327 587
343 343 William & Mary 33 20 13 0.606 2.82 -1.15 12 6 ... 453 617 0.734 248 1065 487 204 137 379 513
344 344 Winthrop 32 19 13 0.594 -3.9 -7.38 12 6 ... 481 664 0.724 332 1127 388 224 140 387 574
345 345 Wisconsin 40 36 4 0.9 24.61 10.29 16 2 ... 584 765 0.763 383 1338 504 174 127 296 501
346 346 Wofford 35 28 7 0.8 2.56 -2.98 16 2 ... 473 688 0.688 324 1139 461 228 87 379 589
347 347 Wright State 31 11 20 0.355 -6.41 -1.14 3 13 ... 365 524 0.697 232 902 407 168 75 373 594
348 348 Wyoming 35 25 10 0.714 3.81 -0.04 11 7 ... 448 638 0.702 216 1047 501 191 102 392 481
349 349 Xavier 37 23 14 0.622 15.42 9.48 9 9 ... 566 775 0.73 364 1282 596 223 106 447 661
350 350 Yale 33 22 10 0.688 3.53 -0.87 11 4 ... 464 653 0.711 360 1139 437 188 127 406 552
351 351 Youngstown State 32 11 21 0.344 -7.63 -1.73 2 14 ... 465 670 0.694 314 1055 450 192 67 374 586

352 rows × 34 columns
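
The first data row above holds the real column names (Rk, School, G, ...), which suggests the CSV carries two header lines. A minimal sketch, assuming that layout, of letting pandas read the second line as the header; the inline CSV text is a shortened stand-in for the real file, not the file itself.

```python
import io
import pandas as pd

# Shortened stand-in for the Sports Reference export (assumed two header lines).
raw = io.StringIO(
    ",,Overall,Overall,Overall\n"
    "Rk,School,G,W,L\n"
    "1,Abilene Christian,31,10,21\n"
    "2,Air Force,31,14,17\n"
)

# header=1 uses the second line as the column names and skips the line before it.
stats = pd.read_csv(raw, header=1)
print(stats.columns.tolist())
print(stats.dtypes)
```

Read this way, the numeric columns arrive as numbers instead of strings. Two caveats: in the full file pandas would rename the repeated W and L columns to W.1 and L.1, and the row positions used later in this notebook assume the header row is still row 0, so they would shift by one.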

Excellent, now we have a dataset. The first thing to do is to rank teams based on their performance in that year's NCAA March Madness tournament. This part of the calculation is rather subjective: I'm going to individually rank teams by how well they did in the tournament. I need to do this because two teams lose in the Final Four, four lose in the Elite Eight, and so on. How do we rank teams that went out in the same round? We could give them roughly the same ranking, which I will, but I'm also going to differentiate between a bad loss and a close game. So this isn't an exact science, but that's OK because the results will show how good my ranking system was.

In the following cells I select only the 64 teams in the NCAA tournament from the above list of every team in Division I college basketball, and I assign each team a ranking (my assigned ranking is in column 34).


In [80]:
reg_14_15 = reg_14_15.rename(columns={'Unnamed: 0': 'Number'})

#renaming the columns with integers so they can be more easily manipulated
reg_14_15.columns = list(range(34))

#creating a new dataframe with only the teams in the tournament
bracket_14_15=reg_14_15.iloc[[7,8,12,14,22,23,35,36,51,55,66,67,75,82,99,100,102,104,108,110,126,129,130,135,139,141,149,153,162,173,177,198,203,206,211,214,218,222,225,226,227,230,242,243,250,263,283,288,290,299,303,316,319,321,325,328,329,330,337,342,345,346,348,349],:]

In [81]:
newCol = [27,56,6,24,33,58,48,19,25,61,44,22,1,54,25,42,23,8,62,51,43,27,33,20,3,64,38,7,27,4,46,59,10,13,57,55,27,5,26,11,40,21,37,39,63,27,35,41,49,45,60,53,15,9,52,17,18,35,16,14,2,47,50,12]
newName = '34'

values = np.insert(bracket_14_15.values,bracket_14_15.shape[1],newCol,axis=1)
header = bracket_14_15.columns.values.tolist()
header.append(newName)

df = pd.DataFrame(values,columns=header)
df


Out[81]:
0 1 2 3 4 5 6 7 8 9 ... 25 26 27 28 29 30 31 32 33 34
0 7 Alabama-Birmingham 36 20 16 0.556 1.94 1.62 12 6 ... 715 0.738 434 1322 516 237 184 482 630 27
1 8 Albany (NY) 33 24 9 0.727 -0.34 -5.18 15 1 ... 650 0.758 334 1111 349 187 53 381 544 56
2 12 Arizona 38 34 4 0.895 24.33 7.41 16 2 ... 974 0.719 411 1399 528 275 131 419 681 6
3 14 Arkansas 36 27 9 0.75 14.07 6.79 13 5 ... 819 0.724 470 1290 579 277 168 430 684 24
4 22 Baylor 34 24 10 0.706 17.85 9.33 11 7 ... 723 0.674 497 1333 497 263 132 434 565 33
5 23 Belmont 33 22 11 0.667 0.35 -2.77 11 5 ... 585 0.692 307 1093 504 209 66 451 549 58
6 35 Buffalo 33 23 10 0.697 7.93 1.6 12 6 ... 819 0.719 420 1260 433 248 132 378 613 48
7 36 Butler 34 23 11 0.676 16.64 8.62 12 6 ... 767 0.686 408 1239 387 217 103 376 592 19
8 51 Cincinnati 34 23 11 0.676 10.72 4.4 13 5 ... 633 0.67 383 1178 394 222 185 432 507 25
9 55 Coastal Carolina 34 24 10 0.706 -1.02 -6.18 12 6 ... 780 0.69 447 1338 385 240 111 399 559 61
10 66 Davidson 32 24 8 0.75 11.97 3.19 14 4 ... 553 0.705 335 1144 547 171 93 303 601 44
11 67 Dayton 36 27 9 0.75 10.98 4.06 13 5 ... 846 0.686 255 1121 490 250 91 415 573 22
12 75 Duke 39 35 4 0.897 24.97 9.87 15 3 ... 873 0.699 445 1436 588 278 149 421 607 1
13 82 Eastern Washington 35 26 9 0.743 0.79 -3.84 14 4 ... 719 0.726 351 1183 454 219 167 376 624 54
14 99 Georgetown 33 22 11 0.667 15.48 9.81 12 6 ... 769 0.704 367 1152 427 251 156 415 668 25
15 100 Georgia 33 21 12 0.636 12.42 8.6 11 7 ... 826 0.692 360 1260 411 183 151 420 608 42
16 102 Georgia State 35 25 10 0.714 5.47 -2.03 15 5 ... 719 0.725 322 1116 479 321 143 366 642 23
17 104 Gonzaga 38 35 3 0.921 21.12 5.12 17 1 ... 806 0.691 391 1441 620 226 126 397 637 8
18 108 Hampton 35 17 18 0.486 -8.25 -7.08 8 8 ... 859 0.652 416 1292 372 235 141 494 752 62
19 110 Harvard 30 22 8 0.733 4.47 -0.25 12 3 ... 609 0.719 316 1025 384 213 135 349 472 51
20 126 Indiana 34 20 14 0.588 12.94 7.2 9 9 ... 651 0.719 407 1224 473 170 108 392 606 43
21 129 Iowa 34 22 12 0.647 15.83 8.39 12 6 ... 723 0.744 422 1274 482 218 158 377 536 27
22 130 Iowa State 34 25 9 0.735 17.94 9.15 12 6 ... 721 0.696 323 1221 546 218 133 375 527 33
23 135 Kansas 36 27 9 0.75 18.47 12.13 13 5 ... 842 0.724 426 1364 475 237 175 463 639 20
24 139 Kentucky 39 38 1 0.974 28.72 8.67 18 0 ... 943 0.726 491 1482 553 253 268 408 665 3
25 141 Lafayette 33 20 13 0.606 -3.37 -3.12 9 9 ... 549 0.767 282 1017 514 181 83 366 515 64
26 149 Louisiana State 33 22 11 0.667 11.39 5.63 11 7 ... 658 0.684 401 1294 516 240 204 476 516 38
27 153 Louisville 36 27 9 0.75 18.12 8.98 12 6 ... 792 0.668 454 1351 419 283 192 408 623 7
28 162 Maryland 35 28 7 0.8 13.32 7.63 14 4 ... 790 0.756 314 1223 376 169 151 431 584 27
29 173 Michigan State 39 27 12 0.692 17.46 10 12 6 ... 714 0.632 436 1458 646 203 176 441 738 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
34 211 North Dakota State 33 23 10 0.697 -1.72 -3.63 12 4 ... 614 0.692 274 1098 315 156 123 325 504 57
35 214 Northeastern 35 23 12 0.657 2.03 -1.34 12 6 ... 713 0.727 297 1160 500 183 129 483 496 55
36 218 Northern Iowa 35 31 4 0.886 12.54 2 16 2 ... 644 0.733 254 1071 413 207 84 363 508 27
37 222 Notre Dame 38 32 6 0.842 19.23 7.02 14 4 ... 756 0.742 313 1231 576 256 138 357 540 5
38 225 Ohio State 35 24 11 0.686 19.68 7.37 11 7 ... 693 0.68 387 1241 534 266 178 396 563 26
39 226 Oklahoma 35 24 11 0.686 18.65 9.94 12 6 ... 668 0.741 389 1332 430 237 174 419 570 11
40 227 Oklahoma State 32 18 14 0.563 12.95 9.63 8 10 ... 645 0.726 287 1039 366 253 156 392 614 40
41 230 Oregon 36 26 10 0.722 11.43 7.65 13 5 ... 613 0.76 372 1298 508 207 157 421 655 21
42 242 Providence 34 22 12 0.647 13.71 9.5 11 7 ... 768 0.707 411 1228 474 242 155 407 628 37
43 243 Purdue 34 21 13 0.618 12.93 7.64 12 6 ... 767 0.683 409 1249 505 185 177 444 642 39
44 250 Robert Morris 35 20 15 0.571 -3.02 -3.5 12 6 ... 684 0.709 367 1162 428 300 100 467 650 63
45 263 San Diego State 36 27 9 0.75 11.81 4.24 14 4 ... 694 0.63 439 1270 377 247 179 410 506 27
46 283 Southern Methodist 34 27 7 0.794 12.94 3.57 15 3 ... 770 0.703 388 1244 524 237 146 426 551 35
47 288 St. John's (NY) 33 21 12 0.636 11.85 9.04 10 8 ... 687 0.691 336 1164 408 246 210 352 587 41
48 290 Stephen F. Austin 34 29 5 0.853 8.61 -4.2 17 1 ... 799 0.736 413 1176 596 259 120 476 734 49
49 299 Texas 34 20 14 0.588 16.16 9.07 8 10 ... 671 0.724 422 1351 442 128 265 435 555 45
50 303 Texas Southern 35 22 13 0.629 -5.98 -6.09 16 2 ... 861 0.661 394 1193 375 214 109 454 609 60
51 316 UC-Irvine 34 21 13 0.618 3.63 0.45 11 5 ... 541 0.684 329 1198 480 194 168 396 578 53
52 319 UCLA 36 22 14 0.611 12.07 8.27 11 7 ... 750 0.676 440 1367 498 238 139 430 645 15
53 321 Utah 35 26 9 0.743 20.98 7.42 13 5 ... 760 0.703 322 1204 489 208 175 407 581 9
54 325 Valparaiso 34 28 6 0.824 6.29 -2 13 3 ... 676 0.675 397 1276 443 203 157 422 563 52
55 328 Villanova 36 33 3 0.917 23.13 7.55 16 2 ... 815 0.731 371 1235 570 278 146 382 596 17
56 329 Virginia 34 30 4 0.882 21.73 7.81 16 2 ... 614 0.726 358 1232 430 186 143 316 484 18
57 330 Virginia Commonwealth 36 26 10 0.722 12.79 6.1 12 6 ... 797 0.657 444 1277 456 342 153 384 690 35
58 337 West Virginia 35 25 10 0.714 15.12 9.29 11 7 ... 863 0.662 575 1272 494 376 96 453 821 16
59 342 Wichita State 35 30 5 0.857 15.56 3.68 17 1 ... 715 0.701 415 1245 476 250 134 327 587 14
60 345 Wisconsin 40 36 4 0.9 24.61 10.29 16 2 ... 765 0.763 383 1338 504 174 127 296 501 2
61 346 Wofford 35 28 7 0.8 2.56 -2.98 16 2 ... 688 0.688 324 1139 461 228 87 379 589 47
62 348 Wyoming 35 25 10 0.714 3.81 -0.04 11 7 ... 638 0.702 216 1047 501 191 102 392 481 50
63 349 Xavier 37 23 14 0.622 15.42 9.48 9 9 ... 775 0.73 364 1282 596 223 106 447 661 12

64 rows × 35 columns

Now, for easier manipulation, we're going to convert the dataframe into a numpy array. Then we'll divide each value in the array by the total number of games that team played, ensuring we have 'per game' statistics.


In [ ]:
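#per-game statistics: columns 3..33 of df divided by games played (column 2)
#only columns 0..30 of mat are filled; column 31 stays zero
mat = np.zeros((64,32))

for j in range (0,64,1):
    for i in range(3,34,1):
        val = float(df.iat[j,i])/float(df.iat[j,2])
        mat[j,i-3]=val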
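
The per-game conversion can also be written without explicit loops. This is a sketch on a made-up two-team table, assuming the same layout as `df` (games played in column 2, season totals from column 3 onward); the team names and the variable names `toy`, `games`, and `totals` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df: values are strings, as they are when the header row is read as data.
toy = pd.DataFrame([
    [7, "Team A", "36", "20", "16"],
    [8, "Team B", "33", "24", "9"],
])

games = toy.iloc[:, 2].astype(float).to_numpy()      # games played per team
totals = toy.iloc[:, 3:].astype(float).to_numpy()    # season totals per team

# Broadcasting divides every row by that team's games played.
per_game = totals / games[:, None]
print(per_game)
```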

Next we're going to begin the regression. First, we define a column vector $\textbf{Y}$ for our regression such that

$$ \textbf{Y} =\textbf{X} * \textbf{b}$$

where $\textbf{Y}$ holds our ratings, $\textbf{X}$ is a matrix of our data points (each row represents the statistics for a single team), and $\textbf{b}$ holds our coefficients. I'm going to assume a linear relationship for now. I can play around with non-linear regressions later; for now we just want to get values, and afterwards we can figure out whether the regression is any good.


In [ ]:
#creating our y matrix
ratings = np.zeros((64,1))
for j in range(0,64,1):
    val = 64 - float(df.iat[j,34])
    ratings[j] = val

Since we only want to use the statistics that are correlated with the ratings, we run a Spearman rank-correlation test on every statistic and keep only those whose p-value is below our alpha level of $0.05$. These statistics then form our $\textbf{X}$ matrix. Next we use the "linalg.lstsq" function to perform a least-squares regression on our data. Finally, I'll compute our predicted rankings by multiplying the $\textbf{X}$ and $\textbf{b}$ matrices.


In [72]:
#keep the statistics whose Spearman p-value against the ratings is below 0.05
coeffs = []
for i in range(0,32,1):
    results = ss.spearmanr(mat[:,i],ratings)
    if results[1] < .05:
        coeffs.append(i)

#one row per selected statistic (shape: number of statistics x 64)
xmat = []
for i in coeffs:
    xmat.append(mat[:,i])
x_mat = np.asarray(xmat)

#least-squares fit; result[0] holds the coefficients b
result = linalg.lstsq(np.transpose(x_mat),ratings)
npresult = np.asarray(result[0])

#predicted values for all 64 teams (shape: 1 x 64)
dot = np.dot(np.transpose(npresult),x_mat)

#replace negative predictions with 1 so that every value is positive
dotadjusted = np.zeros((1,64))
for i in range(0,64,1):
    if dot[0,i] < 0:
        dotadjusted[0,i] = 1
    else:
        dotadjusted[0,i] = dot[0,i]
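
To see the selection-and-fit step in isolation, here is a minimal sketch on synthetic data rather than the tournament statistics. The data, the seed, and the names `X`, `y`, and `keep` are assumptions made for illustration; only `scipy.stats.spearmanr` and `numpy.linalg.lstsq` are taken from the cell above.

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(0)

# Synthetic data: 64 "teams", 5 candidate statistics; only columns 0 and 2 drive the rating.
X = rng.normal(size=(64, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=64)

# Keep the columns whose Spearman p-value against y is below alpha.
alpha = 0.05
keep = [i for i in range(X.shape[1]) if ss.spearmanr(X[:, i], y).pvalue < alpha]

# Least-squares fit of y ~ X[:, keep] @ b (no intercept, as in the cell above).
b, residuals, rank, sv = np.linalg.lstsq(X[:, keep], y, rcond=None)
predicted = X[:, keep] @ b
print(keep)
print(b)
```

Because each unrelated column has roughly a 5% chance of passing the test by luck, `keep` may occasionally contain more than the two informative columns.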

Notice above that I had to make a cheeky and dubious adjustment: some of the predicted rankings came out negative, so to ensure that all rankings are positive (we'll need them positive to create our Markov chain), I change every negative ranking to 1. A higher ranking means a better team.
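
The same adjustment can be written in one line with `np.where`, mirroring the loop in the cell above (which assigns 1 to negative values). The numbers here are made up for illustration.

```python
import numpy as np

predicted = np.array([[12.5, -3.2, 0.4, 40.0]])  # made-up predicted values

# Negative entries become 1; everything else passes through unchanged.
adjusted = np.where(predicted < 0, 1.0, predicted)
print(adjusted)
```

Note that this is not the same as clipping at 1: a small positive value such as 0.4 is left alone.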

Alright, we now have an equation with 15 coefficients that predicts the ranking of a team based on its regular season stats. Now we are going to create a Markov Chain using these data!

Part 2: The Markov Chain

Let's play a game called the jumping particle.

Consider a particle that can jump between multiple different states. On each turn of the game, the particle has a probability of jumping to another state or remaining in the current state. This group of states represents a Markov chain. The probability that a particle jumps to any particular state is written in the form of a "transition probability matrix." For example, consider a 2-state Markov Chain with states 0 and 1:

$$P = \left[ \begin{array}{cc} 0.4 & 0.6\\ 0.7 & 0.3\end{array}\right] $$

In this case, the probability that a particle in state 0 on turn 1 jumps to state 1 on turn 2 is 0.6, and the probability it stays in state 0 is 0.4. Likewise, the probability that a particle in state 1 on turn 1 jumps to state 0 on turn 2 is 0.7 while the probability that it stays in state 1 is 0.3. Notice that each row sums to 1. This makes intuitive sense; the probability that the particle either jumps or stays must add to 1. It turns out that Markov Chains have lots of nice properties that we can exploit. First, however, we have to construct our transition probability matrix for our bracket.
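
The two-state example is small enough to check by hand, but a short sketch makes the row-sum property concrete and shows that multi-step probabilities come from matrix powers. Nothing here is specific to the bracket; `P2` is just an illustrative name.

```python
import numpy as np

P = np.array([[0.4, 0.6],
              [0.7, 0.3]])

# Each row is a probability distribution over the next state.
print(P.sum(axis=1))

# Two-step transition probabilities are the entries of P @ P.
P2 = P @ P
print(P2)
```

For instance, a particle that starts in state 0 is back in state 0 two turns later with probability $0.4 \cdot 0.4 + 0.6 \cdot 0.7 = 0.58$.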

Let's use our ranking system. Adopting a method suggested in Kvam et al., we can define

$$p_{i,j}= \frac{r_j}{r_i+r_j}$$

and $$p_{i,i} = \sum_{j = 1, j \neq i}^{64}\frac{r_i}{r_i+r_j}$$ where $r_i$ represents the ranking of team $i$ and $r_j$ the ranking of team $j$. Notice, however, that there is an issue: the entries in a row do not sum to 1. In fact,

$$ p_{i,1} + p_{i,2} + ... + p_{i,i-1} + p_{i,i+1} + ... + p_{i,64} + p_{i,i} = $$$$ \frac{r_1}{r_i+r_1} + \frac{r_2}{r_i+r_2} + ... + \frac{r_{i-1}}{r_i+r_{i-1}} + \frac{r_{i+1}}{r_i+r_{i+1}} + ... + \frac{r_{64}}{r_i+r_{64}} + (\frac{r_i}{r_i+r_1} + ... + \frac{r_i}{r_i+r_{i-1}} + \frac{r_i}{r_i+r_{i+1}} + ... + \frac{r_i}{r_i+r_{64}}) = $$$$ \frac{r_i + r_1}{r_i+r_1} + ...\frac{r_i + r_{i-1}}{r_i+r_{i-1}} + \frac{r_i+r_{i+1}}{r_i+r_{i+1}} + ... + \frac{r_i+r_{64}}{r_i+r_{64}} = 63(1) = 63 $$

So if we normalize by $\frac{1}{63}$ we should get rows that sum to 1. Now let's write the matrix.
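
The row-sum argument can be checked numerically. The sketch below builds the matrix exactly as the formulas define it (diagonal sum over $j \neq i$ only) using made-up positive ratings; the names `ratio`, `off_diag`, and `diag` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
r = rng.uniform(1.0, 64.0, size=64)   # made-up positive ratings

# ratio[i, j] = r_j / (r_i + r_j)
ratio = r[None, :] / (r[:, None] + r[None, :])

# Off-diagonal entries p_ij; the j == i positions are cleared.
off_diag = ratio.copy()
np.fill_diagonal(off_diag, 0.0)

# p_ii = sum over j != i of r_i / (r_i + r_j) = sum over j != i of (1 - ratio[i, j]).
# The full row sum of (1 - ratio) includes the j == i term, which is 1/2, so remove it.
diag = (1.0 - ratio).sum(axis=1) - 0.5

unnormalized = off_diag + np.diag(diag)
P = unnormalized / 63.0

print(unnormalized.sum(axis=1)[:3])   # each row sums to 63 before normalizing
print(P.sum(axis=1)[:3])              # and to 1 afterwards
```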


In [ ]:
brac2015 = np.zeros((64,64))

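def brac(i):
    #diagonal entry p_ii; the sum runs over every j, including j == i,
    #and the j == i term contributes exactly 1/2
    a=0
    for j in range(0,64,1):
        a = a + dotadjusted[0,i]/(dotadjusted[0,i]+dotadjusted[0,j])
    return 1/(64*.9921875)*a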
        
        
for i in range(0,64,1):
    for j in range(0,64,1):
        if i != j:
            brac2015[i,j] = 1/(64*.9921875) * dotadjusted[0,j]/(dotadjusted[0,i] + dotadjusted[0,j])
        if i == j:
            brac2015[i,i] = brac(i)

brac2015transpose = np.transpose(brac2015)

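The rows don't add to one unless we use the normalization factor $\frac{1}{64 * 0.9921875} = \frac{1}{63.5}$ rather than $\frac{1}{63}$. The reason is that the function brac sums over all 64 values of $j$, including $j = i$, and that extra term contributes $\frac{r_i}{2r_i} = \frac{1}{2}$ to each diagonal entry, so every unnormalized row sums to 63.5. No biggie.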
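
The factor $64 * 0.9921875 = 63.5$ can be checked numerically. This sketch uses made-up ratings and, like `brac` above, lets the diagonal sum run over every $j$ including $j = i$; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
r = rng.uniform(1.0, 64.0, size=64)   # made-up positive ratings

i = 0
# Off-diagonal terms of row i: r_j / (r_i + r_j) for j != i.
off_diagonal = sum(r[j] / (r[i] + r[j]) for j in range(64) if j != i)
# Diagonal term summed over all j, including j == i (which adds r_i / (2 r_i) = 1/2).
diagonal_all_j = sum(r[i] / (r[i] + r[j]) for j in range(64))

row_total = off_diagonal + diagonal_all_j
print(row_total, 64 * 0.9921875)
```

Both printed numbers agree, so dividing by 63.5 is exactly what makes each row sum to 1 in this construction.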

This is a special type of Markov chain: because none of the values in the transition matrix are 1 or 0, it's possible to go from any state to any other state in a single step. We call this a regular Markov chain. In fact, this Markov Chain is regular, aperiodic, and irreducible. The special property of such a Markov chain is that it has a limiting probability distribution. This means that if we let the Markov process run for infinitely many steps (i.e. the particle randomly goes from state 0 to state 1 to state 7 to state 32, and so on forever), there is a fixed probability that the particle will be in any given state at time infinity. The limiting distribution satisfies this equation:

$$ \pi* \textbf{P} = \pi$$

where $\pi$ is the limiting distribution and $\textbf{P}$ is the transition probability matrix we constructed. Notice that $\pi$ is a 64-dimensional vector in our case.

We can use these limiting distributions! If we rank teams by their limiting distribution probabilities, we should be able to see which teams will be the most likely to win the tournament.

The other equation of importance is $$ \pi_1 + ... + \pi_{64} = 1$$ where $\pi = \langle\pi_1,\pi_2,...,\pi_{64}\rangle$, which makes sense, since the particle must be in $\textit{some}$ state at time infinity (note: $\pi_i$ is the probability that the particle will be in state $i$ at time infinity).

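The 64 equations in $\pi \textbf{P} = \pi$ are linearly dependent, so we replace one of them with the normalization condition. That leaves 64 independent equations for 64 unknowns (the $\pi_i$).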
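
Here is the same procedure on the two-state chain from earlier, as a sketch: write $\pi \textbf{P} = \pi$ as $(\textbf{P}^T - \textbf{I})\pi^T = 0$, overwrite the last equation with $\pi_0 + \pi_1 = 1$, and solve. The names `A` and `b` are illustrative.

```python
import numpy as np

P = np.array([[0.4, 0.6],
              [0.7, 0.3]])
n = P.shape[0]

# pi P = pi  <=>  (P^T - I) pi^T = 0; replace the last equation with sum(pi) = 1.
A = P.T - np.eye(n)
A[-1, :] = 1.0
b = np.zeros(n)
b[-1] = 1.0

pi = np.linalg.solve(A, b)
print(pi)

# Cross-check: every row of a high power of P approaches pi.
print(np.linalg.matrix_power(P, 50))
```

By hand, $0.6\,\pi_0 = 0.7\,\pi_1$ together with $\pi_0 + \pi_1 = 1$ gives $\pi = \langle 7/13, 6/13 \rangle$, which matches the solver.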


In [ ]:
#Build the linear system (P^T - I) pi = 0, then replace the last equation
#with the second boundary condition, pi_1 + ... + pi_64 = 1.
brac2015eq = brac2015transpose - np.eye(64)
brac2015eq[63,:] = 1

b = np.zeros((64,1))
b[63,0] = 1

#solve once for the whole steady-state vector
steady = np.linalg.solve(brac2015eq,b)[:,0]
names = [df.iat[i,1] for i in range(0,64,1)]

predictions = pd.DataFrame({ 'Team Name' : pd.Series(names),
                             'Steady State Probability' : pd.Series(steady)})
finalpredictions = predictions.sort_values(by = 'Steady State Probability')
print(finalpredictions.tail())

Now the interesting part!!! We get to apply this to new data sets. Because I'm still salty about how poorly my bracket did this year (my beloved MSU Spartans fell in the first round...), let's see whether this rating scheme works for the 2016 March Madness bracket. First, we need to load a new data set.


In [65]:
reg_15_16 = pd.read_csv('2015_2016 Regular Season Stats.csv')
reg_15_16.head()


Out[65]:
Unnamed: 0 Unnamed: 1 Overall Overall.1 Overall.2 Overall.3 Overall.4 Overall.5 Conf. Conf..1 ... School Totals.7 School Totals.8 School Totals.9 School Totals.10 School Totals.11 School Totals.12 School Totals.13 School Totals.14 School Totals.15 School Totals.16
0 Rk School ▴ G W L W-L% SRS SOS W L ... FT FTA FT% ORB TRB AST STL BLK TOV PF
1 1 Abilene Christian 31 13 18 0.419 -13.93 -7.53 8 10 ... 458 653 0.701 269 991 422 233 78 418 681
2 2 Air Force 32 14 18 0.438 -5.51 0.66 5 13 ... 483 707 0.683 283 1107 420 196 58 422 617
3 3 Akron 35 26 9 0.743 5.55 -1.24 13 5 ... 451 659 0.684 373 1310 507 190 137 378 618
4 4 Alabama 33 18 15 0.545 7.14 8.62 8 10 ... 458 712 0.643 327 1141 347 192 175 438 631

5 rows × 34 columns

Luckily for me I don't have to create rankings for this set; I can just plug in the regular season stats of the 64 teams in the bracket and see what the program predicts. So let's do that!!!


In [93]:
#teams = the 64 teams in the bracket that year. bracket = the associated data.
def predictor(regseasonstats,teams,vars,coefficients):
    '''This function takes in multiple different constraints and outputs the 16 teams most likely to win the NCAA
    tournament and their steady-state probabilities. Inputs:
    regseasonstats = uploaded CSV file containing statistics for all teams as a Pandas Dataframe
    teams = a list of the numerical indices associated with the 64 teams in the NCAA bracket that year
    vars = the numerical values of the column headers of the variables desired to use in the regression
    coefficients = the associated coefficients for each variable.'''
    
    #work on a copy so the caller's dataframe keeps its column names
    regseasonstats = regseasonstats.copy()
    regseasonstats.columns = list(range(34))
    bracket = regseasonstats.iloc[teams,:]

    #per-game statistics
    mat = np.zeros((64,32))
    for j in range (0,64,1):
        for i in range(3,34,1):
            val = float(bracket.iat[j,i])/float(bracket.iat[j,2])
            mat[j,i-3]=val

    #predicted value for each team from the selected statistics
    xmat = []
    for i in vars:
        xmat.append(mat[:,i])
    x_mat = np.asarray(xmat)
    npresult = np.asarray(coefficients)
    dot = np.dot(np.transpose(npresult),x_mat)

    #replace negative predictions with 1
    dotadjusted = np.zeros((1,64))
    for i in range(0,64,1):
        if dot[0,i] < 0:
            dotadjusted[0,i] = 1
        else:
            dotadjusted[0,i] = dot[0,i]
    
    #Making the Markov transition matrix
    brac2015 = np.zeros((64,64))
    def brac(i):
        #diagonal entry; the sum includes j == i, which contributes 1/2
        a=0
        for j in range(0,64,1):
            a = a + dotadjusted[0,i]/(dotadjusted[0,i]+dotadjusted[0,j])
        return 1/(64*.9921875)*a
    for i in range(0,64,1):
        for j in range(0,64,1):
            if i != j:
                brac2015[i,j] = 1/(64*.9921875) * dotadjusted[0,j]/(dotadjusted[0,i] + dotadjusted[0,j])
            if i == j:
                brac2015[i,i] = brac(i)

    #Solve (P^T - I) pi = 0 with the last equation replaced by sum(pi) = 1
    brac2015eq = np.transpose(brac2015) - np.eye(64)
    brac2015eq[63,:] = 1
    b = np.zeros((64,1))
    b[63,0] = 1
    probability = np.linalg.solve(brac2015eq,b)[:,0]
    teamname = [bracket.iat[i,1] for i in range(0,64,1)]

    predictions = pd.DataFrame({ 'Team Name' : pd.Series(teamname),
                                 'Steady State Probability' : pd.Series(probability)})
    finalpredictions = predictions.sort_values(by = 'Steady State Probability')
    #the 16 teams with the highest steady-state probability
    return finalpredictions[48:64]
                        
#Row indices (in reg_15_16) of the 64 teams in the 2016 bracket
teams2016 = [12,16,20,22,35,36,38,49,51,58,61,67,75,90,94,104,107,108,111,114,126,128,129,130,135,139,162,170,172
             ,173,174,203,207,209,218,222,226,230,231,236,242,243,256,269,276,281,290,292,293,294,299,300,305,320,
             321,328,329,330,336,337,342,345,349,350]
#run the model on the 2016 regular-season statistics

predictor(reg_15_16,teams2016,coeffs,result[0])


Out[93]:
Steady State Probability Team Name
20 0.022086 Indiana
25 0.022277 Kentucky
50 0.022471 Texas
23 0.022951 Iowa State
12 0.023418 Duke
51 0.023475 Texas A&M
35 0.023880 Notre Dame
10 0.024021 Connecticut
59 0.025424 West Virginia
37 0.025979 Oregon
57 0.026635 Virginia Commonwealth
24 0.026795 Kansas
60 0.027720 Wichita State
56 0.028821 Virginia
31 0.030500 North Carolina
55 0.033130 Villanova

This predictor correctly predicted:

  • The winner of the tournament

  • The runner up of the tournament

$\frac{2}{4}$ final four

$\frac{5}{8}$ elite 8

$\frac{10}{16}$ sweet 16.
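
One way to arrive at counts like these is to take the top $k$ teams by steady-state probability and intersect them with the teams that actually reached that round. The sketch below assumes that scoring rule, which the notebook does not spell out. The ranking is copied from Out[93]; the actual 2016 Final Four and Elite Eight are filled in from memory of the tournament rather than from this notebook's data.

```python
# Steady-state ranking from Out[93], best team first (top 8 only).
ranked = ["Villanova", "North Carolina", "Virginia", "Wichita State",
          "Kansas", "Virginia Commonwealth", "Oregon", "West Virginia"]

# Actual 2016 results (assumption: recalled, not part of the notebook's data).
final_four = {"Villanova", "North Carolina", "Oklahoma", "Syracuse"}
elite_eight = final_four | {"Kansas", "Oregon", "Notre Dame", "Virginia"}

def hits(predicted, actual):
    """Count how many predicted teams appear in the actual set."""
    return len(set(predicted) & actual)

print(hits(ranked[:4], final_four))    # 2
print(hits(ranked[:8], elite_eight))   # 5
```

Under that rule the sketch reproduces the 2 of 4 and 5 of 8 figures listed above.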

I tried to find expert brackets on Sports Illustrated to compare with.

Seth Davis:

$\frac{0}{4}$ final four

$\frac{4}{8}$ elite 8

$\frac{8}{16}$ sweet 16

Pete Thamel:

$\frac{1}{4}$ final four

$\frac{4}{8}$ elite 8

$\frac{13}{16}$ sweet 16

Lindsay Schnell:

$\frac{0}{4}$ final four

$\frac{4}{8}$ elite 8

$\frac{8}{16}$ sweet 16

Those were just the first 3 I found, then I got lazy. My bracket is right there in the weeds with them; not better really, but definitely not worse. So this function took my predicting skills from those of your average basketball fan to those of an SI "expert" ;). All in all, my bracket predictions for the 2016 tournament are pretty good! But I suspect I may have gained an unfair advantage by fitting on the 2015 data alone: the 2015 and 2016 seasons are probably related, since players from the 2015 teams were still on the 2016 teams. So I need to test this on other years as well.

Between now and next Thursday I'll see whether I should rework the regression by analyzing the data more carefully, I'll test the bracket on another year or two, and I'll add more data to make my regression coefficients more accurate!

References

All datasets came from: http://www.sports-reference.com/cbb/ Sports Reference has kindly and meticulously compiled regular-season statistics on all college basketball teams.

The past brackets came from: http://www.printyourbrackets.com/

Alphabetized NCAA tournament team lists came from: http://www.cbssports.com/collegebasketball/eye-on-college-basketball/25515927/ncaa-bracket-tournament-committees-official-1-68-seed-list

Sports Illustrated expert brackets came from: http://www.si.com/college-basketball/2016/03/14/2016-ncaa-tournament-bracket-expert-picks#_

