Using integer programming and portfolio optimization to pick a March Madness bracket

In this notebook I explore the use of an integer program solver in Python to pick an optimal March Madness bracket.

I use the cvxopt Python interface to the glpk solver, both freely available. I also use pandas and numpy. Installing cvxopt with glpk support wasn't completely straightforward; see the appendix at the end for how I did it.

Scoring

The pool I was invited to had an interesting set of rules. You ignore the brackets and simply choose teams, but each team has a price and you have a limited budget:

Pick any set of teams you want as long as you stay within your $2000 budget. And all of your teams earn you points as long as they're still in the tournament.

0 points for play-in wins
1 point for every first round win
2 points for every second round win
3 points for every third round win
5 points for every quarterfinal win
8 points for every semifinal win
13 points for a finals win
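
Since every win a team racks up adds to your total, the value of surviving $r$ rounds is the cumulative sum of the list above; a champion is worth 32 points:

```python
import numpy as np

# Points earned for winning in each round, play-in through the final.
round_points = [0, 1, 2, 3, 5, 8, 13]

# A team that survives r rounds earns the cumulative total.
print(list(np.cumsum(round_points)))  # [0, 1, 3, 6, 11, 19, 32]
```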

Pricing structure:

1 seeds: $500
2 seeds: $300
3 seeds: $225
4 seeds: $175
5 seeds: $125
6 seeds: $125
7 seeds: $95
8 seeds: $85
9 seeds: $60
10 seeds: $65
11 seeds: $60
12 seeds: $55
13 seeds: $25
14 seeds: $20
15 seeds: $5
16 seeds: $1

I'm not sure if the price for 9 seeds ($60, cheaper than the $65 for 10 seeds) was a typo.

Data

fivethirtyeight.com (Nate Silver's website) has a model for the tournament and provided data on the output of that model. The data includes the predicted probability that each team will win in each round, which is what we're after. I selected the most recent data set after all the play-in games had completed.


In [7]:
import numpy as np
import pandas as pd

data = pd.read_csv('bracket-05.tsv', sep='\t')
data = data.\
    query('rd1_win > 0').\
    rename(columns=dict(rd1_win=1, rd2_win=2, rd3_win=3, rd4_win=4, rd5_win=5, rd6_win=6, rd7_win=7))\
    [['team_name', 'team_seed', 1, 2, 3, 4, 5, 6, 7]]
data.head()


Out[7]:
team_name team_seed 1 2 3 4 5 6 7
0 Kentucky 1 1 0.998983 0.940664 0.855849 0.732191 5.351611e-01 4.131957e-01
2 Hampton 16b 1 0.001017 0.000103 0.000012 0.000001 5.387219e-08 4.307362e-09
3 Cincinnati 8 1 0.536887 0.034189 0.017280 0.007178 2.065072e-03 7.358615e-04
4 Purdue 9 1 0.463113 0.025044 0.012093 0.004774 1.382256e-03 4.958285e-04
5 West Virginia 5 1 0.682463 0.392345 0.052872 0.025248 7.441731e-03 2.693933e-03

The numbered columns represent the probability that a team will win in that round of the tournament. Winning in a round of course requires winning all previous rounds, so the numbers never increase from left to right.

I found it useful to think about the score you get from each team as a discrete random variable. In this case we need to change the above probabilities so that they represent the chance of winning exactly that number of games, so that for each team the sum of probabilities is 1.


In [8]:
# Convert the cumulative win probabilities into the probability of winning
# exactly k games, so that each team's row sums to 1.
data[8] = 0
for col in range(8, 1, -1):
    data[col] = data[col-1] - data[col]
data = data.drop(labels=1, axis=1)  # column 1 is all ones now that the play-ins are done
data = data.rename(columns=dict(zip(range(2, 9), range(7))))
data.head()


Out[8]:
team_name team_seed 0 1 2 3 4 5 6
0 Kentucky 1 0.001017 0.058320 0.084815 0.123658 0.197030 1.219654e-01 4.131957e-01
2 Hampton 16b 0.998983 0.000913 0.000091 0.000011 0.000001 4.956483e-08 4.307362e-09
3 Cincinnati 8 0.463113 0.502698 0.016909 0.010102 0.005113 1.329211e-03 7.358615e-04
4 Purdue 9 0.536887 0.438069 0.012951 0.007319 0.003392 8.864273e-04 4.958285e-04
5 West Virginia 5 0.317537 0.290118 0.339473 0.027624 0.017807 4.747797e-03 2.693933e-03

Now we set up the data for the scoring rules of the pool:


In [9]:
rounds = range(7)
scores = [0, 1, 2, 3, 5, 8, 13]
cumscores = np.cumsum(scores)
prices =   {'1': 500,
            '2': 300,
            '3': 225,
            '4': 175,
            '5': 125,
            '6': 125,
            '7': 95,
            '8': 85,
            '9': 60,
            '10': 65,
            '11': 60,
            '11a': 60,
            '11b': 60,
            '12': 55,
            '13': 25,
            '14': 20,
            '15': 5,
            '16': 1,
            '16a': 1,
            '16b': 1}
budget = 2000
n = len(data)
data['price'] = [prices[seed] for seed in data.team_seed]

A few quantities which we'll be interested in are:

  • The expected score for each team. This is the average of the number of points for winning each possible number of games (0 through 6), weighted by the probability of winning that number of games.
  • The variance of the score for each team.
  • What I call the efficiency of each team. This is simply the expected score divided by the price.
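
In symbols, writing $q_{i,r}$ for the probability that team $i$ wins exactly $r$ games (the columns computed above), $c_r$ for the cumulative score after $r$ wins, and $p_i$ for the price, these are:

$$ \mu_i = \sum_{r=0}^{6} q_{i,r}\, c_r, \qquad v_i = \sum_{r=0}^{6} q_{i,r}\, (c_r - \mu_i)^2, \qquad \mathrm{efficiency}_i = \mu_i / p_i $$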

In [10]:
def get_expected_score(team):
    return sum(team[r]*cumscores[r] for r in rounds)
data['expected_score'] = data.apply(get_expected_score, axis=1)

def get_variance(team):
    return sum(team[r]*(cumscores[r]-team['expected_score'])**2 for r in rounds)
data['variance'] = data.apply(get_variance, axis=1)

data['efficiency'] = data.expected_score/data.price

In [11]:
cols = ['team_name', 'team_seed', 'price', 'expected_score', 'variance', 'efficiency']
data[cols].sort_values(by='efficiency', ascending=False).head()


Out[11]:
team_name team_seed price expected_score variance efficiency
0 Kentucky 1 500 18.761645 144.256509 0.037523
16 New Mexico State 15 5 0.166190 0.390123 0.033238
39 Utah 5 125 4.123377 27.834924 0.032987
32 Arizona 2 300 9.649058 76.986045 0.032164
36 Robert Morris 16b 1 0.027626 0.065626 0.027626

Choosing the optimal set of teams

Maximizing expected score

A naive approach to maximizing the expected score is to continue adding the most efficient remaining team to your list until you run out of budget. But this could leave you with a little bit of unused budget. You might be better off sacrificing a little bit of efficiency so that you can spend your whole budget and get a higher total expected score.

In fact this is an example of the knapsack problem, which is known to be NP-hard. There are specialized algorithms for the knapsack problem, but this instance is tiny, and I wanted to play with the freely available Python integer program solvers.
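
To see why greedy can come up short, here's a minimal sketch of the heuristic with made-up teams (name, price, expected score), not the real data:

```python
def greedy_pick(teams, budget):
    """Repeatedly take the most efficient affordable team until broke.

    teams: list of (name, price, expected_score) tuples.
    Returns the list of selected names; may leave budget unspent.
    """
    remaining, picked = budget, []
    # Highest expected score per dollar first.
    for name, price, ev in sorted(teams, key=lambda t: t[2] / t[1], reverse=True):
        if price <= remaining:
            picked.append(name)
            remaining -= price
    return picked

teams = [('A', 60, 6.0), ('B', 50, 4.5), ('C', 50, 4.5)]
print(greedy_pick(teams, 100))  # ['A'] -- 6.0 points with $40 stranded
```

Greedy grabs A (the most efficient team) and strands $40 for 6.0 expected points, while the optimal set B + C spends the whole budget for 9.0.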

You don't actually want that

But that isn't necessarily the best way to maximize either your chance of winning or your expected earnings from the pool, which are more likely to be what you actually want. The right way to do it probably involves understanding how large and diverse your competition is, and throwing in some unpopular picks without sacrificing much efficiency. I don't know how to do that though.

Add in a risk penalty

But even if your goal is something like "get a lot of points in this tournament", picking the set of teams with the largest EV isn't necessarily the right play. The max-EV-set could be very risky in that it's expected to get a bad score most of the time, but occasionally gets a huge score, so that on average the score is pretty high. You might be willing to sacrifice a bit of EV in exchange for reduced risk.

A method for doing this, borrowed from portfolio optimization, is to maximize the expected return minus a variance penalty. You can tune the variance penalty based on your risk tolerance, setting it to zero to recover the usual max-EV set.
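
As a toy illustration with made-up numbers, the penalized objective $\mu - \epsilon v$ can flip the preference between a risky team and a safe one:

```python
# Hypothetical (expected_score, variance) pairs -- not real teams.
safe, risky = (4.0, 10.0), (4.5, 40.0)

def penalized(team, eps):
    mu, v = team
    return mu - eps * v

# With no penalty the risky team wins on raw EV...
print(penalized(risky, 0) > penalized(safe, 0))        # True
# ...but at eps = 0.03 the safe team comes out ahead (3.7 vs 3.3).
print(penalized(safe, 0.03) > penalized(risky, 0.03))  # True
```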

Notation

The problem I'll solve can be written as follows:

$$ \max_x \left\{ \mu^T x - \epsilon v^T x : p^T x \leq B, \ x \in \{0,1\}^{64} \right\} $$

The optimization variable is $x$, which is a binary vector of length 64. Each element represents a team, with a 1 indicating that you select that team to be in your set.

$\mu$ is the vector of expected scores for each team, $v$ the variance, and $p$ the price. $B$ is the budget, 2000 in this case. $\epsilon$ is the risk penalty factor.

Solving the optimization problem

I'll use the integer linear program solver ilp from glpk. I use the cvxopt Python interface to glpk to access it.


In [12]:
from cvxopt import matrix
from cvxopt.glpk import ilp

In [16]:
def solve_binary_program(eps):
    """
    Uses the integer linear program solver ilp from glpk:

    (status, x) = ilp(c, G, h, A, b, I, B)

        minimize    c'*x
        subject to  G*x <= h
                    A*x = b
                    x[k] is integer for k in I
                    x[k] is binary for k in B

    c            nx1 dense 'd' matrix with n>=1
    G            mxn dense or sparse 'd' matrix with m>=1
    h            mx1 dense 'd' matrix
    A            pxn dense or sparse 'd' matrix with p>=0
    b            px1 dense 'd' matrix
    I            set of indices of integer variables
    B            set of indices of binary variables
    """
    c = matrix(np.asarray(data.expected_score - eps*data.variance, dtype=float))
    G = matrix(data.price.values[np.newaxis, :], tc='d')
    h = matrix(float(budget), tc='d')
    A = matrix(np.zeros((1, n)), tc='d')
    b = matrix(0.)
    I = set(range(n))
    B = set(range(n))

    (status, x) = ilp(-c, G, h, A, b, I, B)
    if status != 'optimal':
        raise RuntimeError('ilp returned status %r' % status)
    return x

Results

Notice how, as we increase the risk penalty, the optimal set of teams generally increases in size. And of course, both the expected score and the total variance decrease.


In [20]:
def solve_and_display(eps=0):
    x = solve_binary_program(eps)
    print('number of teams', sum(x))

    data['selected'] = list(x)
    expected_score = data[data.selected == 1].expected_score.sum()
    total_variance = data[data.selected == 1].variance.sum()
    print('expected score %.2f' % expected_score)
    print('total variance %.2f' % total_variance)
    return data\
        [data.selected == 1]\
        [['team_name', 'team_seed', 'price', 'expected_score', 'variance', 'efficiency']].\
        sort_values(by='price', ascending=False)

The maximum expected score solution:


In [21]:
solve_and_display()


number of teams 15.0
expected score 57.23
total variance 441.66
Out[21]:
team_name team_seed price expected_score variance efficiency
0 Kentucky 1 500 18.761645 144.256509 0.037523
51 Villanova 1 500 10.798966 90.892268 0.021598
32 Arizona 2 300 9.649058 76.986045 0.032164
66 Virginia 2 300 7.598104 71.553645 0.025327
39 Utah 5 125 4.123377 27.834924 0.032987
13 Wichita State 7 95 2.275967 11.880365 0.023958
31 Ohio State 10 65 1.350079 8.044640 0.020770
10 Texas 11 60 1.499421 7.395559 0.024990
8 Valparaiso 13 25 0.481423 1.126213 0.019257
29 Georgia State 14 20 0.453335 1.129402 0.022667
16 New Mexico State 15 5 0.166190 0.390123 0.033238
2 Hampton 16b 1 0.001266 0.002283 0.001266
18 Coastal Carolina 16 1 0.021691 0.049848 0.021691
36 Robert Morris 16b 1 0.027626 0.065626 0.027626
52 Lafayette 16 1 0.022346 0.056766 0.022346

With just a little bit of risk penalty, Villanova drops out of the optimal set. The extra 500 of budget is spent on Gonzaga, North Carolina, and UC Irvine, with only a tiny loss of expected score.

Either this set of teams or the next one might be good choices for the pool, since they expect to get close to the same score without quite as much variance.


In [22]:
solve_and_display(eps=.03)


number of teams 17.0
expected score 56.87
total variance 416.10
Out[22]:
team_name team_seed price expected_score variance efficiency
0 Kentucky 1 500 18.761645 144.256509 0.037523
32 Arizona 2 300 9.649058 76.986045 0.032164
49 Gonzaga 2 300 6.418647 44.254151 0.021395
66 Virginia 2 300 7.598104 71.553645 0.025327
23 North Carolina 4 175 3.598615 20.148501 0.020564
39 Utah 5 125 4.123377 27.834924 0.032987
13 Wichita State 7 95 2.275967 11.880365 0.023958
31 Ohio State 10 65 1.350079 8.044640 0.020770
10 Texas 11 60 1.499421 7.395559 0.024990
8 Valparaiso 13 25 0.481423 1.126213 0.019257
58 UC Irvine 13 25 0.421511 0.925401 0.016860
29 Georgia State 14 20 0.453335 1.129402 0.022667
16 New Mexico State 15 5 0.166190 0.390123 0.033238
18 Coastal Carolina 16 1 0.021691 0.049848 0.021691
2 Hampton 16b 1 0.001266 0.002283 0.001266
36 Robert Morris 16b 1 0.027626 0.065626 0.027626
52 Lafayette 16 1 0.022346 0.056766 0.022346

Increasing the risk penalty typically increases the number of teams in the optimal set, spreading the eggs across many baskets to mitigate risk. But the relationship isn't strictly monotone: here the count drops from 17 to 16 teams as $\epsilon$ increases from .03 to .05:


In [26]:
solve_and_display(eps=.05)


number of teams 16.0
expected score 54.39
total variance 364.04
Out[26]:
team_name team_seed price expected_score variance efficiency
0 Kentucky 1 500 18.761645 144.256509 0.037523
32 Arizona 2 300 9.649058 76.986045 0.032164
49 Gonzaga 2 300 6.418647 44.254151 0.021395
23 North Carolina 4 175 3.598615 20.148501 0.020564
39 Utah 5 125 4.123377 27.834924 0.032987
55 Northern Iowa 5 125 2.231097 10.160632 0.017849
13 Wichita State 7 95 2.275967 11.880365 0.023958
31 Ohio State 10 65 1.350079 8.044640 0.020770
10 Texas 11 60 1.499421 7.395559 0.024990
20 Oklahoma State 9 60 0.947926 2.718533 0.015799
27 Ole Miss 11b 60 0.927713 2.705765 0.015462
61 Dayton 11b 60 1.084642 4.081855 0.018077
8 Valparaiso 13 25 0.481423 1.126213 0.019257
58 UC Irvine 13 25 0.421511 0.925401 0.016860
29 Georgia State 14 20 0.453335 1.129402 0.022667
16 New Mexico State 15 5 0.166190 0.390123 0.033238

In [27]:
solve_and_display(eps=.1)


number of teams 26.0
expected score 45.79
total variance 268.87
Out[27]:
team_name team_seed price expected_score variance efficiency
0 Kentucky 1 500 18.761645 144.256509 0.037523
23 North Carolina 4 175 3.598615 20.148501 0.020564
21 Arkansas 5 125 1.669730 5.633805 0.013358
55 Northern Iowa 5 125 2.231097 10.160632 0.017849
5 West Virginia 5 125 1.846568 7.557176 0.014773
39 Utah 5 125 4.123377 27.834924 0.032987
13 Wichita State 7 95 2.275967 11.880365 0.023958
37 San Diego State 8 85 1.036467 4.378130 0.012194
31 Ohio State 10 65 1.350079 8.044640 0.020770
44 UCLA 11 60 0.716904 2.787515 0.011948
38 St. John's 9 60 0.600717 1.537617 0.010012
27 Ole Miss 11b 60 0.927713 2.705765 0.015462
20 Oklahoma State 9 60 0.947926 2.718533 0.015799
10 Texas 11 60 1.499421 7.395559 0.024990
4 Purdue 9 60 0.590854 1.707142 0.009848
61 Dayton 11b 60 1.084642 4.081855 0.018077
6 Buffalo 12 55 0.633698 1.796136 0.011522
8 Valparaiso 13 25 0.481423 1.126213 0.019257
42 Eastern Washington 13 25 0.301771 0.500441 0.012071
58 UC Irvine 13 25 0.421511 0.925401 0.016860
29 Georgia State 14 20 0.453335 1.129402 0.022667
16 New Mexico State 15 5 0.166190 0.390123 0.033238
2 Hampton 16b 1 0.001266 0.002283 0.001266
18 Coastal Carolina 16 1 0.021691 0.049848 0.021691
36 Robert Morris 16b 1 0.027626 0.065626 0.027626
52 Lafayette 16 1 0.022346 0.056766 0.022346

As we increase the risk penalty even further, the optimization problem no longer really suits our purpose. It becomes so afraid of risk that it spends far below the budget, choosing mostly terrible teams that are likely to lose in the first round, contributing very little uncertainty to our result, but also very little value.


In [47]:
solve_and_display(eps=.4)


number of teams 14.0
expected score 3.00
total variance 6.62
Out[47]:
team_name team_seed price expected_score variance efficiency
22 Wofford 12 55 0.332589 0.586285 0.006047
56 Wyoming 12 55 0.439177 1.026169 0.007985
8 Valparaiso 13 25 0.481423 1.126213 0.019257
42 Eastern Washington 13 25 0.301771 0.500441 0.012071
58 UC Irvine 13 25 0.421511 0.925401 0.016860
12 Northeastern 14 20 0.153953 0.360553 0.007698
29 Georgia State 14 20 0.453335 1.129402 0.022667
63 Albany 14 20 0.149620 0.355647 0.007481
16 New Mexico State 15 5 0.166190 0.390123 0.033238
33 Texas Southern 15 5 0.008298 0.014673 0.001660
50 North Dakota State 15 5 0.038325 0.090304 0.007665
2 Hampton 16b 1 0.001266 0.002283 0.001266
18 Coastal Carolina 16 1 0.021691 0.049848 0.021691
36 Robert Morris 16b 1 0.027626 0.065626 0.027626

In [48]:
import matplotlib.pyplot as plt
%matplotlib inline

In [53]:
f = lambda eps: sum(solve_binary_program(eps))
eps = np.linspace(0, .75)
num_teams = [f(e) for e in eps]
plt.plot(eps, num_teams)
plt.ylim(0, 40)
plt.xlabel(r'Risk penalty $\epsilon$')
plt.ylabel('Optimal number of teams');



Appendix

Installing cvxopt and glpk

I think this should do the trick.

Ubuntu

sudo apt-get install libglpk-dev

export CVXOPT_BUILD_GLPK=1
export CVXOPT_GLPK_LIB_DIR=/usr/lib/x86_64-linux-gnu/
export CVXOPT_GLPK_INC_DIR=/usr/include/
pip install cvxopt

python
> from cvxopt.glpk import ilp

Mac

brew install homebrew/science/glpk

export CVXOPT_BUILD_GLPK=1
export CVXOPT_GLPK_LIB_DIR=/usr/local/Cellar/glpk/4.52/lib
export CVXOPT_GLPK_INC_DIR=/usr/local/Cellar/glpk/4.52/include
pip install cvxopt

python
> from cvxopt.glpk import ilp