Before running this notebook, run the 2 cells containing supporting functions at the bottom.

Daily Fantasy Sports and Quantitative Finance

A case study on using quantitative finance tools to build daily fantasy sports lineups.

Daily fantasy sports (DFS) have emerged in recent years as a popular skill game. Usually, games challenge participants to pick a team of players (called a 'lineup') in a particular sport on a given day with the objective of getting the most game points on that day. In the classic DFS format, players are assigned a fictional 'salary' or cost and the DFS participant has to pick a roster of players who fit under a total cost constraint, commonly referred to as a 'salary cap'. Players are ranked at the end of the day based on a sport-dependent scoring function.

In this notebook, we will take a look at how quantitative finance tools can be used to study daily fantasy sports data and build an optimal lineup using convex optimization. Specifically, we will take a look at the NBA (basketball) game on DraftKings. The data used in this notebooks is obtained from https://stats.nba.com and DraftKings (DK). The tools used are a mix of tools from the Quantopian API as well as other Python libraries.

Load and Format Data

The first step to most quantitative problems is to get data. In this study, the data was collected and uploaded to the Quantopian research environment as a .csv file. Let's load it into a pandas DataFrame using local_csv (a Quantopian-specific function).


In [2]:
import pandas as pd

In [3]:
df = local_csv('nba_data.csv')

In [4]:
df['game_datetime'] = pd.to_datetime(df['game_date'])

In [5]:
df = df.set_index(['game_datetime', 'player_id'])

It is important to know what our data looks like, so here is a preview of the first few rows of our DataFrame.


In [6]:
df.head(3)


Out[6]:
game_id season_id game_date matchup wl min fgm fga fg_pct fg3m ... height weight birth_date age exp school dk_score dk_salary dk_position dk_position_codes
game_datetime player_id
2018-04-11 1713 21701230 22017 2018-04-11 SAC vs. HOU W 25 5 8 0.625 2 ... 6-6 220 JAN 26, 1977 41.0 19 North Carolina 17.75 NaN NaN NaN
2018-04-09 1713 21701210 22017 2018-04-09 SAC @ SAS L 17 1 2 0.500 1 ... 6-6 220 JAN 26, 1977 41.0 19 North Carolina 14.50 NaN NaN NaN
2018-04-06 1713 21701188 22017 2018-04-06 SAC @ MEM W 13 0 4 0.000 0 ... 6-6 220 JAN 26, 1977 41.0 19 North Carolina 9.75 3400.0 SG/SF [0, 1, 0, 1, 0, 1, 1, 1]

3 rows × 42 columns

And here is a look at a single row where we can see all of the columns.


In [7]:
df.iloc[10]


Out[7]:
game_id                              21701057
season_id                               22017
game_date                          2018-03-19
matchup                           SAC vs. DET
wl                                          L
min                                        17
fgm                                         3
fga                                         6
fg_pct                                    0.5
fg3m                                        1
fg3a                                        4
fg3_pct                                  0.25
ftm                                         0
fta                                         0
ft_pct                                      0
oreb                                        1
dreb                                        0
reb                                         1
ast                                         0
stl                                         2
blk                                         1
tov                                         2
pf                                          3
pts                                         7
plus_minus                                 -3
video_available                             1
team_id                            1610612758
season                                   2017
league_id                                   0
player                           Vince Carter
num                                        15
position                                  G-F
height                                    6-6
weight                                    220
birth_date                       JAN 26, 1977
age                                        41
exp                                        19
school                         North Carolina
dk_score                                13.75
dk_salary                                3300
dk_position                             SG/SF
dk_position_codes    [0, 1, 0, 1, 0, 1, 1, 1]
Name: (2018-03-19 00:00:00, 1713), dtype: object

Alphalens (Factor Analysis)

Alphalens is a tool on Quantopian for analyzing the predictve ability of a factor. In this notebook, we will use Alphalens to analyze the predictive ability of a fantasy points per game factor.

The first step to analzing our factor is to define it. Here, we will define the fantasy points per game factor, trailing_20_dk_score to be the rolling mean fantasy points per game (using DK scoring rules) over the last 20 days.


In [8]:
# Rolling 20 day mean fantasy score per game.
trailing_20_dk_score = df['dk_score'].unstack().shift(1).rolling(20, min_periods=5).mean().stack()

In [14]:
trailing_20_dk_score.head()


Out[14]:
game_datetime  player_id
2017-10-26     1717         20.10
               2037         20.70
               2199         22.40
               2207         15.50
               2544         55.95
dtype: float64

And then we need to define our objective metric, or 'what we want to maximize'. In finance, we want to maximize portfolio returns. In DraftKings DFS, we want to maximize our DraftKings score for the current day.


In [11]:
# DK scores.
scores = df['dk_score'].unstack()

Next, we need to import Alphalens so that we can use it to analyze our factors.


In [12]:
import alphalens as al

Before using Alphalens, we need to format our data in a way that it expects. Let's start with our points factor. The below function creates a DataFrame using our points factor and our DK scores. It also specifies periods to be 1, meaning we only care about the performance of the player on one day. The data will be categorized into 5 'quantiles' to remove some of the noise from the relationship.

Now, let's plot the relationship between each of our points per game factor quantiles and see how strongly they predict DK points on a given day.


In [13]:
# Format rolling fantasy points per game data for use in Alphalens.
factor_data = get_clean_factor_and_forward_scores(
    factor=trailing_20_dk_score,
    scores=scores.loc['2017-11-09':],
    periods=[1],
)

In [14]:
mean_by_q_daily, std_err_by_q_daily = al.performance.mean_return_by_quantile(factor_data,
                                                                                       by_date=True,
                                                                                       demeaned=False)

In [15]:
al.plotting.plot_quantile_returns_violin(mean_by_q_daily/10000);


It looks like our fantasy points per game factor is a strong predictor. We see a narrow and distinct distribution around each quantile's mean DK points (y axis).

Let's look at another visualization of the points per game factor and plot out the daily mean of each quantile.


In [16]:
quant_spread, std_err_spread = al.performance.compute_mean_returns_spread(mean_by_q_daily,
                                                                                 upper_quant=5,
                                                                                 lower_quant=1,
                                                                                 std_err=std_err_by_q_daily)

In [17]:
import matplotlib.cm as cm
import matplotlib.pyplot as plt

plt.figure(figsize=(14,8))
quantiles = mean_by_q_daily.index.get_level_values(0)
colors = cm.rainbow(np.linspace(0, 1, len(quantiles.unique())))
for quantile in quantiles.unique():
    x = mean_by_q_daily[quantiles == quantile].index.get_level_values(1)
    y = mean_by_q_daily[quantiles == quantile]
    plt.plot(x, y, color=colors[quantile-1])
    
ax = plt.gca()
ax.legend(quantiles.unique())
ax.set_ylabel('DK Points')
ax.set_xlabel('Date')

plt.title('Daily Mean DK Points Per Quantile (Points Per Game Factor)')


Out[17]:
<matplotlib.text.Text at 0x7f7a398f2a90>

Optimize (Portfolio Optimization)

Now that we have identified some strong predictors of fantasy points, let's use the data to build an optimal portfolio each day. Specifically, we will pick a team of players (construct a portfolio) based on their mean points per game over the last 20 days.

On Quantopian, we can use the Optimize API to help us calculate an optimal 'portfolio' each day. In finance, this means picking the stocks that we should hold each day based on an objective function and a set of constraints. In daily fantasy sports, the problem is the same: we want to pick a lineup based on an objective function (maximize DK points) subject to a series of constraints (based on the rules of the game).

Generally speaking, constraints are the easy part. The rules are set before every game and we know that our lineup has to follow these rules. Specifically, the NBA game at DraftKings has the following rules:

  • The lineup must be composed of exactly 8 players.
  • The sum of the fictional player salaries in a lineup must be less than 50,000.
  • A lineup must include players from at least two NBA games.
  • A lineup must include at least one player classified in each of the following positions:
    • PG
    • SG
    • PF
    • SF
    • C
    • G (PG or SG, in addition to the other PG and SG slots)
    • F (PF or SF, in addition to the other PF and SF slots)
    • U (the last spot can be occupied by a player of any position).

The challenging part is creating an objective function. We know that we want to maximize the number of DK points that we get on a given day, but without time travel, it's difficult to know exactly what the function looks like. In practice, this means we need to define an 'expected value' objective that represents our expected performance of each player. Earlier, we saw that our trailing points per game factor was a good predictor of DK points, so let's use that as our objective function.

As an optimization problem, this means that we want to pick a team of players subject to the DK rules (salary cap, positions, etc.) that maximizes the total trailing points per game factor.

Let's define our problem using the Optimize API.


In [18]:
# Import the Optimize namespace.
import quantopian.optimize as opt

Let's start by assembling and formatting the player salaries and positions so that we can use them as constraints in our optimization problem.


In [19]:
# Player costs.
player_costs = pd.DataFrame(df['dk_salary']).dropna().sort_index(level=['game_datetime', 'player_id'])

# Player positions.
player_positions = df['dk_position_codes'].dropna().apply((lambda x: [int(y) for y in x[1:-1].split(',')]))

Then, let's restructure our (trailing_20_dk_score) so that we can use it as an objective function in Optimize. We'll define it to be parametric on the date so that we can play around with the date later on.


In [20]:
# Define the game date that we want to test. Change this and re-run 
# cells below to build a lineup for another date.
TEST_DATE = '2017-12-01'

In [21]:
# Format our expected value factor (trailing_20_dk_score) for use in Optimize.
expected_value = trailing_20_dk_score[player_costs.index].dropna()
expected_value = expected_value[expected_value.index.get_level_values(0) == TEST_DATE]

Objective

Next, let's define our objective function using optimize.MaximizeAlpha, which will try to maximize the value of our lineup based on our expected_value factor.


In [22]:
# Objective function to be fed into Optimize later on.
obj = opt.MaximizeAlpha(expected_value)

Constraints

Many of the built-in constraints are centered around finance problems, so we will have to make some simplifying assumptions in order to use them to build our lineup. Later in the notebook, we will build our own optimizer to build a lineup that strictly follows the DK rules.

Let's start by defining our salary cap constraint:


In [23]:
# Salary cap constraint. The total cost of our lineup has to be less than 50,000.
# We will define this as a FactorExposure constraint.
cost_limit = opt.FactorExposure(
    player_costs, 
    min_exposures={'dk_salary': 0}, 
    max_exposures={'dk_salary': 50000}
)

Next, we'll define a position constraint. To simplify things, we'll use more generic positions, and limit ourselves to a maximum of 4 forwards (F), 4 guards (G), and 2 centers (C), with a minimum of 3 F, 3 G, and 1 C.


In [24]:
# Map from each player to its position.
labels = df['position'].apply(lambda x: x[0])

# # Maps from each position to its min/max exposure.
min_exposures = {'F': 3, 'G': 3, 'C': 1}
max_exposures = {'F': 4, 'G': 4, 'C': 2}

player_position_constraint = opt.NetGroupExposure(labels, min_exposures, max_exposures)

At this point, it's worth noting that Optimize operates in weight space, meaning it chooses a percentage (not bounded by a total of 100%, can also be positive or negative) of the overall lineup to allocate to each asset/player. In an attempt to get around that, the next constraint we define will specify that we can hold no more than 1 of each player, and no fewer than 0 of each player (you can't short sell a player... yet).


In [25]:
# This constraints tells the Optimizer than we can hold at most 1 of each player
discretize_players = opt.PositionConcentration(
    pd.Series([]),
    pd.Series([]),
    default_min_weight=0,
    default_max_weight=1,
)

Lastly, let's define a constraint for our total number of players.


In [26]:
max_players = opt.MaxGrossExposure(8)

Calculate Optimal Portfolio (Build Lineup)

The next step is to calculate our optimal portfolio for the day using our objective and constraints.


In [27]:
result = opt.calculate_optimal_portfolio(
    objective=obj,
    constraints=[
        discretize_players,
        player_position_constraint,
        max_players,
        cost_limit,
    ]
)

And here are the resulting weights:


In [28]:
resulting_picks = result[(result.index.get_level_values(0) == TEST_DATE)]
player_weights = resulting_picks[resulting_picks>0]

In [29]:
lineup_info = df.loc[player_weights.index][['matchup', 'player', 'position', 'dk_salary']]

In [30]:
lineup_info['weight'] = player_weights

In [31]:
lineup_info


Out[31]:
matchup player position dk_salary weight
game_datetime player_id
2017-12-01 200768 TOR vs. IND Kyle Lowry G 8100.0 1.000000
201188 MEM vs. SAS Marc Gasol C 7800.0 1.000000
201988 SAS @ MEM Patty Mills G 4100.0 1.000000
202331 OKC vs. MIN Paul George F 8200.0 1.000000
202692 UTA vs. NOP Alec Burks G 3600.0 0.459459
202710 MIN @ OKC Jimmy Butler G 7300.0 0.540541
202711 IND @ TOR Bojan Bogdanovic F 4600.0 1.000000
202734 NOP @ UTA E'Twaun Moore F-G 3600.0 1.000000
203506 IND @ TOR Victor Oladipo G 8000.0 1.000000

Something that immediately pops out is the non-binary weight assigned to Kemba Walker. Obviously, it's not possible to pick 0.55% of a player in DFS, so this lineup won't be eligible to enter. Unfortunately, it's not currently possible to binarize the outputs of Optimize on Quantopian (since it's not a helpful feature in quantitative finance).

CVXPY

To get around the limits of Optimize, we can take a similar approach and use cvxpy, a Python library for solving convex optimization problems (and the library on top of which Optimize was built), to build our own objective and constraints. In this version, we will follow the exact rules of the DraftKings NBA game (8 players, \$50k salary cap, players in at least two games, and one player in each of the following positions: ['PG', 'SG', 'PF', 'SF', 'C', 'G', 'F', 'U']. The following cells build constraints to enforce these rules.


In [32]:
# List of DK positions.
DK_POSITION_LIST = ['PG', 'SG', 'PF', 'SF', 'C', 'G', 'F', 'U']

In [36]:
# Get the average play time for all players over the last 20 games. We will use this to
# filter out players with skewed stats due to playing very little.
trailing_20_mins = df['min'].unstack().shift(1).rolling(20, min_periods=5).mean().stack()

In [37]:
# Players with non-null salary on TEST_DATE.
have_salary = player_costs[player_costs.index.get_level_values(0) == TEST_DATE].dropna().index

# Values from our factor for TEST_DATE.
trailing_dk_score_today = trailing_20_dk_score[
    (trailing_20_dk_score.index.get_level_values(0) == TEST_DATE)
]

# Players with at least 5 minutes per game on average over the last 20 days.
trailing_mins_today = trailing_20_mins[(trailing_20_mins.index.get_level_values(0) == TEST_DATE)]
have_play_time = trailing_mins_today[trailing_mins_today >= 5].index

# Eligible players for us to pick have a non-null salary and at least 5 minutes
# played per game over the last 20 days.
eligible_players_today = have_salary.intersection(have_play_time)

# The game ID in which each player is playing.
player_games = df.loc[eligible_players_today].game_id

# The set of all game IDs on TEST_DATE
todays_games = df.loc[eligible_players_today].game_id.unique().tolist()

Next, we will import cvxpy and define our objective and constraints.


In [38]:
import cvxpy as cvx

In [39]:
# Player salaries and expected values.
salaries = np.squeeze(np.array(player_costs.loc[eligible_players_today]))
values = np.array(expected_value[eligible_players_today])

# The variable we are solving for. We define our output variable as a Bool
# since we have to make a binary decision on each player (pick or don't pick).
selection = cvx.Bool(len(salaries))
selection.is_positive = new_is_positive

# Our lineup's total salary must be less than 50,000.
salary_cap = 50000
cost_constraint = salaries * selection <= salary_cap

# Our lineup must be composed of exactly 8 players.
player_constraint = np.ones(len(salaries)) * selection == 8

# Our total expected value is the sum of the value of each player in
# the lineup. We define our objective to maximize the total expected
# value.
total_expected_value = values * selection
objective = cvx.Maximize(total_expected_value)

# Put our cost and player count constraints in a list.
constraints = [cost_constraint, player_constraint]

# Define our position constraints. Positions are represented along an 8-element
# array corresponding to the positions in DK_POSITION_LIST.
position_min = np.array([1, 1, 1, 1, 1, 3, 3, 8])
pos_limits = {}
i = 0
for pos in DK_POSITION_LIST:
    pos_limits[pos] = np.array(player_positions[eligible_players_today].apply(lambda x: x[i]))
    constraints.append((pos_limits[pos] * selection) >= position_min[i])
    i += 1

# Define our game constraints. We rephrase the rule as 'you cannot pick more than
# 7 players from any one game'.
for gid in todays_games:
    game_limit = np.array(player_games == gid)
    constraints.append((game_limit * selection) <= 7)
    
    

# We tell cvxpy that we want maximize our expected value, subject to all
# of our constraints.
optimization_problem = cvx.Problem(objective, constraints)

print "Our total expected value from today's lineup is:"

# Solving the problem.
optimization_problem.solve(solver=cvx.ECOS_BB)


Our total expected value from today's lineup is:
Out[39]:
273.37121208187256

Based on our player-by-player expected values and the constraints we supplied to cvxpy, our optimal lineup has a total expected value of 273.37.

Let's take a look at who is in this lineup to make sure that we implemented the rules properly:


In [40]:
# Format output and get relevant player info for display.
dk_team = pd.Series(np.squeeze(selection.value).tolist()[0])
player_info = df.loc[player_costs.loc[eligible_players_today].iloc[dk_team[dk_team > 0.1].index.values].index][['matchup', 'player', 'dk_position', 'dk_salary', 'dk_score']]

In [41]:
player_info


Out[41]:
matchup player dk_position dk_salary dk_score
game_datetime player_id
2017-12-01 200768 TOR vs. IND Kyle Lowry PG 8100.0 31.75
201959 MIN @ OKC Taj Gibson PF/C 5700.0 16.00
202331 OKC vs. MIN Paul George SF 8200.0 62.00
202692 UTA vs. NOP Alec Burks PG/SG 3600.0 35.75
202710 MIN @ OKC Jimmy Butler SG 7300.0 45.25
203087 CHA @ MIA Jeremy Lamb SG/SF 5700.0 24.50
203506 IND @ TOR Victor Oladipo PG/SG 8000.0 66.25
1627832 TOR vs. IND Fred VanVleet PG 3400.0 27.50

In [48]:
print "Total lineup salary: %d" % player_info['dk_salary'].sum()
print "Total actual score: %d" % player_info['dk_score'].sum()


Total lineup salary: 50000
Total actual score: 309

Our roster appears to satisfy the rules!

Backtest (a.k.a. run the above code over several consecutive days)


In [42]:
def backtest(factor, dates, filters, df, player_costs, player_positions):
    historical_results = {}
    
    daily_expected_value = factor[filters].dropna()
    
    for _date in dates:
        try:
            print _date
            daily_filter = filters[filters.get_level_values(0) == _date]
            expected_value_today = daily_expected_value[daily_expected_value.index.get_level_values(0) == _date]

            # The game ID in which each player is playing.
            player_games = df.loc[daily_filter].game_id

            # The set of all game IDs on the current date.
            todays_games = df.loc[daily_filter].game_id.unique().tolist()

            # Player salaries and expected values.
            salaries = np.squeeze(np.array(player_costs.loc[daily_filter]))
            values = np.array(expected_value_today[daily_filter])

            # The variable we are solving for. We define our output variable as a Bool
            # since we have to make a binary decision on each player (pick or don't pick).
            selection = cvx.Bool(len(salaries))
            selection.is_positive = new_is_positive

            # Our lineup's total salary must be less than 50,000.
            salary_cap = 50000
            cost_constraint = salaries * selection <= salary_cap

            # Our lineup must be composed of exactly 8 players.
            player_constraint = np.ones(len(salaries)) * selection == 8

            # Our total expected value is the sum of the value of each player in
            # the lineup. We define our objective to maximize the total expected
            # value.
            total_expected_value = values * selection
            objective = cvx.Maximize(total_expected_value)

            # Put our cost and player count constraints in a list.
            constraints = [cost_constraint, player_constraint]

            # Define our position constraints. Positions are represented along an 8-element
            # array corresponding to the positions in DK_POSITION_LIST.
            position_min = np.array([1, 1, 1, 1, 1, 3, 3, 8])
            pos_limits = {}
            i = 0
            for pos in DK_POSITION_LIST:
                pos_limits[pos] = np.array(player_positions[daily_filter].apply(lambda x: x[i]))
                constraints.append((pos_limits[pos] * selection) >= position_min[i])
                i += 1

            # Define our game constraints. We rephrase the rule as 'you cannot pick more than
            # 7 players from any one game'.
            for gid in todays_games:
                game_limit = np.array(player_games == gid)
                constraints.append((game_limit * selection) <= 7)


            # We tell cvxpy that we want maximize our expected value, subject to all
            # of our constraints.
            knapsack_problem = cvx.Problem(objective, constraints)

            # Solving the problem.
            predicted_value = knapsack_problem.solve(solver=cvx.ECOS_BB)

            dk_team = pd.Series(np.squeeze(selection.value).tolist()[0])
            player_info = df.loc[player_costs.loc[filters].iloc[dk_team[dk_team > 0.1].index.values].index][['matchup', 'player', 'dk_position', 'dk_salary', 'dk_score']]
            historical_results[_date] = {
                'actual_score': player_info.dk_score.sum(),
                'expected_score': predicted_value,
            }
        except TypeError:
            pass
        except ValueError:
            pass
    
    return historical_results

In [45]:
test_dates = trailing_20_dk_score.index.get_level_values(0).unique().values[15:]
has_playtime = trailing_20_mins[trailing_20_mins >= 5]

In [46]:
filters = player_costs.index & has_playtime.index
backtest_result = backtest(
    trailing_20_dk_score,
    test_dates,
    filters,
    df,
    player_costs,
    player_positions,
)


2017-11-10T00:00:00.000000000
2017-11-11T00:00:00.000000000
2017-11-12T00:00:00.000000000
2017-11-13T00:00:00.000000000
2017-11-14T00:00:00.000000000
2017-11-15T00:00:00.000000000
2017-11-16T00:00:00.000000000
2017-11-17T00:00:00.000000000
2017-11-18T00:00:00.000000000
2017-11-19T00:00:00.000000000
2017-11-20T00:00:00.000000000
2017-11-21T00:00:00.000000000
2017-11-22T00:00:00.000000000
2017-11-24T00:00:00.000000000
2017-11-25T00:00:00.000000000
2017-11-26T00:00:00.000000000
2017-11-27T00:00:00.000000000
2017-11-28T00:00:00.000000000
2017-11-29T00:00:00.000000000
2017-11-30T00:00:00.000000000
2017-12-01T00:00:00.000000000
2017-12-02T00:00:00.000000000
2017-12-03T00:00:00.000000000
2017-12-04T00:00:00.000000000
2017-12-05T00:00:00.000000000
2017-12-06T00:00:00.000000000
2017-12-07T00:00:00.000000000
2017-12-08T00:00:00.000000000
2017-12-09T00:00:00.000000000
2017-12-10T00:00:00.000000000
2017-12-11T00:00:00.000000000
2017-12-12T00:00:00.000000000
2017-12-13T00:00:00.000000000
2017-12-14T00:00:00.000000000
2017-12-15T00:00:00.000000000
2017-12-16T00:00:00.000000000
2017-12-17T00:00:00.000000000
2017-12-18T00:00:00.000000000
2017-12-19T00:00:00.000000000
2017-12-20T00:00:00.000000000
2017-12-21T00:00:00.000000000
2017-12-22T00:00:00.000000000
2017-12-23T00:00:00.000000000
2017-12-25T00:00:00.000000000
2017-12-26T00:00:00.000000000
2017-12-27T00:00:00.000000000
2017-12-28T00:00:00.000000000
2017-12-29T00:00:00.000000000
2017-12-30T00:00:00.000000000
2017-12-31T00:00:00.000000000
2018-01-01T00:00:00.000000000
2018-01-02T00:00:00.000000000
2018-01-03T00:00:00.000000000
2018-01-04T00:00:00.000000000
2018-01-05T00:00:00.000000000
2018-01-06T00:00:00.000000000
2018-01-07T00:00:00.000000000
2018-01-08T00:00:00.000000000
2018-01-09T00:00:00.000000000
2018-01-10T00:00:00.000000000
2018-01-11T00:00:00.000000000
2018-01-12T00:00:00.000000000
2018-01-13T00:00:00.000000000
2018-01-14T00:00:00.000000000
2018-01-15T00:00:00.000000000
2018-01-16T00:00:00.000000000
2018-01-17T00:00:00.000000000
2018-01-18T00:00:00.000000000
2018-01-19T00:00:00.000000000
2018-01-20T00:00:00.000000000
2018-01-21T00:00:00.000000000
2018-01-22T00:00:00.000000000
2018-01-23T00:00:00.000000000
2018-01-24T00:00:00.000000000
2018-01-25T00:00:00.000000000
2018-01-26T00:00:00.000000000
2018-01-27T00:00:00.000000000
2018-01-28T00:00:00.000000000
2018-01-29T00:00:00.000000000
2018-01-30T00:00:00.000000000
2018-01-31T00:00:00.000000000
2018-02-01T00:00:00.000000000
2018-02-02T00:00:00.000000000
2018-02-03T00:00:00.000000000
2018-02-04T00:00:00.000000000
2018-02-05T00:00:00.000000000
2018-02-06T00:00:00.000000000
2018-02-07T00:00:00.000000000
2018-02-08T00:00:00.000000000
2018-02-09T00:00:00.000000000
2018-02-10T00:00:00.000000000
2018-02-11T00:00:00.000000000
2018-02-12T00:00:00.000000000
2018-02-13T00:00:00.000000000
2018-02-14T00:00:00.000000000
2018-02-15T00:00:00.000000000
2018-02-22T00:00:00.000000000
2018-02-23T00:00:00.000000000
2018-02-24T00:00:00.000000000
2018-02-25T00:00:00.000000000
2018-02-26T00:00:00.000000000
2018-02-27T00:00:00.000000000
2018-02-28T00:00:00.000000000
2018-03-01T00:00:00.000000000
2018-03-02T00:00:00.000000000
2018-03-03T00:00:00.000000000
2018-03-04T00:00:00.000000000
2018-03-05T00:00:00.000000000
2018-03-06T00:00:00.000000000
2018-03-07T00:00:00.000000000
2018-03-08T00:00:00.000000000
2018-03-09T00:00:00.000000000
2018-03-10T00:00:00.000000000
2018-03-11T00:00:00.000000000
2018-03-12T00:00:00.000000000
2018-03-13T00:00:00.000000000
2018-03-14T00:00:00.000000000
2018-03-15T00:00:00.000000000
2018-03-16T00:00:00.000000000
2018-03-17T00:00:00.000000000
2018-03-18T00:00:00.000000000
2018-03-19T00:00:00.000000000
2018-03-20T00:00:00.000000000
2018-03-21T00:00:00.000000000
2018-03-22T00:00:00.000000000
2018-03-23T00:00:00.000000000
2018-03-24T00:00:00.000000000
2018-03-25T00:00:00.000000000
2018-03-26T00:00:00.000000000
2018-03-27T00:00:00.000000000
2018-03-28T00:00:00.000000000
2018-03-29T00:00:00.000000000
2018-03-30T00:00:00.000000000
2018-03-31T00:00:00.000000000
2018-04-01T00:00:00.000000000
2018-04-03T00:00:00.000000000
2018-04-04T00:00:00.000000000
2018-04-05T00:00:00.000000000
2018-04-06T00:00:00.000000000
2018-04-07T00:00:00.000000000
2018-04-08T00:00:00.000000000
2018-04-09T00:00:00.000000000
2018-04-10T00:00:00.000000000
2018-04-11T00:00:00.000000000

In [47]:
daily_results = pd.DataFrame.from_dict(backtest_result, orient='index')
print daily_results.mean()
daily_results.plot();


actual_score      189.060714
expected_score    276.071713
dtype: float64

Conclusion

In this notebook, we looked at daily fantasy sports using quantitative finance tools. We started out by pulling in and looking at NBA data. We created and tested the ability of certain statistics to predict DK points using Alphalens. We attempted to build an optimal lineup using Optimize. And lastly, we used cvxpy to build a proper solution to the daily lineup construction problem.

Future Work

In both quantitative finance and daily fantasy sports, the hardest problem is forecasting results of an asset or player in the future. Now that we have a notebook built to help us test factors and build lineups, the next step is to iterate on ideas of predictive signals. In this example, we only looked at two relatively simple metrics (minutes played and points scored). It would be interesting to try new factors to see if they are predictive or provide a leg up on the competitiong.

Another area that we didn't explore in this notebook is risk. In most team sports, the measured statistics of one player tend to be correlated (positively or negatively) with other players in the game. For example, players might get fantasy points for getting an assist when a teammate scores. These correlations can be defined as risk factors. Risk factors can be exploited or hedged depending on the desired risk profile of the lineup. The objective function or constraints in the optimization problem could be altered to either try to exploit or hedge risk.


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:

Helper Functions - Cells Below Need to Be Run at Notebook Start


In [15]:
import numpy as np
from pandas.tseries.offsets import CustomBusinessDay

from scipy.stats import mode

def compute_forward_scores(factor,
                            scores,
                            periods=(1, 5, 10),
                            filter_zscore=None):
    """
    Finds the N period forward returns (as percent change) for each asset
    provided.
    Parameters
    ----------
    factor : pd.Series - MultiIndex
        A MultiIndex Series indexed by timestamp (level 0) and asset
        (level 1), containing the values for a single alpha factor.
        - See full explanation in utils.get_clean_factor_and_forward_returns
    scores : pd.DataFrame
        Pricing data to use in forward price calculation.
        Assets as columns, dates as index. Pricing data must
        span the factor analysis time period plus an additional buffer window
        that is greater than the maximum number of expected periods
        in the forward returns calculations.
    periods : sequence[int]
        periods to compute forward returns on.
    filter_zscore : int or float, optional
        Sets forward returns greater than X standard deviations
        from the the mean to nan. Set it to 'None' to avoid filtering.
        Caution: this outlier filtering incorporates lookahead bias.
    Returns
    -------
    forward_returns : pd.DataFrame - MultiIndex
        A MultiIndex DataFrame indexed by timestamp (level 0) and asset
        (level 1), containing the forward returns for assets.
        Forward returns column names follow the format accepted by
        pd.Timedelta (e.g. '1D', '30m', '3h15m', '1D1h', etc).
        'date' index freq property (forward_returns.index.levels[0].freq)
        will be set to a trading calendar (pandas DateOffset) inferred
        from the input data (see infer_trading_calendar for more details).
    """

    factor_dateindex = factor.index.levels[0]
    if factor_dateindex.tz != scores.index.tz:
        raise NonMatchingTimezoneError("The timezone of 'factor' is not the "
                                       "same as the timezone of 'scores'. See "
                                       "the pandas methods tz_localize and "
                                       "tz_convert.")

    freq = infer_trading_calendar(factor_dateindex, scores.index)

    factor_dateindex = factor_dateindex.intersection(scores.index)

    if len(factor_dateindex) == 0:
        raise ValueError("Factor and scores indices don't match: make sure "
                         "they have the same convention in terms of datetimes "
                         "and symbol-names")

    forward_returns = pd.DataFrame(index=pd.MultiIndex.from_product(
        [factor_dateindex, scores.columns], names=['date', 'asset']))

    forward_returns.index.levels[0].freq = freq

    for period in sorted(periods):
        #
        # build forward returns
        #
        fwdret = (scores
                  #.shift(-period)
                  .reindex(factor_dateindex)
                  )

        if filter_zscore is not None:
            mask = abs(fwdret - fwdret.mean()) > (filter_zscore * fwdret.std())
            fwdret[mask] = np.nan

        # Find the period length, which will be the column name
        # Becase the calendar inferred from factor and prices doesn't take
        # into consideration holidays yet, there could be some non-trading days
        # in between the trades so we'll test several entries to find out the
        # correct period length
        #
        entries_to_test = min(10, len(fwdret.index), len(scores.index)-period)
        days_diffs = []
        for i in range(entries_to_test):
            p_idx = scores.index.get_loc(fwdret.index[i])
            start = scores.index[p_idx]
            end = scores.index[p_idx+period]
            period_len = diff_custom_calendar_timedeltas(start, end, freq)
            days_diffs.append(period_len.components.days)

        delta_days = period_len.components.days - mode(days_diffs).mode[0]
        period_len -= pd.Timedelta(days=delta_days)

        # Finally use period_len as column name
        column_name = timedelta_to_string(period_len)
        forward_returns[column_name] = fwdret.stack()

    forward_returns.index = forward_returns.index.rename(['date', 'asset'])

    return forward_returns

def get_clean_factor_and_forward_scores(factor,
                                         scores,
                                         groupby=None,
                                         binning_by_group=False,
                                         quantiles=5,
                                         bins=None,
                                         periods=(1, 5, 10),
                                         filter_zscore=20,
                                         groupby_labels=None,
                                         max_loss=10.0):
    
    forward_scores = compute_forward_scores(factor, scores, periods,
                                              filter_zscore)

    factor_data = get_clean_factor(factor, forward_scores, groupby=groupby,
                                   groupby_labels=groupby_labels,
                                   quantiles=quantiles, bins=bins,
                                   binning_by_group=binning_by_group,
                                   max_loss=max_loss)

    return factor_data


def infer_trading_calendar(factor_idx, prices_idx):
    """
    Infer the trading calendar from factor and price information.
    Parameters
    ----------
    factor_idx : pd.DatetimeIndex
        The factor datetimes for which we are computing the forward returns
    prices_idx : pd.DatetimeIndex
        The prices datetimes associated withthe factor data
    Returns
    -------
    calendar : pd.DateOffset
    """
    full_idx = factor_idx.union(prices_idx)

    # drop days of the week that are not used
    days_to_keep = []
    days_of_the_week = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
    for day, day_str in enumerate(days_of_the_week):
        if (full_idx.dayofweek == day).any():
            days_to_keep.append(day_str)

    days_to_keep = ' '.join(days_to_keep)

    # we currently don't infer holidays, but CustomBusinessDay class supports
    # custom holidays. So holidays could be inferred too eventually
    return CustomBusinessDay(weekmask=days_to_keep)

def diff_custom_calendar_timedeltas(start, end, freq):
    """
    Compute the difference between two pd.Timedelta taking into consideration
    custom frequency, which is used to deal with custom calendars, such as a
    trading calendar
    Parameters
    ----------
    start : pd.Timestamp
    end : pd.Timestamp
    freq : DateOffset, optional
    Returns
    -------
    pd.Timedelta
        end - start
    """
    actual_days = pd.date_range(start, end, freq=freq).shape[0] - 1
    timediff = end - start
    delta_days = timediff.components.days - actual_days
    return timediff - pd.Timedelta(days=delta_days)

def get_clean_factor(factor,
                     forward_returns,
                     groupby=None,
                     binning_by_group=False,
                     quantiles=5,
                     bins=None,
                     groupby_labels=None,
                     max_loss=0.35):
    """
    Formats the factor data, forward return data, and group mappings into a
    DataFrame that contains aligned MultiIndex indices of timestamp and asset.
    The returned data will be formatted to be suitable for Alphalens functions.
    It is safe to skip a call to this function and still make use of Alphalens
    functionalities as long as the factor data conforms to the format returned
    from get_clean_factor_and_forward_returns and documented here
    Parameters
    ----------
    factor : pd.Series - MultiIndex
        A MultiIndex Series indexed by timestamp (level 0) and asset
        (level 1), containing the values for a single alpha factor.
        ::
            -----------------------------------
                date    |    asset   |
            -----------------------------------
                        |   AAPL     |   0.5
                        -----------------------
                        |   BA       |  -1.1
                        -----------------------
            2014-01-01  |   CMG      |   1.7
                        -----------------------
                        |   DAL      |  -0.1
                        -----------------------
                        |   LULU     |   2.7
                        -----------------------
    forward_returns : pd.DataFrame - MultiIndex
        A MultiIndex DataFrame indexed by timestamp (level 0) and asset
        (level 1), containing the forward returns for assets.
        Forward returns column names must follow the format accepted by
        pd.Timedelta (e.g. '1D', '30m', '3h15m', '1D1h', etc).
        'date' index freq property must be set to a trading calendar
        (pandas DateOffset), see infer_trading_calendar for more details.
        This information is currently used only in cumulative returns
        computation
        ::
            ---------------------------------------
                       |       | 1D  | 5D  | 10D
            ---------------------------------------
                date   | asset |     |     |
            ---------------------------------------
                       | AAPL  | 0.09|-0.01|-0.079
                       ----------------------------
                       | BA    | 0.02| 0.06| 0.020
                       ----------------------------
            2014-01-01 | CMG   | 0.03| 0.09| 0.036
                       ----------------------------
                       | DAL   |-0.02|-0.06|-0.029
                       ----------------------------
                       | LULU  |-0.03| 0.05|-0.009
                       ----------------------------
    groupby : pd.Series - MultiIndex or dict
        Either A MultiIndex Series indexed by date and asset,
        containing the period wise group codes for each asset, or
        a dict of asset to group mappings. If a dict is passed,
        it is assumed that group mappings are unchanged for the
        entire time period of the passed factor data.
    binning_by_group : bool
        If True, compute quantile buckets separately for each group.
        This is useful when the factor values range vary considerably
        across gorups so that it is wise to make the binning group relative.
        You should probably enable this if the factor is intended
        to be analyzed for a group neutral portfolio
    quantiles : int or sequence[float]
        Number of equal-sized quantile buckets to use in factor bucketing.
        Alternately sequence of quantiles, allowing non-equal-sized buckets
        e.g. [0, .10, .5, .90, 1.] or [.05, .5, .95]
        Only one of 'quantiles' or 'bins' can be not-None
    bins : int or sequence[float]
        Number of equal-width (valuewise) bins to use in factor bucketing.
        Alternately sequence of bin edges allowing for non-uniform bin width
        e.g. [-4, -2, -0.5, 0, 10]
        Chooses the buckets to be evenly spaced according to the values
        themselves. Useful when the factor contains discrete values.
        Only one of 'quantiles' or 'bins' can be not-None
    groupby_labels : dict
        A dictionary keyed by group code with values corresponding
        to the display name for each group.
    max_loss : float, optional
        Maximum percentage (0.00 to 1.00) of factor data dropping allowed,
        computed comparing the number of items in the input factor index and
        the number of items in the output DataFrame index.
        Factor data can be partially dropped due to being flawed itself
        (e.g. NaNs), not having provided enough price data to compute
        forward returns for all factor values, or because it is not possible
        to perform binning.
        Set max_loss=0 to avoid Exceptions suppression.
    Returns
    -------
    merged_data : pd.DataFrame - MultiIndex
        A MultiIndex Series indexed by date (level 0) and asset (level 1),
        containing the values for a single alpha factor, forward returns for
        each period, the factor quantile/bin that factor value belongs to, and
        (optionally) the group the asset belongs to.
        - forward returns column names follow the format accepted by
          pd.Timedelta (e.g. '1D', '30m', '3h15m', '1D1h', etc)
        - 'date' index freq property (merged_data.index.levels[0].freq) is the
          same as that of the input forward returns data. This is currently
          used only in cumulative returns computation
        ::
           -------------------------------------------------------------------
                      |       | 1D  | 5D  | 10D  |factor|group|factor_quantile
           -------------------------------------------------------------------
               date   | asset |     |     |      |      |     |
           -------------------------------------------------------------------
                      | AAPL  | 0.09|-0.01|-0.079|  0.5 |  G1 |      3
                      --------------------------------------------------------
                      | BA    | 0.02| 0.06| 0.020| -1.1 |  G2 |      5
                      --------------------------------------------------------
           2014-01-01 | CMG   | 0.03| 0.09| 0.036|  1.7 |  G2 |      1
                      --------------------------------------------------------
                      | DAL   |-0.02|-0.06|-0.029| -0.1 |  G3 |      5
                      --------------------------------------------------------
                      | LULU  |-0.03| 0.05|-0.009|  2.7 |  G1 |      2
                      --------------------------------------------------------
    """

    initial_amount = float(len(factor.index))

    factor = factor.copy()
    factor.index = factor.index.rename(['date', 'asset'])

    merged_data = forward_returns.copy()
    merged_data['factor'] = factor

    if groupby is not None:
        if isinstance(groupby, dict):
            diff = set(factor.index.get_level_values(
                'asset')) - set(groupby.keys())
            if len(diff) > 0:
                raise KeyError(
                    "Assets {} not in group mapping".format(
                        list(diff)))

            ss = pd.Series(groupby)
            groupby = pd.Series(index=factor.index,
                                data=ss[factor.index.get_level_values(
                                    'asset')].values)

        if groupby_labels is not None:
            diff = set(groupby.values) - set(groupby_labels.keys())
            if len(diff) > 0:
                raise KeyError(
                    "groups {} not in passed group names".format(
                        list(diff)))

            sn = pd.Series(groupby_labels)
            groupby = pd.Series(index=groupby.index,
                                data=sn[groupby.values].values)

        merged_data['group'] = groupby.astype('category')

    merged_data = merged_data.dropna()

    fwdret_amount = float(len(merged_data.index))

    no_raise = False if max_loss == 0 else True
    merged_data['factor_quantile'] = quantize_factor(merged_data,
                                                     quantiles,
                                                     bins,
                                                     binning_by_group,
                                                     no_raise)

    merged_data = merged_data.dropna()

    binning_amount = float(len(merged_data.index))

    tot_loss = (initial_amount - binning_amount) / initial_amount
    fwdret_loss = (initial_amount - fwdret_amount) / initial_amount
    bin_loss = tot_loss - fwdret_loss

#     print("Dropped %.1f%% entries from factor data: %.1f%% in forward "
#           "returns computation and %.1f%% in binning phase "
#           "(set max_loss=0 to see potentially suppressed Exceptions)." %
#           (tot_loss * 100, fwdret_loss * 100,  bin_loss * 100))

    if tot_loss > max_loss:
        message = ("max_loss (%.1f%%) exceeded %.1f%%, consider increasing it."
                   % (max_loss * 100, tot_loss * 100))
        raise MaxLossExceededError(message)
    else:
#         print("max_loss is %.1f%%, not exceeded: OK!" % (max_loss * 100))
        pass

    return merged_data

def timedelta_to_string(timedelta):
    """
    Utility that converts a pandas.Timedelta to a string representation
    compatible with pandas.Timedelta constructor format
    Parameters
    ----------
    timedelta: pd.Timedelta
    Returns
    -------
    string
        string representation of 'timedelta'
    """
    c = timedelta.components
    format = ''
    if c.days != 0:
        format += '%dD' % c.days
    if c.hours > 0:
        format += '%dh' % c.hours
    if c.minutes > 0:
        format += '%dm' % c.minutes
    if c.seconds > 0:
        format += '%ds' % c.seconds
    if c.milliseconds > 0:
        format += '%dms' % c.milliseconds
    if c.microseconds > 0:
        format += '%dus' % c.microseconds
    if c.nanoseconds > 0:
        format += '%dns' % c.nanoseconds
    return format


def add_custom_calendar_timedelta(inputs, timedelta, freq):
    """
    Add timedelta to 'input' taking into consideration custom frequency, which
    is used to deal with custom calendars, such as a trading calendar
    Parameters
    ----------
    input : pd.DatetimeIndex or pd.Timestamp
    timedelta : pd.Timedelta
    freq : DateOffset, optional
    Returns
    -------
    pd.DatetimeIndex or pd.Timestamp
        input + timedelta
    """
    days = timedelta.components.days
    offset = timedelta - pd.Timedelta(days=days)
    return inputs + freq * days + offset

def non_unique_bin_edges_error(func):
    """
    Give user a more informative error in case it is not possible
    to properly calculate quantiles on the input dataframe (factor)
    """
    message = """
    An error occurred while computing bins/quantiles on the input provided.
    This usually happens when the input contains too many identical
    values and they span more than one quantile. The quantiles are choosen
    to have the same number of records each, but the same value cannot span
    multiple quantiles. Possible workarounds are:
    1 - Decrease the number of quantiles
    2 - Specify a custom quantiles range, e.g. [0, .50, .75, 1.] to get unequal
        number of records per quantile
    3 - Use 'bins' option instead of 'quantiles', 'bins' chooses the
        buckets to be evenly spaced according to the values themselves, while
        'quantiles' forces the buckets to have the same number of records.
    4 - for factors with discrete values use the 'bins' option with custom
        ranges and create a range for each discrete value
    Please see utils.get_clean_factor_and_forward_returns documentation for
    full documentation of 'bins' and 'quantiles' options.
"""

    def dec(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except ValueError as e:
            if 'Bin edges must be unique' in str(e):
                rethrow(e, message)
            raise
    return dec

@non_unique_bin_edges_error
def quantize_factor(factor_data,
                    quantiles=5,
                    bins=None,
                    by_group=False,
                    no_raise=False):
    """
    Computes period wise factor quantiles.
    Parameters
    ----------
    factor_data : pd.DataFrame - MultiIndex
        A MultiIndex DataFrame indexed by date (level 0) and asset (level 1),
        containing the values for a single alpha factor, forward returns for
        each period, the factor quantile/bin that factor value belongs to, and
        (optionally) the group the asset belongs to.
        - See full explanation in utils.get_clean_factor_and_forward_returns
    quantiles : int or sequence[float]
        Number of equal-sized quantile buckets to use in factor bucketing.
        Alternately sequence of quantiles, allowing non-equal-sized buckets
        e.g. [0, .10, .5, .90, 1.] or [.05, .5, .95]
        Only one of 'quantiles' or 'bins' can be not-None
    bins : int or sequence[float]
        Number of equal-width (valuewise) bins to use in factor bucketing.
        Alternately sequence of bin edges allowing for non-uniform bin width
        e.g. [-4, -2, -0.5, 0, 10]
        Only one of 'quantiles' or 'bins' can be not-None
    by_group : bool
        If True, compute quantile buckets separately for each group.
    no_raise: bool, optional
        If True, no exceptions are thrown and the values for which the
        exception would have been thrown are set to np.NaN
    Returns
    -------
    factor_quantile : pd.Series
        Factor quantiles indexed by date and asset.
    """
    if not ((quantiles is not None and bins is None) or
            (quantiles is None and bins is not None)):
        raise ValueError('Either quantiles or bins should be provided')

    def quantile_calc(x, _quantiles, _bins, _no_raise):
        try:
            if _quantiles is not None and _bins is None:
                return pd.qcut(x, _quantiles, labels=False) + 1
            elif _bins is not None and _quantiles is None:
                return pd.cut(x, _bins, labels=False) + 1
        except Exception as e:
            if _no_raise:
                return pd.Series(index=x.index)
            raise e

    grouper = [factor_data.index.get_level_values('date')]
    if by_group:
        grouper.append('group')

    factor_quantile = factor_data.groupby(grouper)['factor'] \
        .apply(quantile_calc, quantiles, bins, no_raise)
    factor_quantile.name = 'factor_quantile'

    return factor_quantile.dropna()

In [16]:
def new_is_positive():
    return False

In [ ]: