Introduction

Problem Description

Data-driven approaches are now used in many fields, from business to science. Since data storage and computational power have become cheap, machine learning has gained popularity. However, the majority of tools that can extract dependencies from data are designed for prediction problems. In this notebook, a problem of decision support simulation is considered, and it is shown that even good predictive models can lead to wrong conclusions. This occurs under conditions summarized by the umbrella term endogeneity. Its particular cases are as follows:

  • An important variable is omitted;
  • Variables that are used as features are measured with biases;
  • There is simultaneous or reverse causality between a target variable and some features.

Here, the omission of an important variable is the root of the trouble.

Suppose the situation is as follows. A freshly hired manager can assign treatment to items in order to increase a target metric. Treatment is binary, i.e. for each item it is either assigned or absent. Because treatment costs something, its assignment should be optimized: only some items should be treated. A historical dataset of items' performance is given, but the manager does not know that previously treatment was assigned predominantly based on the values of just one parameter. Moreover, this parameter is not included in the dataset. The manager wants to create a system that predicts an item's target metric both in case of treatment and in case of absence of treatment. If this system is deployed, the manager can compare these two cases and decide whether the effect of treatment is worth its costs.
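
In code, the system the manager has in mind could look like the following sketch (the function and argument names are hypothetical, introduced only for illustration; the same logic is applied to concrete data at the end of this notebook):

import numpy as np

def choose_items_to_treat(model, features_no_treatment, features_treatment,
                          cost_of_treatment):
    """
    Hypothetical decision rule: treat an item only if its
    predicted gain exceeds the cost of treatment.

    `model` is any fitted regressor with a `predict` method;
    the two feature matrices differ only in the treatment
    indicator column.
    """
    predicted_gain = (model.predict(features_treatment) -
                      model.predict(features_no_treatment))
    return np.nonzero(predicted_gain > cost_of_treatment)[0]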

If the machine learning approach results in good prediction scores, chances are that the manager does not suspect that an important variable is omitted (at least until wrong decisions start generating expenses). Hence, domain knowledge and data understanding are still required for modelling based on data. This is of particular importance when datasets contain values that are produced by someone's decisions, because there is no guarantee that future decisions will not change dramatically. On the flip side, if all factors that affect decisions are included in a dataset, i.e., there is selection on observables for treatment assignment, a powerful enough model is able to estimate the treatment effect correctly (although accuracy of predictions still does not guarantee that causal relationships are detected).

References

To read more about causality in data analysis, the following sources can be recommended:

  1. Angrist J., Pischke J.-S. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press, 2009.

  2. Varian H. Big Data: New Tricks for Econometrics. Journal of Economic Perspectives, 28(2): 3–28, 2014.

Preparations

General


In [1]:
from itertools import combinations

import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np

from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

# Startup settings cannot suppress a warning from `XGBRegressor`, so it is silenced here.
import warnings
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    from xgboost import XGBRegressor

In [2]:
np.random.seed(seed=361)

Synthetic Dataset Generation

Let us generate an unobserved parameter and an indicator of treatment such that they are highly correlated.


In [3]:
unobserved = np.hstack((np.ones(10000), np.zeros(10000)))
treatment = np.hstack((np.ones(9000), np.zeros(10000), np.ones(1000)))

In [4]:
np.corrcoef(unobserved, treatment)


Out[4]:
array([[ 1. ,  0.8],
       [ 0.8,  1. ]])

Now let us create the historical dataset that will be used for learning a predictive model.


In [5]:
def synthesize_dataset(unobserved, treatment,
                       given_exogenous=None, n_exogenous_to_draw=2,
                       weights_matrix=np.array([[5, 0, 0, 0],
                                                [0, 1, 1, 0],
                                                [0, 1, 2, 1],
                                                [0, 0, 1, 3]])):
    """
    A helper function for repetitive
    pieces of code.
    
    Creates a dataset, where target depends on
    `unobserved`, but `unobserved` is not
    included as a feature. Independent features
    can be passed as `given_exogenous` as well as
    be drawn from Gaussian distribution.
    
    Target is generated as linear combination of
    features and their interactions in the
    following manner. Order features as below:
    unobserved variable, treatment indicator,
    given exogenous features, drawn exogenous
    features. Then the (i, i)-th element of
    `weights_matrix` defines coefficient of
    the i-th feature, whereas the (i, j)-th
    element of `weights_matrix` (where i != j)
    defines coefficient of interaction between
    the i-th and j-th features.

    @type unobserved: numpy.ndarray
    @type treatment: numpy.ndarray
    @type given_exogenous: numpy.ndarray
    @type n_exogenous_to_draw: int
    @type weights_matrix: numpy.ndarray
    @rtype: tuple(numpy.ndarray)
    """

    if unobserved.shape != treatment.shape:
        raise ValueError("`unobserved` and `treatment` are not aligned.")
    if (given_exogenous is not None and
            unobserved.shape[0] != given_exogenous.shape[0]):
        raise ValueError("`unobserved` and `given_exogenous` are not " +
                         "aligned. Try to transpose `given_exogenous`.")
    if weights_matrix.shape[0] != weights_matrix.shape[1]:
        raise ValueError("Matrix of weights is not square.")
    if not np.array_equal(weights_matrix, weights_matrix.T):
        raise ValueError("Matrix of weigths is not symmetric.")
    len_of_given = given_exogenous.shape[1] if given_exogenous is not None else 0
    if 2 + len_of_given + n_exogenous_to_draw != weights_matrix.shape[0]:
        raise ValueError("Number of weights is not equal to that of features.")

    drawn_features = []
    for i in range(n_exogenous_to_draw):
        current_feature = np.random.normal(size=unobserved.shape[0])
        drawn_features.append(current_feature)
    if given_exogenous is None:
        features = np.vstack([unobserved, treatment] + drawn_features).T
    else:
        features = np.vstack([unobserved, treatment, given_exogenous.T] +
                             drawn_features).T
    target = np.dot(features, weights_matrix.diagonal())
    indices = list(range(weights_matrix.shape[0]))
    interactions = [weights_matrix[i, j] * features[:, i] * features[:, j]
                    for i, j in combinations(indices, 2)]
    target = np.sum(np.vstack([target] + interactions), axis=0)
    return features[:, 1:], target
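
To make the data generating process explicit: with the default `weights_matrix` and no given exogenous features, the target produced by the function above reduces to

$y = 5u + t + 2x_1 + 3x_2 + t x_1 + x_1 x_2,$

where $u$ is the unobserved variable, $t$ is the treatment indicator, and $x_1$, $x_2$ are the drawn exogenous features. Note that $u$ enters with the largest coefficient, yet it is dropped from the returned features.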

In [6]:
learning_X, learning_y = synthesize_dataset(unobserved, treatment)

Now let us create two datasets for simulation. The only difference between them is that in the first one treatment is absent, whereas in the second one treatment is assigned to all items.


In [7]:
unobserved = np.hstack((np.ones(2500), np.zeros(2500)))

In [8]:
no_treatment = np.zeros(5000)
full_treatment = np.ones(5000)

In [9]:
no_treatment_X, no_treatment_y = synthesize_dataset(unobserved, no_treatment)
full_treatment_X, full_treatment_y = synthesize_dataset(unobserved, full_treatment,
                                                        no_treatment_X[:, 1:], 0)

Look at the data that are used for simulation.


In [10]:
no_treatment_X[:5, :]


Out[10]:
array([[ 0.        ,  1.58191727, -1.17154285],
       [ 0.        , -0.56614096, -0.24566541],
       [ 0.        , -0.3521775 ,  2.37789361],
       [ 0.        , -1.19089718, -0.54156554],
       [ 0.        ,  1.41070964, -0.14438024]])

In [11]:
full_treatment_X[:5, :]


Out[11]:
array([[ 1.        ,  1.58191727, -1.17154285],
       [ 1.        , -0.56614096, -0.24566541],
       [ 1.        , -0.3521775 ,  2.37789361],
       [ 1.        , -1.19089718, -0.54156554],
       [ 1.        ,  1.41070964, -0.14438024]])

In [12]:
no_treatment_y[:5]


Out[12]:
array([  2.79592213,   3.2698031 ,  10.59188519,   1.63845791,   7.18459995])

In [13]:
full_treatment_y[:5]


Out[13]:
array([  5.3778394 ,   3.70366215,  11.23970769,   1.44756073,   9.59530959])

Good Model...


In [14]:
X_train, X_test, y_train, y_test = train_test_split(learning_X, learning_y,
                                                    random_state=361)
X_train.shape, X_test.shape, y_train.shape, y_test.shape


Out[14]:
((15000, 3), (5000, 3), (15000,), (5000,))

In [15]:
def tune_inform(X_train, y_train, rgr, grid_params, kf, scoring):
    """
    Just a helper function that combines
    all routines related to grid search.
    
    @type X_train: numpy.ndarray
    @type y_train: numpy.ndarray
    @type rgr: any sklearn regressor
    @type grid_params: dict
    @type kf: any sklearn folds
    @type scoring: str
    @rtype: sklearn regressor
    """
    grid_search_cv = GridSearchCV(rgr, grid_params, cv=kf,
                                  scoring=scoring)
    grid_search_cv.fit(X_train, y_train)
    print("Best CV mean score: {}".format(grid_search_cv.best_score_))
    means = grid_search_cv.cv_results_['mean_test_score']
    stds = grid_search_cv.cv_results_['std_test_score']
    print("Detailed results:")
    for mean, std, params in zip(means, stds,
                                 grid_search_cv.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, 2 * std, params))
    return grid_search_cv.best_estimator_

In [16]:
rgr = LinearRegression()
grid_params = {'fit_intercept': [True, False]}
kf = KFold(n_splits=5, shuffle=True, random_state=361)

Let us use the coefficient of determination as a scorer rather than MSE. Actually, for a fixed test set they are linearly related: $R^2 = 1 - \frac{MSE}{\mathrm{Var}(y)}$, but the coefficient of determination is easier to interpret.
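
As a quick sanity check of this identity, here is a small sketch with arbitrary numbers (not data from this notebook); both printed values coincide:

from sklearn.metrics import mean_squared_error

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.7])
# R^2 computed directly and via 1 - MSE / Var(y) give the same number.
print(r2_score(y_true, y_pred))
print(1 - mean_squared_error(y_true, y_pred) / np.var(y_true))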


In [17]:
rgr = tune_inform(X_train, y_train, rgr, grid_params, kf, 'r2')


Best CV mean score: 0.8597040594785548
Detailed results:
0.860 (+/-0.007) for {'fit_intercept': True}
0.854 (+/-0.008) for {'fit_intercept': False}

In [18]:
y_hat = rgr.predict(X_test)
r2_score(y_test, y_hat)


Out[18]:
0.86724859617745453

Although the true relationship is non-linear, the predictive power of linear regression is good, as indicated by a coefficient of determination close to 1. Since the winner is the model with an intercept, its score can be interpreted as follows: the model explains most of the variance of the target around its mean (note that such an interpretation cannot be used for a model without an intercept).


In [19]:
rgr = XGBRegressor()
grid_params = {'n_estimators': [50, 100, 200, 300],
               'max_depth': [3, 5],
               'subsample': [0.8, 1]}
kf = KFold(n_splits=5, shuffle=True, random_state=361)

In [20]:
rgr = tune_inform(X_train, y_train, rgr, grid_params, kf, 'r2')


Best CV mean score: 0.9050386408201856
Detailed results:
0.900 (+/-0.007) for {'max_depth': 3, 'n_estimators': 50, 'subsample': 0.8}
0.900 (+/-0.008) for {'max_depth': 3, 'n_estimators': 50, 'subsample': 1}
0.905 (+/-0.010) for {'max_depth': 3, 'n_estimators': 100, 'subsample': 0.8}
0.905 (+/-0.011) for {'max_depth': 3, 'n_estimators': 100, 'subsample': 1}
0.905 (+/-0.010) for {'max_depth': 3, 'n_estimators': 200, 'subsample': 0.8}
0.905 (+/-0.011) for {'max_depth': 3, 'n_estimators': 200, 'subsample': 1}
0.905 (+/-0.010) for {'max_depth': 3, 'n_estimators': 300, 'subsample': 0.8}
0.905 (+/-0.011) for {'max_depth': 3, 'n_estimators': 300, 'subsample': 1}
0.905 (+/-0.010) for {'max_depth': 5, 'n_estimators': 50, 'subsample': 0.8}
0.905 (+/-0.010) for {'max_depth': 5, 'n_estimators': 50, 'subsample': 1}
0.905 (+/-0.010) for {'max_depth': 5, 'n_estimators': 100, 'subsample': 0.8}
0.905 (+/-0.010) for {'max_depth': 5, 'n_estimators': 100, 'subsample': 1}
0.903 (+/-0.010) for {'max_depth': 5, 'n_estimators': 200, 'subsample': 0.8}
0.904 (+/-0.010) for {'max_depth': 5, 'n_estimators': 200, 'subsample': 1}
0.902 (+/-0.011) for {'max_depth': 5, 'n_estimators': 300, 'subsample': 0.8}
0.902 (+/-0.010) for {'max_depth': 5, 'n_estimators': 300, 'subsample': 1}

It looks like almost all combinations of hyperparameters result in an error that is close to the irreducible error caused by mismatches between the indicator of treatment and the omitted variable.
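
The size of that irreducible part can be traced back to the generation scheme: the treatment indicator disagrees with the omitted variable for 10% of the training items. Here is a small check (the original array is reconstructed because `unobserved` has been redefined above for the simulation datasets):

# Fraction of training items where treatment does not match the omitted variable.
original_unobserved = np.hstack((np.ones(10000), np.zeros(10000)))
print(np.mean(original_unobserved != treatment))  # 0.1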


In [21]:
y_hat = rgr.predict(X_test)
r2_score(y_test, y_hat)


Out[21]:
0.91262743224471077

The score is even closer to 1 than in the case of the linear model. Such a decent result deceptively suggests that all important variables are included in the model.

...and Poor Simulation


In [22]:
no_treatment_y_hat = rgr.predict(no_treatment_X)
r2_score(no_treatment_y, no_treatment_y_hat)


Out[22]:
0.47184792256463415

In [23]:
full_treatment_y_hat = rgr.predict(full_treatment_X)
r2_score(full_treatment_y, full_treatment_y_hat)


Out[23]:
0.5723373344573468

And now scores are not perfect, are they?


In [24]:
fig = plt.figure(figsize=(14, 7))

ax_one = fig.add_subplot(121)
ax_one.scatter(no_treatment_y_hat, no_treatment_y)
ax_one.set_title("Simulation of absence of treatment")
ax_one.set_xlabel("Predicted values")
ax_one.set_ylabel("True values")
ax_one.grid()

ax_two = fig.add_subplot(122, sharey=ax_one)
ax_two.scatter(full_treatment_y_hat, full_treatment_y)
ax_two.set_title("Simulation of treatment")
ax_two.set_xlabel("Predicted values")
ax_two.set_ylabel("True values")
_ = ax_two.grid()


It can be seen that the effect of treatment is overestimated. In case of absence of treatment, predictions for items with the unobserved feature equal to 1 are significantly less than the true values. To be more precise, the differences are close to the coefficient of the unobserved feature in the `weights_matrix` passed to the dataset creation function. Similarly, in case of full treatment, predictions for items with the unobserved feature equal to 0 are higher than the true values, and the differences are close to the abovementioned coefficient too.
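
This can be checked numerically. By construction of the simulation datasets, the first 2500 items have the unobserved feature equal to 1 and the last 2500 have it equal to 0, so the group-wise mean errors can be computed directly (a small sketch; both printed values should be close to `weights_matrix[0, 0] = 5`):

# Mean underestimation for items with the unobserved feature equal to 1 (no treatment).
print(np.mean(no_treatment_y[:2500] - no_treatment_y_hat[:2500]))
# Mean overestimation for items with the unobserved feature equal to 0 (full treatment).
print(np.mean(full_treatment_y_hat[2500:] - full_treatment_y[2500:]))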

Finally, let us simulate a wrong decision that the manager can make. Suppose that treatment costs one dollar per item and that every unit increase in the target variable creates value of one dollar as well.


In [25]:
estimated_effects = full_treatment_y_hat - no_treatment_y_hat
true_effects = full_treatment_y - no_treatment_y

In [26]:
np.min(estimated_effects)


Out[26]:
2.3798404

Since even the smallest estimated effect exceeds the cost of treatment (one dollar per item), the model recommends treating all items. What happens if all of them are treated?


In [27]:
cost_of_one_treatment = 1

In [28]:
estimated_net_improvement = (np.sum(estimated_effects) -
                             cost_of_one_treatment * estimated_effects.shape[0])
estimated_net_improvement


Out[28]:
19590.859375

In [29]:
true_net_improvement = (np.sum(true_effects) -
                        cost_of_one_treatment * true_effects.shape[0])
true_net_improvement


Out[29]:
-104.43062664346235

Suddenly, the manager ends up with a small loss instead of a solid profit.

Conclusion

It has been shown that formal metrics used in model evaluation may not reflect all sides of a problem. Sometimes the learning sample is biased and is not similar to the samples that require predictions. This often occurs when machine learning affects decisions and the new decisions differ from those that were made for objects in the learning sample.

The described case might look too artificial, but below are two real-world examples of similar issues:

  • The goal is to train a neural network to detect a certain phrase in recorded speech. Assume that, unfortunately, all occurrences of the phrase in the learning sample are recorded with the same microphone in the same room and this flaw is not known. As a result, the variable that indicates an occurrence of the phrase and the omitted variable that indicates the microphone used are confounded. The neural network cannot tell which of the two targets to learn, and so it produces good results during cross-validation and on a hold-out test set, but poor results in a production environment.

  • In labor econometrics, the problem of estimating the effect of higher education on wages is well-studied. A subtle issue here is that people with higher ability are more willing to earn degrees and also have better chances of getting bigger salaries. Ability is an unobserved variable and, in naive models, its effect is attributed mainly to higher education, because there is a strong correlation between these variables. Hence, naive modelling leads to overestimation of the effect of higher education.

Sections of the notebook that illustrate ways to mitigate the consequences of omitting important variables may be released after some time.