Model


In [ ]:
Markdown('Model used: **{}**'.format(model_name))

In [ ]:
Markdown('Number of features in model: **{}**'.format(len(features_used)))

In [ ]:
builtin_ols_models = ['LinearRegression',
                      'EqualWeightsLR',
                      'RebalancedLR',
                      'NNLR',
                      'LassoFixedLambdaThenNNLR',
                      'LassoFixedLambdaThenLR',
                      'PositiveLassoCVThenLR',
                      'WeightedLeastSquares']

builtin_lasso_models = ['LassoFixedLambda',
                        'PositiveLassoCV']

In [ ]:
# we first just show a summary of the OLS model and the main model parameters
if model_name in builtin_ols_models:
    display(Markdown('### Model summary'))
    summary_file = join(output_dir, '{}_ols_summary.txt'.format(experiment_id))
    with open(summary_file, 'r') as summf:
        model_summary = summf.read()
        print(model_summary)
     
    display(Markdown('### Model fit'))
    df_fit = DataReader.read_from_file(join(output_dir, '{}_model_fit.{}'.format(experiment_id,
                                                                                 file_format)))
    display(HTML(df_fit.to_html(index=False,
                                float_format=float_format_func)))

Standardized and relative regression coefficients (betas)

The relative coefficients are intended to show the relative contribution of each feature; their primary purpose is to identify whether any single feature has a disproportionate effect on the final score. They are computed as standardized/(sum of absolute values of standardized coefficients).

Negative standardized coefficients are highlighted.
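The highlighting is applied via the `color_highlighter` formatter passed to `to_html` below. Its exact implementation is not shown here; a minimal sketch of how such a formatter could work (the function body and styling are assumptions, not the actual helper) looks like this:

```python
import pandas as pd

def color_highlighter(value):
    # hypothetical sketch: wrap negative values in a styled span so they
    # stand out in the rendered HTML table (requires escape=False in to_html)
    if value < 0:
        return '<span style="color: red">{:.3f}</span>'.format(value)
    return '{:.3f}'.format(value)

df = pd.DataFrame({'feature': ['a', 'b'],
                   'standardized': [0.5, -0.2]})
html = df.to_html(index=False, escape=False,
                  formatters={'standardized': color_highlighter})
```

Note that `escape=False` is required so that the HTML produced by the formatter is rendered rather than shown as literal text.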

Note: if the model contains negative coefficients, the relative values will not sum to one and their interpretation is generally questionable.
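The computation described above can be sketched as follows (the coefficient values here are made up for illustration):

```python
import pandas as pd

# hypothetical standardized coefficients for three features
df_betas = pd.DataFrame({'feature': ['f1', 'f2', 'f3'],
                         'standardized': [0.6, -0.1, 0.3]})

# relative = standardized / (sum of absolute values of standardized coefficients)
df_betas['relative'] = (df_betas['standardized'] /
                        df_betas['standardized'].abs().sum())

# with a negative coefficient present, the relative values sum to 0.8, not 1.0,
# which is why their interpretation becomes questionable
total = df_betas['relative'].sum()
```

The absolute relative values always sum to one by construction; it is only the signed sum that deviates when negative coefficients are present.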


In [ ]:
markdown_str = """
**Note**: The coefficients were estimated using LASSO regression. Unlike OLS (standard) linear regression, LASSO estimation is based on an optimization routine, so the exact estimates may differ slightly across systems. """

if model_name in builtin_lasso_models:
    display(Markdown(markdown_str))

In [ ]:
df_betas.sort_values(by='feature', inplace=True)
display(HTML(df_betas.to_html(classes=['sortable'], 
                              index=False, 
                              escape=False,
                              float_format=float_format_func,
                              formatters={'standardized': color_highlighter})))

Here are the same values, shown graphically.


In [ ]:
df_betas_sorted = df_betas.sort_values(by='standardized', ascending=False)
df_betas_sorted.reset_index(drop=True, inplace=True)
fig = plt.figure()
fig.set_size_inches(8, 3)
fig.subplots_adjust(bottom=0.5)
grey_colors = sns.color_palette('Greys', len(features_used))[::-1]
with sns.axes_style('whitegrid'):
    ax1 = fig.add_subplot(121)
    sns.barplot(x="feature", y="standardized", data=df_betas_sorted,
                order=df_betas_sorted['feature'].values,
                palette=sns.color_palette("Greys", 1), ax=ax1)
    ax1.set_xticklabels(df_betas_sorted['feature'].values, rotation=90)
    ax1.set_title('Values of standardized coefficients')
    ax1.set_xlabel('')
    ax1.set_ylabel('')
    # no pie chart if we have more than 15 features or if the feature names are long (pie chart looks ugly)
    if len(features_used) <= 15 and longest_feature_name <= 10:
        ax2 = fig.add_subplot(133, aspect='equal')
        ax2.pie(abs(df_betas_sorted['relative'].values), colors=grey_colors,
                labels=df_betas_sorted['feature'].values)
        ax2.set_title('Proportional contribution of each feature')
    else:
        fig.set_size_inches(len(features_used), 3)
    betas_file = join(figure_dir, '{}_betas.svg'.format(experiment_id))
    plt.savefig(betas_file)

    if use_thumbnails:
        show_thumbnail(betas_file, next(id_generator))
    else:
        plt.show()

In [ ]:
if model_name in builtin_ols_models:
    display(Markdown('### Model diagnostics'))
    display(Markdown("These are standard diagnostic plots for the main model. All information is computed on the training set."))

In [ ]:
# read in the OLS model file and create the diagnostics plots
if model_name in builtin_ols_models:
    ols_file = join(output_dir, '{}.ols'.format(experiment_id))
    with open(ols_file, 'rb') as olsf:
        model = pickle.load(olsf)
    model_predictions = model.predict()

    with sns.axes_style('white'):
        f, (ax1, ax2) = plt.subplots(1, 2)
        f.set_size_inches((10, 4))
        
        ###
        # for now, we do not show the influence plot since it can be slow to generate
        ###
        # sm.graphics.influence_plot(model.sm_ols, criterion="cooks", size=10, ax=ax1)
        # ax1.set_title('Residuals vs. Leverage', fontsize=16)
        # ax1.set_xlabel('Leverage', fontsize=16)
        # ax1.set_ylabel('Standardized Residuals', fontsize=16)

        sm.qqplot(model.resid, stats.norm, fit=True, line='q', ax=ax1)
        ax1.set_title('Normal Q-Q Plot', fontsize=16)
        ax1.set_xlabel('Theoretical Quantiles', fontsize=16)
        ax1.set_ylabel('Sample Quantiles', fontsize=16)

        ax2.scatter(model_predictions, model.resid)
        ax2.set_xlabel('Fitted values', fontsize=16)
        ax2.set_ylabel('Residuals', fontsize=16)
        ax2.set_title('Residuals vs. Fitted', fontsize=16)

        imgfile = join(figure_dir, '{}_ols_diagnostic_plots.png'.format(experiment_id))
        plt.savefig(imgfile)

        if use_thumbnails:
            show_thumbnail(imgfile, next(id_generator))
        else:
            display(Image(imgfile))
        plt.close()