Additional fairness analyses


In [ ]:
# show a note for ETS users
from rsmtool import HAS_RSMEXTRA
if HAS_RSMEXTRA:
    from rsmextra.settings import fairness_note
    display(Markdown(fairness_note))

This section presents additional fairness analyses described in detail in Loukina et al. (2019). These analyses consider different definitions of fairness separately and can assist in further troubleshooting of observed subgroup differences. The evaluation focuses on three dimensions:

  • Outcome fairness measures:

    • Overall score accuracy: whether the automated scores are equally accurate for each group. The metric shows how much of the variance in squared error $(S-H)^2$ is explained by subgroup membership.

    • Overall score difference: whether the automated scores are consistently different from human scores for members of a certain group. The metric shows how much of the variance in actual error $S-H$ is explained by subgroup membership.

Differences in the outcome fairness measures might be due to different mean scores (different score distributions) across subgroups, to differential treatment of different subgroups by the scoring engine, or to both.

  • Process fairness measures:

    • Conditional score difference: whether the automated scoring model assigns different scores to members from different groups despite them having the same construct proficiency. The metric shows how much additional variance in actual error ($S-H$) is explained by subgroup membership after controlling for human score, which can be thought of as a reasonable proxy for proficiency.

Differences in the process fairness measures indicate differential treatment of different subgroups by the scoring model. The cell below sketches how all three metrics can be estimated with ordinary least squares regressions.
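
The sketch is purely illustrative and is not the rsmtool implementation: the data frame `df` and its columns `sc1` (human score), `score` (system score), and `group` (subgroup membership) are assumptions made for this example, and the exact model specifications used by rsmtool may differ.


In [ ]:
# Illustrative sketch only (NOT the rsmtool implementation): estimate the three
# fairness metrics with OLS regressions. The data frame `df` and its columns
# 'sc1' (human score), 'score' (system score), and 'group' (subgroup membership)
# are assumptions made for this example.
import statsmodels.formula.api as smf

def fairness_r2_sketch(df):
    df = df.copy()
    df['error'] = df['score'] - df['sc1']    # signed error, S - H
    df['sq_error'] = df['error'] ** 2        # squared error, (S - H)^2

    # overall score accuracy (OSA): variance in squared error
    # explained by subgroup membership
    osa = smf.ols('sq_error ~ C(group)', data=df).fit()

    # overall score difference (OSD): variance in signed error
    # explained by subgroup membership
    osd = smf.ols('error ~ C(group)', data=df).fit()

    # conditional score difference (CSD): additional variance in signed error
    # explained by subgroup membership after controlling for the human score
    null_model = smf.ols('error ~ C(sc1)', data=df).fit()
    full_model = smf.ols('error ~ C(sc1) + C(group)', data=df).fit()

    return {'OSA R2': osa.rsquared,
            'OSD R2': osd.rsquared,
            'CSD R2': full_model.rsquared - null_model.rsquared}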


In [7]:
# define an auxiliary function that we will need
# for the plots later on
def errplot(x, y, xerr, **kwargs):
    ax = plt.gca()
    data = kwargs.pop("data")
    # discard the color passed in by the caller since we
    # use our own palette (`colors`, defined further below)
    kwargs.pop("color", None)
    data.plot(x=x, y=y, xerr=xerr,
              kind="barh", ax=ax,
              color=colors,
              **kwargs)

In [ ]:
# check if we already created the merged data frame in another notebook

try:
    df_pred_preproc_merged
except NameError:
    df_pred_preproc_merged = pd.merge(df_pred_preproc, df_test_metadata, on='spkitemid')
    
# check which score we are using
system_score_column = "scale_trim" if use_scaled_predictions else "raw_trim"

for group in groups_eval:
    
    display(Markdown("### Additional fairness evaluation by {}".format(group)))
    
    # run the actual analyses. We are currently doing this in the notebook
    # so that it is easy to skip these analyses if necessary. The notebook
    # and analyses are set up in a way that will make it easy to move them
    # to the main pipeline in the future and read in the outputs here.
    fit_dict, fairness_container = get_fairness_analyses(df_pred_preproc_merged,
                                                         group,
                                                         system_score_column)
    
    # write the results to disk so that we can consider them in tests
    write_fairness_results(fit_dict,
                           fairness_container,
                           group,
                           output_dir,
                           experiment_id,
                           file_format)
    
    # first show summary results
    df_fairness = fairness_container['fairness_metrics_by_{}'.format(group)]

    
    display(Markdown("The summary table shows the overall score accuracy (OSA), overall score difference (OSD) "
                    "and conditional score difference (CSD). The first row reports the percentage of variance, "
                    " explained by group membership, the second row shows $p$ value. "
                     "{} was used as a reference category. "
                    "Larger values of R2 indicate larger differences between subgroups. " 
                    "Further detail about each model can be found in [intermediate "
                    "output files](#Links-to-Intermediate-Files).".format(df_fairness['base_category'].values[0])))
    
    display(HTML(df_fairness.loc[['R2', 'sig'],
                                 ['Overall score accuracy',
                                  'Overall score difference',
                                  'Conditional score difference']].to_html(classes='sortable',
                                                                     float_format=float_format_func)))

    
    markdown_str = [("The plots show error estimates for the different categories within each group "
                     "(squared error for OSA, raw error for OSD, and conditional raw error for CSD). "
                     "The estimates have been adjusted by adding the value of the group used as the Intercept. "
                     "Black lines show 95% confidence intervals estimated by the model.")]
    
    # if we only care about groups above threshold, identify those.
    
    category_counts = df_pred_preproc_merged[group].value_counts()        
    
    if group in min_n_per_group:
        markdown_str.append("While the models were fit to all the data, the plots only show estimates for "
                            "categories with at least {} members and for the Intercept ({}).".format(min_n_per_group[group],
                                                                                                     df_fairness['base_category'].values[0]))
        
        groups_by_size = category_counts[category_counts >= min_n_per_group[group]].index
        df_pred_preproc_selected = df_pred_preproc_merged[df_pred_preproc_merged[group].isin(groups_by_size)].copy()
    else:
        groups_by_size = category_counts.index
        df_pred_preproc_selected = df_pred_preproc_merged.copy()
    
    display(Markdown('\n'.join(markdown_str)))
   
    
    if len(groups_by_size) > 0:

        # assemble all coefficients into a long data frame
        all_coefs = []
        for metrics in ['osa', 'csd', 'osd']:
            df_metrics = fairness_container['estimates_{}_by_{}'.format(metrics, group)].copy()
            # compute adjusted error estimates by adding the value of the Intercept
            # to all non-Intercept values
            non_index_cols = [r for r in df_metrics.index if "Intercept" not in r]
            index_col = [r for r in df_metrics.index if "Intercept" in r]
            df_metrics['error_estimate'] = df_metrics['estimate']
            df_metrics.loc[non_index_cols, 'error_estimate'] = (df_metrics.loc[non_index_cols, 'estimate'] +
                                                                df_metrics.loc[index_col, 'estimate'].values)
            # create a column for metrics
            df_metrics['metrics'] = metrics
            # only use groups with values above threshold and the intercept
            df_metrics[group] = df_metrics.index
            df_metrics_selected = df_metrics[df_metrics[group].isin(groups_by_size) |
                                             (df_metrics[group] == index_col[0])]
            all_coefs.append(df_metrics_selected)


        # show coefficient plots
        # define groups and color palette
        colors = sns.color_palette("Greys_r", len(groups_by_size))

        df_coefs_all = pd.concat(all_coefs)

        # compute the half-width of the 95% confidence interval
        # (estimate minus the lower boundary) to use for the error bars
        df_coefs_all['CI'] = df_coefs_all['estimate'] - df_coefs_all['[0.025']

        # plot the coefficients
        with sns.axes_style('whitegrid'), sns.plotting_context('notebook', font_scale=2):
            g = sns.FacetGrid(df_coefs_all, col="metrics",
                              height=10, col_order=['osa', 'osd', 'csd'])
            g.map_dataframe(errplot, group, "error_estimate", "CI").set_axis_labels("Error estimate", group)

            imgfile = join(figure_dir, '{}_fairness_estimates_{}.svg'.format(experiment_id, group))
            plt.savefig(imgfile)
            if use_thumbnails:
                show_thumbnail(imgfile, next(id_generator))
            else:
                plt.show()


        # Show the conditional score plot
        markdown_str = [("The plot shows average {} system score for each "
                        "group conditioned "
                         "on human score.".format(system_score_column))]
        if group in min_n_per_group:
            markdown_str.append("The plot only shows estimates for "
                               "categories with more than {} members and the"
                                "reference category ({}).".format(min_n_per_group[group],
                                                          df_fairness['base_category'].values[0]))
        
        display(Markdown('\n'.join(markdown_str)))
        


        with sns.axes_style('whitegrid'), sns.plotting_context('notebook', font_scale=1.2):
            p = sns.catplot(x='sc1', y=system_score_column,
                            hue=group,
                            hue_order=groups_by_size,
                            palette=colors,
                            legend_out=True,
                            kind="point",
                            data=df_pred_preproc_selected)

            #plt.tight_layout(h_pad=1.0)
            imgfile = join(figure_dir, '{}_conditional_score_{}.svg'.format(experiment_id, group))
            plt.savefig(imgfile)
            if use_thumbnails:
                show_thumbnail(imgfile, next(id_generator))
            else:
                plt.show()
    else:
        display(Markdown("None of the groups in {} had {} or more responses in the evaluation set.".format(group,
                                                                                     min_n_per_group[group])))