Load Libraries


In [1]:
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
import bootstrap_contrast as bsc

import pandas as pd
import numpy as np
import scipy as sp

Create dummy dataset

Here, we create a dummy dataset to illustrate how bootstrap-contrast functions. In this dataset, each column corresponds to a group of observations, and each row is simply an index number referring to an observation. (This is known as a 'wide' dataset.)


In [2]:
dataset=list()
for seed in [10,11,12,13,14,15]:
    np.random.seed(seed) # fix the seed so we get the same numbers each time.
    dataset.append(np.random.randn(40))
df=pd.DataFrame(dataset).T
cols=['Control','Group1','Group2','Group3','Group4','Group5']
df.columns=cols
# Create some upwards/downwards shifts.
df['Group2']=df['Group2']-0.1
df['Group3']=df['Group3']+0.2
df['Group4']=(df['Group4']*1.1)+4
df['Group5']=(df['Group5']*1.1)-1
# Add gender column.
df['Gender']=np.concatenate([np.repeat('Male',20),np.repeat('Female',20)])

Note that we have 6 groups of observations, with an additional non-numerical column indicating gender.

The bootstrap class

Here, we introduce a new class called bootstrap. Essentially, it will compute the summary statistic and its associated confidence interval using bootstrapping. It can do this for a single group of observations, or for two groups of observations (both paired and unpaired).

Below, I obtain the bootstrapped contrast for 'Control' and 'Group1' in df.


In [3]:
contr = bsc.bootstrap(df['Control'],df['Group1'])

As mentioned above, contr is a bootstrap object. Evaluating it on its own only displays its default representation, not the results.


In [4]:
contr


Out[4]:
<bootstrap_contrast.bootstrap_tools.bootstrap at 0x1064726d8>

It has several attributes. Of interest is its results attribute, a dictionary summarising the results of the contrast computation.


In [5]:
contr.results


Out[5]:
{'bca_ci_high': 0.21530883448325841,
 'bca_ci_low': -0.59272396423959439,
 'ci': 95.0,
 'is_difference': True,
 'is_paired': False,
 'stat_summary': -0.1808044652703821}
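To make the fields in this dictionary more concrete, here is a minimal sketch of a bootstrap for an unpaired mean difference, written with NumPy only. It is not the library's implementation: it uses the simpler percentile interval rather than the bias-corrected and accelerated (BCa) interval that the bca_ci_low and bca_ci_high keys suggest, and the function name, its parameters, and the stand-in data are all assumptions made for illustration.

```python
import numpy as np

def percentile_bootstrap_mean_diff(control, test, n_boot=5000, ci=95, seed=0):
    """Percentile bootstrap CI for mean(test) - mean(control), unpaired.

    Hypothetical helper for illustration; NOT part of bootstrap_contrast.
    """
    rng = np.random.default_rng(seed)
    control = np.asarray(control, dtype=float)
    test = np.asarray(test, dtype=float)
    boot_diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group independently, with replacement.
        c = rng.choice(control, size=control.size, replace=True)
        t = rng.choice(test, size=test.size, replace=True)
        boot_diffs[i] = t.mean() - c.mean()
    # Take the central `ci` percent of the bootstrap distribution.
    alpha = (100 - ci) / 2
    low, high = np.percentile(boot_diffs, [alpha, 100 - alpha])
    return {"stat_summary": test.mean() - control.mean(),
            "ci_low": low, "ci_high": high, "ci": ci}

# Stand-in data (assumed): two groups of 40 standard-normal observations.
data_rng = np.random.default_rng(1)
control = data_rng.normal(0.0, 1.0, 40)
test = data_rng.normal(0.0, 1.0, 40)

res = percentile_bootstrap_mean_diff(control, test)
print(res)
```

The point estimate is the observed difference in means, and the interval comes from the spread of the same difference recomputed on resampled data.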

is_paired indicates whether the two arrays were treated as paired (or repeated) observations. This is set with the paired flag.


In [6]:
contr_paired = bsc.bootstrap(df['Control'],df['Group1'],
                             paired=True)
contr_paired.results


Out[6]:
{'bca_ci_high': 0.23101685944171341,
 'bca_ci_low': -0.57186646678627306,
 'ci': 95.0,
 'is_difference': True,
 'is_paired': True,
 'stat_summary': -0.18080446527038205}
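Note that stat_summary is the same in Out[5] and Out[6] (up to floating-point rounding), while the interval changes. A rough sketch of why, assuming a paired bootstrap resamples whole pairs rather than each group separately: the mean of the per-pair differences equals the difference of the group means, but the resampling distribution differs. The function below is a hypothetical illustration using a percentile interval, not the library's BCa computation.

```python
import numpy as np

def paired_bootstrap_mean_diff(control, test, n_boot=5000, ci=95, seed=0):
    """Percentile bootstrap CI for the mean paired difference (test - control).

    Hypothetical helper for illustration; NOT part of bootstrap_contrast.
    """
    control = np.asarray(control, dtype=float)
    test = np.asarray(test, dtype=float)
    if control.shape != test.shape:
        raise ValueError("Paired data must have the same length.")
    diffs = test - control  # one difference per pair
    rng = np.random.default_rng(seed)
    # Resample pair indices with replacement; each row is one bootstrap sample.
    idx = rng.integers(0, diffs.size, size=(n_boot, diffs.size))
    boot_means = diffs[idx].mean(axis=1)
    alpha = (100 - ci) / 2
    low, high = np.percentile(boot_means, [alpha, 100 - alpha])
    return diffs.mean(), low, high

# Stand-in data (assumed): 40 pairs where the second measurement is shifted.
data_rng = np.random.default_rng(2)
before = data_rng.normal(0.0, 1.0, 40)
after = before + data_rng.normal(0.3, 0.5, 40)

stat, low, high = paired_bootstrap_mean_diff(before, after)
print(stat, low, high)
```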

is_difference indicates whether two arrays (True) or one array (False) were passed to the bootstrap function. Observe what happens if we give just one array.


In [7]:
just_control = bsc.bootstrap(df['Control'])
just_control.results


Out[7]:
{'bca_ci_high': 0.45933914074754423,
 'bca_ci_low': -0.13338863345306345,
 'ci': 95.0,
 'is_difference': False,
 'is_paired': False,
 'stat_summary': 0.17175621510073041}

Here, the confidence interval is with respect to the mean of the group Control.

There are several other statistics the bootstrap object contains. Please do have a look at its documentation. Below, I print the p-values for contr_paired as an example.


In [8]:
contr_paired.pvalue_2samp_paired_ttest


Out[8]:
0.39310007728828344

In [9]:
contr_paired.pvalue_wilcoxon


Out[9]:
0.35369319267722144
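The attribute names suggest a paired two-sample t-test and a Wilcoxon signed-rank test. As a cross-check, the same kinds of p-values can be computed directly with SciPy. This sketch assumes (without having verified it) that the library's values correspond to these standard tests; the data here are stand-ins, not the columns of df.

```python
import numpy as np
from scipy import stats

# Stand-in paired data (assumed): 40 pairs with a small shift.
rng = np.random.default_rng(0)
control = rng.normal(0.0, 1.0, 40)
test = control + rng.normal(0.2, 1.0, 40)

# Paired (related-samples) t-test.
p_ttest = stats.ttest_rel(control, test).pvalue
# Wilcoxon signed-rank test on the paired differences.
p_wilcoxon = stats.wilcoxon(control, test).pvalue

print(p_ttest, p_wilcoxon)
```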

Producing Plots

Version 0.3 of bootstrap-contrast has an optimised version of the contrastplot command.

Floating contrast plots—Two-group unpaired

Below we produce a Gardner-Altman floating contrast plot.

The contrastplot command will return 2 objects: a matplotlib Figure and a pandas DataFrame. In the Jupyter Notebook, with %matplotlib inline, the figure should automatically appear.

bsc.bootstrap will automatically drop any NaNs in the data. Note how the Ns (appended to the group names in the xtick labels) indicate the number of datapoints being plotted and used to calculate the contrasts.

The pandas DataFrame returned by bsc.contrastplot contains the pairwise comparisons made in the course of generating the plot, with confidence intervals (95% by default) and relevant p-values.


In [10]:
f, b = bsc.contrastplot(df,
                      idx=('Control','Group1'),
                      color_col='Gender',
                      fig_size=(4,6) # The length and width of the image, in inches.
                      )
b


Out[10]:
   reference_group  experimental_group  stat_summary  bca_ci_low  bca_ci_high    ci  is_difference  is_paired  pvalue_2samp_ind_ttest  pvalue_mannWhitney
0          Control              Group1     -0.180804   -0.594608     0.219648  95.0           True      False                0.395987            0.363178
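Since the first returned object is described as a matplotlib Figure, it can presumably be written to disk with the standard Figure.savefig method. The sketch below uses a plain matplotlib figure as a stand-in for f, and the file name and dpi are arbitrary choices.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Stand-in for the Figure returned by bsc.contrastplot (assumption).
f, ax = plt.subplots(figsize=(4, 6))
ax.plot([0, 1], [0, 1])

# Save to a temporary directory; in practice, use any path you like.
out_path = os.path.join(tempfile.mkdtemp(), "contrast_plot.png")
f.savefig(out_path, dpi=300, bbox_inches="tight")
plt.close(f)

print(out_path)
```

In a notebook that already uses %matplotlib inline, the matplotlib.use line is unnecessary.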

Floating contrast plots—Two-group paired


In [11]:
f, b = bsc.contrastplot(df,
                        idx=('Control','Group2'),
                        color_col='Gender',
                        paired=True,
                        fig_size=(4,6))
b


Out[11]:
   reference_group  experimental_group  stat_summary  bca_ci_low  bca_ci_high    ci  is_difference  is_paired  pvalue_2samp_paired_ttest  pvalue_wilcoxon
0          Control              Group2     -0.532006   -1.009002    -0.029927  95.0           True       True                    0.04253         0.038456

If you want to plot the raw swarmplot instead of the paired lines, set show_pairs=False. The contrasts computed will still be paired, as indicated by the DataFrame produced.


In [12]:
f, b = bsc.contrastplot(df,
                        idx=('Control','Group2'),
                        color_col='Gender',
                        paired=True,
                        show_pairs=False,
                        fig_size=(4,6))
b


Out[12]:
   reference_group  experimental_group  stat_summary  bca_ci_low  bca_ci_high    ci  is_difference  is_paired  pvalue_2samp_paired_ttest  pvalue_wilcoxon
0          Control              Group2     -0.532006   -1.004253    -0.035963  95.0           True       True                    0.04253         0.038456

Floating contrast plots—Multi-plot design

In a multi-plot design, you can horizontally tile two or more two-group floating-contrasts. This is designed to meet data visualization and presentation paradigms that are predominant in academic biomedical research.

This is done mainly through the idx option. You can pass two or more tuples; a separate subplot is created for each contrast.

The effect sizes and confidence intervals for each two-group plot will be computed.


In [13]:
f, b = bsc.contrastplot(df,
                        idx=(('Control','Group1'),
                             ('Group2','Group3')),
                        paired=True,
                        show_means='lines',
                        color_col='Gender')
b


Out[13]:
   reference_group  experimental_group  stat_summary  bca_ci_low  bca_ci_high    ci  is_difference  is_paired  pvalue_2samp_paired_ttest  pvalue_wilcoxon
0          Control              Group1     -0.180804   -0.582288     0.217477  95.0           True       True                   0.393100         0.353693
1           Group2              Group3      0.700802    0.235262     1.153145  95.0           True       True                   0.005299         0.004378

Hub-and-spoke plots

A common experimental design seen in contemporary biomedical research is a shared-control, or 'hub-and-spoke' design. Two or more experimental groups are compared to a common control group.

A hub-and-spoke plot implements estimation statistics and aesthetics on such an experimental design.

If more than 2 columns/groups are indicated in a tuple passed to idx, then contrastplot will produce a hub-and-spoke plot, where the first group in the tuple is considered the control group. The mean difference and confidence intervals of each subsequent group will be computed against the first control group.


In [14]:
f, b = bsc.contrastplot(df,
                        idx=df.columns[:-1],
                        color_col='Gender')
b


Out[14]:
   reference_group  experimental_group  stat_summary  bca_ci_low  bca_ci_high    ci  is_difference  is_paired  pvalue_2samp_ind_ttest  pvalue_mannWhitney
0          Control              Group1     -0.180804   -0.598903     0.232637  95.0           True      False            3.959867e-01        3.631777e-01
1          Control              Group2     -0.532006   -0.968608    -0.080709  95.0           True      False            2.157452e-02        1.358049e-02
2          Control              Group3      0.168796   -0.254775     0.602035  95.0           True      False            4.394990e-01        4.161598e-01
3          Control              Group4      3.709493    3.274405     4.130370  95.0           True      False            1.748492e-27        1.937605e-14
4          Control              Group5     -1.397284   -1.837762    -0.941801  95.0           True      False            4.761711e-08        4.267387e-07
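The stat_summary column of a hub-and-spoke comparison is described as the mean difference of each group against the first (control) group. That point estimate can be sketched directly with pandas. The code rebuilds the dummy dataset from In [2] with the same seeds, so the differences should be close to the stat_summary values above if the seeded generator behaves as in the original run; the variable names wide and mean_diffs are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the dummy dataset from In [2] (same seeds and shifts).
cols = ['Control', 'Group1', 'Group2', 'Group3', 'Group4', 'Group5']
data = {}
for name, seed in zip(cols, [10, 11, 12, 13, 14, 15]):
    np.random.seed(seed)
    data[name] = np.random.randn(40)
wide = pd.DataFrame(data)
wide['Group2'] = wide['Group2'] - 0.1
wide['Group3'] = wide['Group3'] + 0.2
wide['Group4'] = (wide['Group4'] * 1.1) + 4
wide['Group5'] = (wide['Group5'] * 1.1) - 1

# Mean of every spoke minus the mean of the hub (the first group).
means = wide.mean()
mean_diffs = means.drop('Control') - means['Control']
print(mean_diffs)
```

This reproduces only the point estimates; the confidence intervals require the bootstrap step.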

Hub-and-spoke plots—multi-plot design

You can also horizontally tile two or more hub-and-spoke plots.


In [15]:
f, b = bsc.contrastplot(df,
                        idx=(('Control','Group1'),('Group2','Group3'),
                             ('Group4','Group5')),
                        color_col='Gender')
b


Out[15]:
   reference_group  experimental_group  stat_summary  bca_ci_low  bca_ci_high    ci  is_difference  is_paired  pvalue_2samp_ind_ttest  pvalue_mannWhitney
0          Control              Group1     -0.180804   -0.585926     0.214224  95.0           True      False            3.959867e-01        3.631777e-01
1           Group2              Group3      0.700802    0.273723     1.126667  95.0           True      False            2.720823e-03        2.555804e-03
2           Group4              Group5     -5.106777   -5.533075    -4.636955  95.0           True      False            7.985186e-35        1.435085e-14
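Because the second returned object is an ordinary pandas DataFrame, it can be filtered like any other. As a small sketch, the code below re-creates a few columns of the table above by hand (values copied from Out[15]; the name b_example is illustrative) and keeps only the contrasts whose confidence interval excludes zero.

```python
import pandas as pd

# Hand-copied subset of the columns shown in Out[15].
b_example = pd.DataFrame({
    'reference_group': ['Control', 'Group2', 'Group4'],
    'experimental_group': ['Group1', 'Group3', 'Group5'],
    'stat_summary': [-0.180804, 0.700802, -5.106777],
    'bca_ci_low': [-0.585926, 0.273723, -5.533075],
    'bca_ci_high': [0.214224, 1.126667, -4.636955],
})

# An interval excludes zero if it lies entirely above or entirely below it.
excludes_zero = (b_example['bca_ci_low'] > 0) | (b_example['bca_ci_high'] < 0)
selected = b_example.loc[excludes_zero,
                         ['reference_group', 'experimental_group', 'stat_summary']]
print(selected)
```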

Controlling Aesthetics


In [16]:
# Changing the contrast y-limits.
f, b = bsc.contrastplot(df,
                       idx=('Control','Group1','Group2'),
                       color_col='Gender',
                       contrast_ylim=(-2,2))



In [17]:
# Changing the swarmplot y-limits.
f, b = bsc.contrastplot(df,
                       idx=('Control','Group1','Group2'),
                       color_col='Gender',
                       swarm_ylim=(-10,10))



In [18]:
# Changing the size of the dots in the swarmplot.
# This is done through swarmplot_kwargs, which accepts a dictionary.
# You can pass any keywords that sns.swarmplot can accept.
f, b = bsc.contrastplot(df,
                       idx=('Control','Group1','Group2'),
                       color_col='Gender',
                       swarmplot_kwargs={'size':10} 
                      )



In [19]:
# Custom y-axis labels.
f, b = bsc.contrastplot(df,
                       idx=('Control','Group1','Group2'),
                       color_col='Gender',
                       swarm_label='My Custom\nSwarm Label',
                       contrast_label='This is the\nContrast Plot'
                       )



In [20]:
# Showing a bar for the mean summary instead of a horizontal line.
f, b = bsc.contrastplot(df,
                       idx=('Control','Group1','Group4'),
                       color_col='Gender',
                       show_means='bars',
                       means_width=0.6 # Changes the width of the summary bar or the summary line.
                      )



In [21]:
# Passing a list as a custom palette.
f, b = bsc.contrastplot(df,
                        idx=('Control','Group1','Group4'),
                        color_col='Gender',
                        show_means='bars',
                        means_width=0.6,
                        custom_palette=['green', 'tomato'],
                       )



In [22]:
# Passing a dict as a custom palette.
f, b = bsc.contrastplot(df,
                        idx=('Control','Group1','Group4'),
                        color_col='Gender',
                        show_means='bars',
                        means_width=0.6,
                        custom_palette=dict(Male='grey', Female='green')
                       )



In [23]:
# Custom y-axis labels for both the swarmplot and the contrast plot.
f, b = bsc.contrastplot(df,
                        idx=('Control','Group1','Group4'),
                        color_col='Gender',
                        swarm_label='my swarm',
                        contrast_label='The\nContrasts' # add line break.
                       )


Appendix: On working with 'melted' DataFrames.

bsc.contrastplot can also work with 'melted' or 'longform' data. In this format, each row corresponds to a single datapoint, with one column carrying the value (my_metric in the example below) and other columns carrying 'metadata' describing that datapoint (in this case, group and Gender).

For more details on wide vs long or 'melted' data, see https://en.wikipedia.org/wiki/Wide_and_narrow_data

To read more about melting a dataframe, see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html


In [24]:
x='group'
y='my_metric'
color_col='Gender'

df_melt=pd.melt(df.reset_index(),
                id_vars=['index',color_col],
                value_vars=cols,value_name=y,var_name=x)

df_melt.head() # Gives the first five rows of `df_melt`.


Out[24]:
   index  Gender    group  my_metric
0      0    Male  Control   1.331587
1      1    Male  Control   0.715279
2      2    Male  Control  -1.545400
3      3    Male  Control  -0.008384
4      4    Male  Control   0.621336
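The melt step is reversible. As a small self-contained sketch (using a tiny made-up table rather than df, with the same column names as above), pandas' pivot turns the long form back into one column per group:

```python
import numpy as np
import pandas as pd

# Tiny made-up wide table: one column per group, plus a metadata column.
wide = pd.DataFrame({'Control': [1.0, 2.0, 3.0],
                     'Group1': [1.5, 2.5, 3.5],
                     'Gender': ['Male', 'Male', 'Female']})

# Wide -> long: one row per datapoint.
long = pd.melt(wide.reset_index(),
               id_vars=['index', 'Gender'],
               value_vars=['Control', 'Group1'],
               value_name='my_metric', var_name='group')

# Long -> wide again: one column per group, indexed by observation.
wide_again = long.pivot(index='index', columns='group', values='my_metric')
print(wide_again)
```

Keeping the original row number in the index column is what lets the observations line up again after pivoting.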

If you are using a melted DataFrame, you will need to specify the x (containing the categorical group names) and y (containing the numerical values for plotting) columns.


In [25]:
df_melt


Out[25]:
index Gender group my_metric
0 0 Male Control 1.331587
1 1 Male Control 0.715279
2 2 Male Control -1.545400
3 3 Male Control -0.008384
4 4 Male Control 0.621336
5 5 Male Control -0.720086
6 6 Male Control 0.265512
7 7 Male Control 0.108549
8 8 Male Control 0.004291
9 9 Male Control -0.174600
10 10 Male Control 0.433026
11 11 Male Control 1.203037
12 12 Male Control -0.965066
13 13 Male Control 1.028274
14 14 Male Control 0.228630
15 15 Male Control 0.445138
16 16 Male Control -1.136602
17 17 Male Control 0.135137
18 18 Male Control 1.484537
19 19 Male Control -1.079805
20 20 Female Control -1.977728
21 21 Female Control -1.743372
22 22 Female Control 0.266070
23 23 Female Control 2.384967
24 24 Female Control 1.123691
25 25 Female Control 1.672622
26 26 Female Control 0.099149
27 27 Female Control 1.397996
28 28 Female Control -0.271248
29 29 Female Control 0.613204
... ... ... ... ...
210 10 Male Group5 -1.220654
211 11 Male Group5 -0.609284
212 12 Male Group5 -0.241531
213 13 Male Group5 -0.548351
214 14 Male Group5 -1.621476
215 15 Male Group5 -0.340670
216 16 Male Group5 -1.179230
217 17 Male Group5 0.760236
218 18 Male Group5 -0.250210
219 19 Male Group5 -0.983632
220 20 Female Group5 -1.096558
221 21 Female Group5 -2.080330
222 22 Female Group5 -0.866140
223 23 Female Group5 -2.251181
224 24 Female Group5 -0.616097
225 25 Female Group5 -3.044364
226 26 Female Group5 -2.283900
227 27 Female Group5 0.567387
228 28 Female Group5 0.646222
229 29 Female Group5 0.418925
230 30 Female Group5 -2.992920
231 31 Female Group5 -2.648138
232 32 Female Group5 -2.595158
233 33 Female Group5 -2.863298
234 34 Female Group5 -0.750010
235 35 Female Group5 -1.538708
236 36 Female Group5 -1.000581
237 37 Female Group5 -1.539278
238 38 Female Group5 -1.872530
239 39 Female Group5 1.253789

240 rows × 4 columns


In [26]:
df


Out[26]:
Control Group1 Group2 Group3 Group4 Group5 Gender
0 1.331587 1.749455 0.372986 -0.512391 5.706473 -1.343561 Male
1 0.715279 -0.286073 -0.781426 0.953766 4.087105 -0.626787 Male
2 -1.545400 -0.484565 0.142439 0.155497 4.191374 -1.171499 Male
3 -0.008384 -2.653319 -1.800736 0.651812 3.920430 -1.551969 Male
4 0.621336 -0.008285 0.653143 1.545102 1.795238 -0.740874 Male
5 -0.720086 -0.319631 -1.634721 0.732338 4.159146 -2.939966 Male
6 0.265512 -0.536629 -0.094873 1.550188 2.348715 -2.205448 Male
7 0.108549 0.315403 -0.220228 1.061211 4.232220 -2.196542 Male
8 0.004291 0.421051 -0.906982 1.678686 3.385974 -1.335687 Male
9 -0.174600 -1.065603 2.771819 -0.845377 5.192982 -1.521123 Male
10 0.433026 -0.886240 -0.697823 -0.588989 3.795082 -1.220654 Male
11 1.203037 -0.475733 0.372457 -1.061606 4.016128 -0.609284 Male
12 -0.965066 0.689682 0.995956 0.762847 2.816874 -0.241531 Male
13 1.028274 0.561192 -1.315169 -0.043326 4.706477 -0.548351 Male
14 0.228630 -1.305549 1.242356 1.113741 3.801630 -1.621476 Male
15 0.445138 -1.119475 -0.222150 0.517351 4.682330 -0.340670 Male
16 -1.136602 0.736837 0.912515 0.327303 4.892072 -1.179230 Male
17 0.135137 1.574634 -1.013869 2.350383 4.855729 0.760236 Male
18 1.484537 -0.031075 -1.129530 0.806289 3.738761 -0.250210 Male
19 -1.079805 -0.683447 1.109796 0.173228 1.918896 -0.983632 Male
20 -1.977728 1.095630 0.401872 -0.784161 2.710666 -1.096558 Female
21 -1.743372 -0.309577 0.038846 1.390705 4.919828 -2.080330 Female
22 0.266070 0.725752 0.540761 1.152831 5.110201 -0.866140 Female
23 2.384967 1.549072 0.427333 -0.887182 5.422409 -2.251181 Female
24 1.123691 0.630080 -1.254360 0.054789 3.395736 -0.616097 Female
25 1.672622 0.073493 -2.313333 0.437858 2.920116 -3.044364 Female
26 0.099149 0.732271 -1.781757 -1.439093 5.006140 -2.283900 Female
27 1.397996 -0.642575 -1.888094 -0.078135 4.960377 0.567387 Female
28 -0.271248 -0.178093 -2.318535 1.599238 4.024322 0.646222 Female
29 0.613204 -0.573955 -0.747431 -1.415108 3.995442 0.418925 Female
30 -0.267317 -0.204375 -0.628404 0.690872 2.515961 -2.992920 Female
31 -0.549309 -0.486495 -0.139209 2.092742 3.770960 -2.648138 Female
32 0.132708 -0.185775 0.114976 -0.420980 5.459194 -2.595158 Female
33 -0.476142 -0.380536 -0.484359 -0.253752 2.992113 -2.863298 Female
34 1.308473 0.088978 -0.353904 0.417452 3.482986 -0.750010 Female
35 0.195013 0.063672 -0.026748 0.714329 3.835613 -1.538708 Female
36 0.400210 0.296347 -1.097204 0.597241 3.642214 -1.000581 Female
37 -0.337632 1.402771 -0.813856 -1.312845 2.013704 -1.539278 Female
38 1.256472 -1.546863 -0.064584 -0.564034 3.449593 -1.872530 Female
39 -0.731970 1.295619 -0.777945 0.301270 3.378741 1.253789 Female

In [27]:
f, b = bsc.contrastplot(df_melt,
                        x='group',
                        y='my_metric',
                        fig_size=(4,6),
                        idx=('Control','Group1'),
                        color_col='Gender',
                        paired=True
                       )
b


Out[27]:
reference_group experimental_group stat_summary bca_ci_low bca_ci_high ci is_difference is_paired pvalue_2samp_paired_ttest pvalue_wilcoxon
0 Control Group1 -0.180804 -0.574237 0.250589 95.0 True True 0.3931 0.353693
