In [1]:
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import bootstrap_contrast as bsc
import pandas as pd
import numpy as np
import scipy as sp
In [2]:
dataset=list()
for seed in [10,11,12,13,14,15]:
    np.random.seed(seed) # fix the seed so we get the same numbers each time.
    dataset.append(np.random.randn(40))
df=pd.DataFrame(dataset).T
cols=['Control','Group1','Group2','Group3','Group4','Group5']
df.columns=cols
# Create some upwards/downwards shifts.
df['Group2']=df['Group2']-0.1
df['Group3']=df['Group3']+0.2
df['Group4']=(df['Group4']*1.1)+4
df['Group5']=(df['Group5']*1.1)-1
# Add gender column.
df['Gender']=np.concatenate([np.repeat('Male',20),np.repeat('Female',20)])
Note that we have 6 groups of observations, with an additional non-numerical column indicating gender.
bootstrap class
Here, we introduce a new class called bootstrap. Essentially, it will compute the summary statistic and its associated confidence interval using bootstrapping. It can do this for a single group of observations, or for two groups of observations (both paired and unpaired).
Below, I obtain the bootstrapped contrast for 'Control' and 'Group1' in df.
In [3]:
contr = bsc.bootstrap(df['Control'],df['Group1'])
As mentioned above, contr is a bootstrap object. Evaluating it on its own in a cell will not display any results.
In [4]:
contr
Out[4]:
It has several attributes. Of interest is its results attribute, which returns a dictionary summarising the results of the contrast computation.
In [5]:
contr.results
Out[5]:
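To make concrete what a bootstrapped contrast involves, here is a minimal sketch of a percentile bootstrap for the difference in means of two unpaired groups, using NumPy alone. The function name bootstrap_mean_diff and its parameters are hypothetical, and this is not the bootstrap_contrast implementation, which may use a different interval method.

```python
import numpy as np

def bootstrap_mean_diff(x, y, reps=5000, ci=95, seed=0):
    """Percentile-bootstrap CI for mean(y) - mean(x), unpaired groups.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Resample each group independently, with replacement.
    boot_x = rng.choice(x, size=(reps, x.size), replace=True).mean(axis=1)
    boot_y = rng.choice(y, size=(reps, y.size), replace=True).mean(axis=1)
    boot_diffs = boot_y - boot_x
    # Take the central `ci` percent of the resampled differences.
    alpha = (100 - ci) / 2
    low, high = np.percentile(boot_diffs, [alpha, 100 - alpha])
    return y.mean() - x.mean(), low, high

# Same construction as 'Control' and 'Group1' in df.
np.random.seed(10)
control = np.random.randn(40)
np.random.seed(11)
group1 = np.random.randn(40)

diff, low, high = bootstrap_mean_diff(control, group1)
print(diff, low, high)
```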
is_paired indicates whether the two arrays are paired (or repeated) observations. This is set with the paired flag.
In [6]:
contr_paired = bsc.bootstrap(df['Control'],df['Group1'],
paired=True)
contr_paired.results
Out[6]:
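For paired observations, a common approach is to resample the per-subject differences rather than the two groups independently, so that each pair stays intact. The sketch below assumes that approach; bootstrap_paired_diff and the before/after arrays are hypothetical names, and the library's own method may differ.

```python
import numpy as np

def bootstrap_paired_diff(x, y, reps=5000, ci=95, seed=0):
    """Percentile-bootstrap CI for the mean of the paired differences y - x.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    d = np.asarray(y, dtype=float) - np.asarray(x, dtype=float)
    # Resample pair indices with replacement; each draw keeps a pair together.
    idx = rng.integers(0, d.size, size=(reps, d.size))
    boot_means = d[idx].mean(axis=1)
    alpha = (100 - ci) / 2
    low, high = np.percentile(boot_means, [alpha, 100 - alpha])
    return d.mean(), low, high

# Made-up repeated measurements: a shift of 0.5 plus a little noise.
rng = np.random.default_rng(1)
before = rng.normal(0.0, 1.0, 40)
after = before + 0.5 + rng.normal(0.0, 0.1, 40)

mean_d, low_d, high_d = bootstrap_paired_diff(before, after)
print(mean_d, low_d, high_d)
```

Because the between-subject spread cancels out in the differences, the paired interval here is much narrower than an unpaired interval on the same data would be.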
is_difference indicates whether one or two arrays were passed to the bootstrap function. Observe what happens if we just give one array.
In [7]:
just_control = bsc.bootstrap(df['Control'])
just_control.results
Out[7]:
Here, the confidence interval is with respect to the mean of the group Control.
There are several other statistics the bootstrap object contains. Please do have a look at its documentation. Below, I print the p-values for contr_paired as an example.
In [8]:
contr_paired.pvalue_2samp_paired_ttest
Out[8]:
In [9]:
contr_paired.pvalue_wilcoxon
Out[9]:
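For comparison, the same two kinds of p-value can be computed directly with SciPy. This sketch assumes that the attributes above correspond to a paired t-test and a Wilcoxon signed-rank test, as their names suggest; the notebook does not confirm how the library computes them.

```python
import numpy as np
from scipy import stats

# Same construction as 'Control' and 'Group1' in df.
np.random.seed(10)
control = np.random.randn(40)
np.random.seed(11)
group1 = np.random.randn(40)

# Paired (related-samples) t-test.
t_stat, p_ttest = stats.ttest_rel(control, group1)
# Wilcoxon signed-rank test on the paired differences.
w_stat, p_wilcoxon = stats.wilcoxon(control, group1)

print(p_ttest, p_wilcoxon)
```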
Below we produce three aligned Gardner-Altman floating contrast plots.
The contrastplot command will return 2 objects: a matplotlib Figure and a pandas DataFrame.
In the Jupyter Notebook, with %matplotlib inline, the figure should automatically appear.
bsc.bootstrap will automatically drop any NaNs in the data. Note how the Ns (appended to the group names in the xtick labels) indicate the number of datapoints being plotted, and used to calculate the contrasts.
The pandas DataFrame returned by bsc.contrastplot contains the pairwise comparisons made in the course of generating the plot, with confidence intervals (95% by default) and relevant p-values.
In [10]:
f, b = bsc.contrastplot(df,
idx=('Control','Group1'),
color_col='Gender',
fig_size=(4,6) # The length and width of the image, in inches.
)
b
Out[10]:
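The effect of dropping NaNs on the reported Ns can be checked by hand. The sketch below uses a small made-up DataFrame (df_nan is a hypothetical name) and counts the non-missing values per group with pandas; it does not reproduce the library's label formatting.

```python
import numpy as np
import pandas as pd

# Made-up data with missing values in both groups.
df_nan = pd.DataFrame({'Control': [1.0, 2.0, np.nan, 4.0],
                       'Group1': [2.0, np.nan, np.nan, 5.0]})

# Number of datapoints left in each group once NaNs are dropped.
counts = {name: int(df_nan[name].notna().sum()) for name in df_nan.columns}
print(counts)  # {'Control': 3, 'Group1': 2}
```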
In [11]:
f, b = bsc.contrastplot(df,
idx=('Control','Group2'),
color_col='Gender',
paired=True,
fig_size=(4,6))
b
Out[11]:
If you want to plot the raw swarmplot instead of the paired lines, set show_pairs=False. The contrasts computed will still be paired, as indicated by the DataFrame produced.
In [12]:
f, b = bsc.contrastplot(df,
idx=('Control','Group2'),
color_col='Gender',
paired=True,
show_pairs=False,
fig_size=(4,6))
b
Out[12]:
In a multi-plot design, you can horizontally tile two or more two-group floating-contrasts. This is designed to meet data visualization and presentation paradigms that are predominant in academic biomedical research.
This is done mainly through the idx option. You can pass two or more tuples to create a separate subplot for each contrast.
The effect sizes and confidence intervals for each two-group plot will be computed.
In [13]:
f, b = bsc.contrastplot(df,
idx=(('Control','Group1'),
('Group2','Group3')),
paired=True,
show_means='lines',
color_col='Gender')
b
Out[13]:
A common experimental design seen in contemporary biomedical research is a shared-control, or 'hub-and-spoke' design. Two or more experimental groups are compared to a common control group.
A hub-and-spoke plot implements estimation statistics and aesthetics on such an experimental design.
If more than 2 columns/groups are indicated in a tuple passed to idx, then contrastplot will produce a hub-and-spoke plot, where the first group in the tuple is considered the control group. The mean difference and confidence intervals of each subsequent group will be computed against the first control group.
In [14]:
f, b = bsc.contrastplot(df,
idx=df.columns[:-1],
color_col='Gender')
b
Out[14]:
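The point estimates in a shared-control design are simply each group's mean minus the control mean. The following sketch computes them with pandas on made-up data; demo and mean_diffs are hypothetical names, and the confidence intervals that contrastplot also reports are not reproduced here.

```python
import numpy as np
import pandas as pd

# Made-up data: one control group and two experimental groups.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Control': rng.normal(0.0, 1.0, 40),
    'Group1': rng.normal(0.5, 1.0, 40),
    'Group2': rng.normal(-1.0, 1.0, 40),
})

# The first column is treated as the shared control.
control, *others = demo.columns
mean_diffs = {name: demo[name].mean() - demo[control].mean()
              for name in others}
print(mean_diffs)
```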
In [15]:
f, b = bsc.contrastplot(df,
idx=(('Control','Group1'),('Group2','Group3'),
('Group4','Group5')),
color_col='Gender')
b
Out[15]:
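One way to read an idx specification is as a list of (control, test) comparisons, with the first name in each tuple acting as the reference. The helper below, comparisons_from_idx, is a hypothetical illustration of that reading and is not part of bootstrap_contrast.

```python
def comparisons_from_idx(idx):
    """List the (control, test) pairs implied by an idx-style specification.

    Hypothetical helper for illustration only.
    """
    # A flat sequence of names describes a single plot; wrap it so that
    # flat and nested specifications are handled the same way.
    if all(isinstance(item, str) for item in idx):
        idx = (idx,)
    pairs = []
    for group in idx:
        control, *tests = group
        pairs.extend((control, test) for test in tests)
    return pairs

two_group = comparisons_from_idx(('Control', 'Group1'))
hub = comparisons_from_idx(('Control', 'Group1', 'Group2'))
tiled = comparisons_from_idx((('Control', 'Group1'), ('Group2', 'Group3')))

print(two_group)  # [('Control', 'Group1')]
print(hub)        # [('Control', 'Group1'), ('Control', 'Group2')]
print(tiled)      # [('Control', 'Group1'), ('Group2', 'Group3')]
```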
In [16]:
# Changing the contrast y-limits.
f, b = bsc.contrastplot(df,
idx=('Control','Group1','Group2'),
color_col='Gender',
contrast_ylim=(-2,2))
In [17]:
# Changing the swarmplot y-limits.
f, b = bsc.contrastplot(df,
idx=('Control','Group1','Group2'),
color_col='Gender',
swarm_ylim=(-10,10))
In [18]:
# Changing the size of the dots in the swarmplot.
# This is done through swarmplot_kwargs, which accepts a dictionary.
# You can pass any keywords that sns.swarmplot can accept.
f, b = bsc.contrastplot(df,
idx=('Control','Group1','Group2'),
color_col='Gender',
swarmplot_kwargs={'size':10}
)
In [19]:
# Custom y-axis labels.
f, b = bsc.contrastplot(df,
idx=('Control','Group1','Group2'),
color_col='Gender',
swarm_label='My Custom\nSwarm Label',
contrast_label='This is the\nContrast Plot'
)
In [20]:
# Showing a bar for the mean summary instead of a horizontal line.
f, b = bsc.contrastplot(df,
idx=('Control','Group1','Group4'),
color_col='Gender',
show_means='bars',
means_width=0.6 # Changes the width of the summary bar or the summary line.
)
In [21]:
# Passing a list as a custom palette.
f, b = bsc.contrastplot(df,
idx=('Control','Group1','Group4'),
color_col='Gender',
show_means='bars',
means_width=0.6,
custom_palette=['green', 'tomato'],
)
In [22]:
# Passing a dict as a custom palette.
f, b = bsc.contrastplot(df,
idx=('Control','Group1','Group4'),
color_col='Gender',
show_means='bars',
means_width=0.6,
custom_palette=dict(Male='grey', Female='green')
)
In [23]:
# custom y-axis labels for both swarmplots and violinplots.
f, b = bsc.contrastplot(df,
idx=('Control','Group1','Group4'),
color_col='Gender',
swarm_label='my swarm',
contrast_label='The\nContrasts' # add line break.
)
bsc.contrastplot can also work with 'melted' or 'longform' data. The data are called 'long' because each row corresponds to a single datapoint, with one column carrying the value (here, my_metric) and other columns carrying 'metadata' describing that datapoint (in this case, group and Gender).
For more details on wide vs long or 'melted' data, see https://en.wikipedia.org/wiki/Wide_and_narrow_data
To read more about melting a dataframe, see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html
In [24]:
x='group'
y='my_metric'
color_col='Gender'
df_melt=pd.melt(df.reset_index(),
id_vars=['index',color_col],
value_vars=cols,value_name=y,var_name=x)
df_melt.head() # Gives the first five rows of `df_melt`.
Out[24]:
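To see that melting only reshapes the data, here is a small round trip on a made-up two-row DataFrame: pd.melt produces one row per datapoint, and DataFrame.pivot recovers the wide layout. The names wide, long and back are hypothetical.

```python
import pandas as pd

# A tiny wide-form table: one column per group, plus a metadata column.
wide = pd.DataFrame({'Control': [1.0, 2.0],
                     'Group1': [3.0, 4.0],
                     'Gender': ['Male', 'Female']})

# Melt to long form: one row per datapoint.
long = pd.melt(wide.reset_index(),
               id_vars=['index', 'Gender'],
               value_vars=['Control', 'Group1'],
               var_name='group', value_name='my_metric')
print(long)

# Pivot back to wide form to confirm no values were lost.
back = long.pivot(index='index', columns='group', values='my_metric')
print(back)
```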
If you are using a melted DataFrame, you will need to specify the x (containing the categorical group names) and y (containing the numerical values for plotting) columns.
In [25]:
df_melt
Out[25]:
In [26]:
df
Out[26]:
In [27]:
f, b = bsc.contrastplot(df_melt,
x='group',
y='my_metric',
fig_size=(4,6),
idx=('Control','Group1'),
color_col='Gender',
paired=True
)
b
Out[27]: