In research papers, it is common for the first table ("Table 1") to display summary statistics of the study data. The tableone package is used to create this table.
For an introduction to basic statistical reporting in biomedical journals, we recommend reading the SAMPL Guidelines. For more reading on accurate reporting in health research, visit the EQUATOR Network.
Set up:
Example usage:
Exporting the table:
While we have tried to use best practices in creating this package, automation of even basic statistical tasks can be unsound if done without supervision. We encourage use of tableone alongside other methods of descriptive statistics and, in particular, visualization to ensure appropriate data handling.
It is beyond the scope of our documentation to provide detailed guidance on summary statistics, but as a primer we provide some considerations for choosing parameters when creating a summary table at: http://tableone.readthedocs.io/en/latest/bestpractice.html.
Guidance should be sought from a statistician when using tableone for a research study, especially prior to submitting the study for publication.
If you use tableone in your study, please cite the following paper:
Tom J Pollard, Alistair E W Johnson, Jesse D Raffa, Roger G Mark; tableone: An open source Python package for producing summary statistics for research papers, JAMIA Open, Volume 1, Issue 1, 1 July 2018, Pages 26–31, https://doi.org/10.1093/jamiaopen/ooy012
Download the BibTeX file from: https://academic.oup.com/jamiaopen/downloadcitation/5001910?format=bibtex
To install the package with pip, run the following command in your terminal:
pip install tableone
To install the package with Conda, run:
conda install -c conda-forge tableone
For more detailed installation instructions, refer to the documentation.
In [1]:
# import numerical libraries
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
# import tableone
try:
    from tableone import TableOne, load_dataset
except (ModuleNotFoundError, ImportError):
    # install on Colab
    !pip install tableone
    from tableone import TableOne, load_dataset
In [3]:
# load PhysioNet 2012 sample data
data = load_dataset('pn2012')
In [4]:
data.head()
Out[4]:
In [5]:
# view the tableone docstring
TableOne??
In [6]:
# create an instance of TableOne with the input arguments
# firstly, with no grouping variable
table1 = TableOne(data)
In [7]:
# view the table
table1
Out[7]:
In [8]:
# the pd.DataFrame object can be accessed using the `tableone` attribute
type(table1.tableone)
Out[8]:
Summary of the table:
- The first row ('n') displays a count of the encounters/observations in the input data.
- The 'Missing' column displays a count of the null values for the particular variable.
- Continuous variables (e.g. 'age') are summarized by 'mean (std)'.
- Categorical variables (e.g. 'ascites') are summarized by 'n (% of non-null values)'.
- If label_suffix=True, "mean (SD); n (%);" etc. are appended to the row label.
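The summary rules above can be checked by hand with pandas. The following is a minimal sketch on a small made-up DataFrame; the column names and all values are illustrative assumptions, not taken from the PhysioNet data, and the code does not use tableone itself.

```python
import pandas as pd

# Made-up data for illustration only; None marks a missing value.
df = pd.DataFrame({
    "age": [34.0, 51.0, None, 67.0, 45.0],
    "ascites": ["no", "yes", "no", None, "no"],
})

# 'n' row: number of observations in the input data.
n = len(df)

# 'Missing' column: count of null values per variable.
missing = df.isnull().sum()

# Continuous variable: mean (SD), computed on the non-null values.
age_mean = df["age"].mean()
age_sd = df["age"].std()

# Categorical variable: n (% of non-null values) for each level.
counts = df["ascites"].value_counts()
percents = 100 * counts / counts.sum()

print("n:", n)
print("missing:", missing.to_dict())
print("age, mean (SD): {:.1f} ({:.1f})".format(age_mean, age_sd))
for level in counts.index:
    print("ascites = {}, n (%): {} ({:.1f})".format(level, counts[level], percents[level]))
```

Comparing hand-computed values like these against a generated table is one way to confirm that missing values are being handled as expected.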
In [9]:
data[['Age','SysABP','Height']].dropna().plot.kde(figsize=[12,8])
plt.legend(['Age (years)', 'SysABP (mmHg)', 'Height (cm)'])
plt.xlim([-30,250])
Out[9]:
In [10]:
data[['Age','Height','SysABP']].boxplot(whis=3)
plt.show()
In both cases it seems that there are values that may need to be taken into account when calculating the summary statistics. For SysABP, a clearly bimodal distribution, the researcher will need to decide how to handle the peak at ~0, perhaps by cleaning the data and/or describing the issue in the summary table. For Height, the researcher may choose to report median, rather than mean.
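To illustrate why the median can be the safer summary when implausible values are present, the sketch below compares the mean and median of a few hypothetical height values, one of which is an obvious data-entry error. The numbers are invented purely to show the effect.

```python
import numpy as np

# Hypothetical heights in cm; the last value is an implausible entry.
heights = np.array([158.0, 162.0, 165.0, 170.0, 171.0, 175.0, 180.0, 431.0])

mean_height = heights.mean()
sd_height = heights.std(ddof=1)  # sample standard deviation
median_height = np.median(heights)
q1, q3 = np.percentile(heights, [25, 75])

# The single outlier pulls the mean above the upper quartile,
# while the median stays within the bulk of the data.
print("mean (SD): {:.1f} ({:.1f})".format(mean_height, sd_height))
print("median [Q1,Q3]: {:.1f} [{:.1f},{:.1f}]".format(median_height, q1, q3))
```

Whether to clean such a value or only change the reported statistic remains a judgment for the researcher; the sketch only shows how strongly each statistic reacts.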
In [11]:
# columns to summarize
columns = ['Age', 'SysABP', 'Height', 'Weight', 'ICU', 'death']
# columns containing categorical variables
categorical = ['ICU']
# non-normal variables
nonnormal = ['Age']
# limit the binary variable "death" to a single row
limit = {"death": 1}
# set the order of the categorical variables
order = {"ICU": ["MICU", "SICU", "CSRU", "CCU"]}
# alternative labels (rename the death column)
labels = {'death': 'Mortality'}
# set decimal places for age to 0
decimals = {"Age": 0}
# optionally, a categorical variable for stratification
groupby = ['death']
# display minimum and maximum for listed variables
min_max = ['Height']
table2 = TableOne(data, columns=columns, categorical=categorical, groupby=groupby,
                  nonnormal=nonnormal, rename=labels, label_suffix=True,
                  decimals=decimals, limit=limit, order=order, min_max=min_max)
table2
Out[11]:
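Summary of the table:
- The variables to summarize are specified in the columns argument.
- The limit argument specifies that only a 1 value should be shown for death.
- The order of the categorical values is set in the order argument.
- nonnormal continuous variables are summarized by 'median [Q1,Q3]' instead of mean (SD).
- The death column is labelled 'Mortality', as specified in the rename argument.
- The data is stratified by the variable given in the groupby argument.
- min_max displays [minimum, maximum] for the variable, instead of standard deviation or upper/lower quartiles.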
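As a rough illustration of the grouped summaries described above, the following sketch computes 'median [Q1,Q3]' and 'mean [min,max]' per group with pandas. The data and the helper names `median_iqr` and `mean_min_max` are hypothetical, and the string formats are an assumption about presentation rather than a reproduction of how tableone builds its rows.

```python
import pandas as pd

# Made-up data for illustration only: five observations per group.
df = pd.DataFrame({
    "death": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "Age": [40, 50, 60, 70, 80, 55, 65, 75, 85, 95],
    "Height": [160.0, 168.0, 172.0, 176.0, 181.0,
               155.0, 163.0, 170.0, 174.0, 178.0],
})

def median_iqr(values, decimals=0):
    """Format a series as 'median [Q1,Q3]' (hypothetical helper)."""
    q1, med, q3 = values.quantile([0.25, 0.5, 0.75])
    return f"{med:.{decimals}f} [{q1:.{decimals}f},{q3:.{decimals}f}]"

def mean_min_max(values, decimals=1):
    """Format a series as 'mean [min,max]' (hypothetical helper)."""
    return (f"{values.mean():.{decimals}f} "
            f"[{values.min():.{decimals}f},{values.max():.{decimals}f}]")

# One row per group, one column per summarized variable.
summary = pd.DataFrame({
    "Age, median [Q1,Q3]": df.groupby("death")["Age"].apply(median_iqr),
    "Height, mean [min,max]": df.groupby("death")["Height"].apply(mean_min_max),
})

# Transpose so that variables are rows and groups are columns.
print(summary.T)
```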
In [12]:
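# create a grouped table with p-values, standardized mean differences,
# and the names of the hypothesis tests
table3 = TableOne(data, columns=columns, categorical=categorical, groupby=groupby,
                  nonnormal=nonnormal, pval=True, smd=True, htest_name=True)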
In [13]:
# view the table
table3
Out[13]:
Summary of the table:
- The htest_name argument can be used to display the name of the hypothesis tests used.
- The 'p-value' column displays the p value generated to 3 decimal places.
Custom hypothesis tests can be defined using the htest argument, which takes a dictionary of variable: function pairs (i.e. htest = {var: custom_func}, where var is the variable and custom_func is a function that takes lists of values in each group). The custom function must return a single pval value.
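To make the contract concrete (a function that receives the values in each group and returns one p-value), here is a minimal sketch of a two-sided permutation test for a difference in means. The function name `permutation_test`, its parameters, and the sample values are all hypothetical; it assumes exactly two groups with no missing values and is not part of tableone.

```python
import numpy as np

def permutation_test(group1, group2, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Takes the values in each of two groups and returns a single p-value.
    """
    rng = np.random.default_rng(seed)
    group1 = np.asarray(group1, dtype=float)
    group2 = np.asarray(group2, dtype=float)
    observed = abs(group1.mean() - group2.mean())
    pooled = np.concatenate([group1, group2])
    n1 = len(group1)

    # Count how often a random relabelling gives a difference at least
    # as large as the one observed.
    exceed = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:n1].mean() - shuffled[n1:].mean())
        if diff >= observed:
            exceed += 1

    # Add one to numerator and denominator so the p-value is never exactly 0.
    return (exceed + 1) / (n_permutations + 1)

# Hypothetical values: two clearly separated groups, then two identical groups.
pval_different = permutation_test(list(range(1, 11)), list(range(21, 31)))
pval_identical = permutation_test([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
print(pval_different, pval_identical)
```

Under the interface described above, a function like this could be supplied as htest = {'Age': permutation_test}; treat that usage as an untested assumption and check the result against a standard test.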
In [14]:
# load PhysioNet 2012 sample data
data = load_dataset('pn2012')
In [15]:
# define the custom tests
def my_custom_test(group1, group2):
    """
    Hypothesis test for test_self_defined_statistical_tests
    """
    # name shown for this test when htest_name=True
    my_custom_test.__name__ = "Custom test 1"
    _, pval = stats.ks_2samp(group1, group2)
    return pval

# If the number of groups is unknown, use *args:
# `*` allows the function to take an unknown number of arguments
def my_custom_test2(*args):
    """
    Hypothesis test for test_self_defined_statistical_tests
    """
    # print the first 10 values in each group (comment out to silence)
    for n, a in enumerate(args):
        print("Group {} (total {} values.): {} ...".format(n, len(a), a[:10]))
    my_custom_test2.__name__ = "Custom test 2"
    # note: ks_2samp compares exactly two samples, so this call assumes two groups
    _, pval = stats.ks_2samp(*args)
    return pval

custom_tests = {'Age': my_custom_test, 'SysABP': my_custom_test2}
In [16]:
# create the table
table4 = TableOne(data, groupby="death", pval=True, htest_name=True, htest=custom_tests)
In [17]:
table4
Out[17]:
The tableone object includes a tabulate method that uses the tabulate package to display the table in custom output formats. Supported table formats include: "github", "grid", "fancy_grid", "rst", "html", "latex", and "latex_raw". See the tabulate package for more formats.
To export your table in LaTeX (for example, to add to your document on Overleaf.com), use the tabulate method, then copy and paste the output below.
In [18]:
# load PhysioNet 2012 sample data
data = load_dataset('pn2012')
In [19]:
# create the table
table5 = TableOne(data, groupby="death")
In [20]:
print(table5.tabulate(tablefmt = "latex"))
In [21]:
print(table5.tabulate(tablefmt = "github"))
In [22]:
# Save to Excel
fn1 = 'tableone.xlsx'
table5.to_excel(fn1)
# Save table to LaTeX
fn2 = 'tableone.tex'
table5.to_latex(fn2)
# Save table to HTML
fn3 = 'tableone.html'
table5.to_html(fn3)