In [1]:
import pandas as pd
from pydiffexp import DEAnalysis
Each DEAnalysis object (DEA) operates on a specific dataset. DEA uses a hierarchical dataframe (i.e. a dataframe with a MultiIndex) for analysis. One can either be supplied directly or be created from a dataframe with appropriate column or row labels. DEA expects the MultiIndex to be along the columns and will transform the data if necessary. DEA can also be initialized without data, but many methods will not work as expected.
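To make the idea of a hierarchical dataframe concrete, here is a minimal pandas-only sketch of turning flat column labels into a column MultiIndex. The label format (`condition_well_time_replicate` joined by underscores), the gene names, and the values are assumptions made up for illustration; this is not necessarily how DEAnalysis parses labels internally.

```python
import pandas as pd

# Hypothetical flat column labels of the form condition_well_time_replicate
flat = pd.DataFrame(
    [[1.0, 1.2, 2.1, 2.3], [0.5, 0.4, 0.9, 1.1]],
    index=["gene_a", "gene_b"],
    columns=["WT_A_0_1", "WT_A_15_1", "KO_B_0_1", "KO_B_15_1"],
)

hierarchy = ["condition", "well", "time", "replicate"]

# Split each label on "_" and build one MultiIndex level per hierarchy name
tuples = [tuple(label.split("_")) for label in flat.columns]
multi = flat.copy()
multi.columns = pd.MultiIndex.from_tuples(tuples, names=hierarchy)

print(multi.columns.names)
# Select every column belonging to one condition
print(multi.xs("KO", axis=1, level="condition"))
```

Note that splitting strings leaves every level value as a string (for example `"15"` rather than `15`), so numeric levels such as time would need an explicit conversion if they are to be sorted numerically.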
In [2]:
test_path = "/Users/jfinkle/Documents/Northwestern/MoDyLS/Python/sprouty/data/raw_data/all_data_formatted.csv"
raw_data = pd.read_csv(test_path, index_col=0)
# Initialize the analysis object with data. The data is retained on the object.
# The hierarchy provides the name of each level in the MultiIndex.
# 'condition' and 'time' are supplied as the reference labels, which are used to make contrasts.
hierarchy = ['condition', 'well', 'time', 'replicate']
dea = DEAnalysis(raw_data, index_names=hierarchy, reference_labels=['condition', 'time'])
Let's look at the data that has been added to the object. Notice that the columns are a MultiIndex in which each level lists its possible values, and the name of each level comes from the list supplied to index_names.
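For readers less familiar with pandas, the following sketch shows how the level names and level values of a column MultiIndex can be inspected. It uses a toy two-level index (`condition` and `time` only) with invented values, not the tutorial dataset.

```python
import pandas as pd

# Toy two-level column index; the real data has four levels
cols = pd.MultiIndex.from_product(
    [["KO", "WT"], [0, 15]], names=["condition", "time"]
)
toy = pd.DataFrame([[1, 2, 3, 4]], index=["gene_a"], columns=cols)

# Names of the levels, as supplied when the index was built
level_names = list(toy.columns.names)

# Possible values of the first level
conditions = list(toy.columns.levels[0])

# Values of a level looked up by name
times = list(toy.columns.get_level_values("time").unique())

print(level_names, conditions, times)
```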
In [3]:
raw_data.head()
Out[3]:
In [4]:
dea.data.head()
Out[4]:
In [5]:
dea.data.columns
Out[5]:
When the data is added, DEA automatically saves a summary of the experiment, which can also be printed with print_experiment_summary().
In [6]:
dea.experiment_summary
Out[6]:
In [7]:
dea.print_experiment_summary()
Now we're ready to fit a model! All we need to do is supply the contrasts that we want to compare. These are formatted in the R style and can be a string, a list, or a dictionary. Here we'll just do one contrast, so we supply a string. When the fit is run, DEA gains several new attributes that store the data, design, contrast, and fit objects created by R.
All of the model information is kept as attributes so that the entire object can be saved and the analysis can be recapitulated.
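R-style contrast strings over a time series follow a regular pattern, so they can be generated rather than typed by hand. The sketch below uses two hypothetical helper functions (they are not part of pydiffexp) and assumes group labels of the form `condition_time`, such as `KO_15`.

```python
def pairwise_time_contrasts(condition, times):
    """Contrast each time point with the one before it, within one condition."""
    return [
        f"{condition}_{later}-{condition}_{earlier}"
        for earlier, later in zip(times, times[1:])
    ]


def difference_of_differences(test, reference, times):
    """Compare the change between consecutive time points across two conditions."""
    contrasts = {}
    for earlier, later in zip(times, times[1:]):
        contrasts[f"Diff{earlier}"] = (
            f"({test}_{later}-{test}_{earlier})"
            f"-({reference}_{later}-{reference}_{earlier})"
        )
    return contrasts


times = [0, 15, 60, 120, 240]
c_list = pairwise_time_contrasts("KO", times)
c_dict = difference_of_differences("KO", "WT", times)

print(c_list)
print(c_dict)
```

Generating the strings this way keeps the contrast names and expressions consistent if time points are added or removed.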
In [8]:
# Types of contrasts
c_dict = {'Diff0': "(KO_15-KO_0)-(WT_15-WT_0)", 'Diff15': "(KO_60-KO_15)-(WT_60-WT_15)",
'Diff60': "(KO_120-KO_60)-(WT_120-WT_60)", 'Diff120': "(KO_240-KO_120)-(WT_240-WT_120)"}
c_list = ["KO_15-KO_0", "KO_60-KO_15", "KO_120-KO_60", "KO_240-KO_120"]
c_string = "KO_0-WT_0"
dea.fit(c_string)
In [9]:
print(dea.design, '', dea.contrast_robj, '', dea.de_fit)
After the fit, we want to see our significant results. DEA calls topTable, so all keyword arguments from the R function can be passed, though the explicit defaults in get_results() are the most commonly used ones. If more than one contrast is supplied, pydiffexp will default to using the F statistic when selecting significant genes.
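As a rough illustration of what a p-value cutoff combined with a row limit does, here is a pandas-only sketch. The column names (`logFC`, `adj_pval`), the helper `top_hits`, and all numbers are hypothetical and invented for this example; they are not output from pydiffexp, and ordering by adjusted p-value is a simplification rather than a description of how topTable ranks genes.

```python
import pandas as pd

# Toy values invented purely for illustration; not real results
results = pd.DataFrame(
    {
        "logFC": [2.1, -0.3, 1.4, -1.8],
        "adj_pval": [0.001, 0.40, 0.02, 0.005],
    },
    index=["gene_a", "gene_b", "gene_c", "gene_d"],
)


def top_hits(table, p_value=0.05, n=10):
    """Keep rows at or below the adjusted p-value cutoff, then return at most n."""
    significant = table[table["adj_pval"] <= p_value]
    return significant.sort_values("adj_pval").head(n)


hits = top_hits(results, p_value=0.01, n=10)
print(hits)
```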
In [10]:
dea.get_results(p_value=0.01, n=10)
Out[10]: