Pydiffexp

The pydiffexp package is meant to provide an interface between R and Python to do differential expression analysis.

Imports


In [1]:
import pandas as pd
from pydiffexp import DEAnalysis

Load Data

Each DEAnalysis object (DEA) operates on a specific dataset. DEA uses a hierarchical dataframe (i.e. a dataframe with a multiindex) for analysis. One can either be supplied, or can be created from a dataframe with appropriate column or row labels. DEA expects the multiindex to be along the columns and will transform the data if necessary. DEA can also be initilized without data, but many methods will not work as expected.


In [2]:
test_path = "/Users/jfinkle/Documents/Northwestern/MoDyLS/Python/sprouty/data/raw_data/all_data_formatted.csv"
raw_data = pd.read_csv(test_path, index_col=0)

# Initialize analysis object with data. Data is retained

'''
The hierarchy provides the names for each label in the multiindex. 'condition' and 'time' are supplied as the reference
labels, which are used to make contrasts. 
''' 
hierarchy = ['condition', 'well', 'time', 'replicate']
dea = DEAnalysis(raw_data, index_names=hierarchy, reference_labels=['condition', 'time'] )

Let's look at the data that has been added to the object. Notice that the columns are a Multiindex in which the levels correspond to lists of the possible values and the names of each level come from the list supplied to index_names

Raw Data


In [3]:
raw_data.head()


Out[3]:
WT_1_0_A WT_2_0_B WT_3_0_C WT_4_15_A WT_5_15_B WT_6_15_C WT_7_60_A WT_8_60_B WT_9_60_C WT_10_120_A ... KO_22_60_A KO_23_60_B KO_24_60_C KO_25_120_A KO_26_120_B KO_27_120_C KO_28_240_A KO_29_240_B KO_30_240_C KO_32_0_D
TargetID
0610006I08RIK 244.6 234.3 272.2 280.4 236.0 263.2 224.0 271.0 246.6 265.9 ... 142.9 170.1 143.9 156.8 169.1 171.1 174.1 157.5 162.1 174.6
0610007C21RIK 1259.9 1269.9 1450.2 1363.3 1307.2 1406.4 1222.9 1283.8 1197.2 1385.5 ... 1646.7 1728.6 1495.5 1815.2 1871.5 1847.7 1367.5 1545.0 1383.0 1695.7
0610007P08RIK 29.0 23.5 26.1 27.6 23.5 26.1 22.5 23.0 17.7 13.4 ... 18.0 21.2 18.7 24.2 53.1 7.5 25.0 22.6 16.5 14.2
0610007P14RIK 607.5 540.0 589.6 643.7 724.6 638.9 437.6 409.3 545.6 561.7 ... 1245.0 1477.0 1310.1 1827.0 1928.7 1697.4 1399.4 1675.5 1269.6 1856.9
0610007P22RIK 529.4 564.8 646.0 490.9 553.9 577.6 481.7 556.6 449.6 590.8 ... 356.2 304.5 319.8 402.7 439.3 408.4 450.5 459.3 409.4 427.5

5 rows × 32 columns

Formatted data as Hierarchial Dataframe


In [4]:
dea.data.head()


Out[4]:
condition WT ... KO
well 1 2 3 4 5 6 7 8 9 10 ... 22 23 24 25 26 27 28 29 30 32
time 0 0 0 15 15 15 60 60 60 120 ... 60 60 60 120 120 120 240 240 240 0
replicate A B C A B C A B C A ... A B C A B C A B C D
TargetID
0610006I08RIK 244.6 234.3 272.2 280.4 236.0 263.2 224.0 271.0 246.6 265.9 ... 142.9 170.1 143.9 156.8 169.1 171.1 174.1 157.5 162.1 174.6
0610007C21RIK 1259.9 1269.9 1450.2 1363.3 1307.2 1406.4 1222.9 1283.8 1197.2 1385.5 ... 1646.7 1728.6 1495.5 1815.2 1871.5 1847.7 1367.5 1545.0 1383.0 1695.7
0610007P08RIK 29.0 23.5 26.1 27.6 23.5 26.1 22.5 23.0 17.7 13.4 ... 18.0 21.2 18.7 24.2 53.1 7.5 25.0 22.6 16.5 14.2
0610007P14RIK 607.5 540.0 589.6 643.7 724.6 638.9 437.6 409.3 545.6 561.7 ... 1245.0 1477.0 1310.1 1827.0 1928.7 1697.4 1399.4 1675.5 1269.6 1856.9
0610007P22RIK 529.4 564.8 646.0 490.9 553.9 577.6 481.7 556.6 449.6 590.8 ... 356.2 304.5 319.8 402.7 439.3 408.4 450.5 459.3 409.4 427.5

5 rows × 32 columns


In [5]:
dea.data.columns


Out[5]:
MultiIndex(levels=[['KO', 'WT'], ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '4', '5', '6', '7', '8', '9'], ['0', '120', '15', '240', '60'], ['A', 'B', 'C', 'D']],
           labels=[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 11, 22, 26, 27, 28, 29, 30, 31, 1, 2, 3, 4, 5, 6, 24, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23, 25], [0, 0, 0, 2, 2, 2, 4, 4, 4, 1, 1, 1, 3, 3, 3, 0, 0, 0, 0, 2, 2, 2, 4, 4, 4, 1, 1, 1, 3, 3, 3, 0], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 3]],
           names=['condition', 'well', 'time', 'replicate'])

When the data is added, DEA automatically saves a summary of the experiment, which can also be summarized with the print function.


In [6]:
dea.experiment_summary


Out[6]:
condition well time replicate sample_id
0 WT 1 0 A 5
1 WT 2 0 B 5
2 WT 3 0 C 5
3 WT 4 15 A 7
4 WT 5 15 B 7
5 WT 6 15 C 7
6 WT 7 60 A 9
7 WT 8 60 B 9
8 WT 9 60 C 9
9 WT 10 120 A 6
10 WT 11 120 B 6
11 WT 12 120 C 6
12 WT 13 240 A 8
13 WT 14 240 B 8
14 WT 15 240 C 8
15 WT 31 0 D 5
16 KO 16 0 A 0
17 KO 17 0 B 0
18 KO 18 0 C 0
19 KO 19 15 A 2
20 KO 20 15 B 2
21 KO 21 15 C 2
22 KO 22 60 A 4
23 KO 23 60 B 4
24 KO 24 60 C 4
25 KO 25 120 A 1
26 KO 26 120 B 1
27 KO 27 120 C 1
28 KO 28 240 A 3
29 KO 29 240 B 3
30 KO 30 240 C 3
31 KO 32 0 D 0

In [7]:
dea.print_experiment_summary()


conditions: 2
wells: 32
times: 5
replicates: 4
sample_ids: 10

Model Fitting

Now we're ready to fit a model! All we need to do is supply contrasts that we want to compare. These are formatted in the R style and can either be a string, list, or dictionary. Here we'll just do one contrast, so we supply a string. When the fit is run, DEA gains several new attributes that store the data, design, contrast, and fit objects created by R.

All of the model information is kept as attributes so that the entire object can be saved and the analysis can be recapitulated.


In [8]:
# Types of contrasts
c_dict = {'Diff0': "(KO_15-KO_0)-(WT_15-WT_0)", 'Diff15': "(KO_60-KO_15)-(WT_60-WT_15)",
          'Diff60': "(KO_120-KO_60)-(WT_120-WT_60)", 'Diff120': "(KO_240-KO_120)-(WT_240-WT_120)"}
c_list = ["KO_15-KO_0", "KO_60-KO_15", "KO_120-KO_60", "KO_240-KO_120"]
c_string = "KO_0-WT_0"
dea.fit(c_string)


/Volumes/Hephaestus/jfinkle/Documents/Northwestern/MoDyLS/Python/pydiffexp/pydiffexp/diffexp.py:204: UserWarning: NaNs detected during log expression transformation. Setting to NaN values to zero.
  warnings.warn("NaNs detected during log expression transformation. Setting to NaN values to zero.")

In [9]:
print(dea.design, '', dea.contrast_robj, '', dea.de_fit)


   KO_0 KO_120 KO_15 KO_240 KO_60 WT_0 WT_120 WT_15 WT_240 WT_60
1     0      0     0      0     0    1      0     0      0     0
2     0      0     0      0     0    1      0     0      0     0
3     0      0     0      0     0    1      0     0      0     0
4     0      0     0      0     0    0      0     1      0     0
5     0      0     0      0     0    0      0     1      0     0
6     0      0     0      0     0    0      0     1      0     0
7     0      0     0      0     0    0      0     0      0     1
8     0      0     0      0     0    0      0     0      0     1
9     0      0     0      0     0    0      0     0      0     1
10    0      0     0      0     0    0      1     0      0     0
11    0      0     0      0     0    0      1     0      0     0
12    0      0     0      0     0    0      1     0      0     0
13    0      0     0      0     0    0      0     0      1     0
14    0      0     0      0     0    0      0     0      1     0
15    0      0     0      0     0    0      0     0      1     0
16    0      0     0      0     0    1      0     0      0     0
17    1      0     0      0     0    0      0     0      0     0
18    1      0     0      0     0    0      0     0      0     0
19    1      0     0      0     0    0      0     0      0     0
20    0      0     1      0     0    0      0     0      0     0
21    0      0     1      0     0    0      0     0      0     0
22    0      0     1      0     0    0      0     0      0     0
23    0      0     0      0     1    0      0     0      0     0
24    0      0     0      0     1    0      0     0      0     0
25    0      0     0      0     1    0      0     0      0     0
26    0      1     0      0     0    0      0     0      0     0
27    0      1     0      0     0    0      0     0      0     0
28    0      1     0      0     0    0      0     0      0     0
29    0      0     0      1     0    0      0     0      0     0
30    0      0     0      1     0    0      0     0      0     0
31    0      0     0      1     0    0      0     0      0     0
32    1      0     0      0     0    0      0     0      0     0
attr(,"assign")
 [1] 1 1 1 1 1 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$x
[1] "contr.treatment"

          Contrasts
Levels   KO_0-WT_0
  KO_0           1
  KO_120         0
  KO_15          0
  KO_240         0
  KO_60          0
  WT_0          -1
  WT_120         0
  WT_15          0
  WT_240         0
  WT_60          0
  An object of class "MArrayLM"
$coefficients
0610006I08RIK 0610007C21RIK 0610007P08RIK 0610007P14RIK 0610007P22RIK 
   -0.6045413     0.3275503    -0.3054664     1.5865863    -0.4143036 
18133 more rows ...

$rank
[1] 10

$assign
 [1] 1 1 1 1 1 1 1 1 1 1

$qr
$qr
  KO_0    KO_120     KO_15    KO_240     KO_60 WT_0 WT_120 WT_15 WT_240 WT_60
1   -2  0.000000  0.000000  0.000000  0.000000    0      0     0      0     0
2    0 -1.732051  0.000000  0.000000  0.000000    0      0     0      0     0
3    0  0.000000 -1.732051  0.000000  0.000000    0      0     0      0     0
4    0  0.000000  0.000000 -1.732051  0.000000    0      0     0      0     0
5    0  0.000000  0.000000  0.000000 -1.732051    0      0     0      0     0
27 more rows ...

$qraux
 [1] 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
 [9] 1.000000 1.333333

$pivot
 [1]  1  2  3  4  5  6  7  8  9 10

$tol
[1] 1e-07

$rank
[1] 10


$df.residual
[1] 22 22 22 22 22
18133 more elements ...

$sigma
0610006I08RIK 0610007C21RIK 0610007P08RIK 0610007P14RIK 0610007P22RIK 
   0.10600234    0.08928556    0.52996546    0.14772904    0.14703985 
18133 more elements ...

$cov.coefficients
           Contrasts
Contrasts   KO_0-WT_0
  KO_0-WT_0       0.5

$stdev.unscaled
0610006I08RIK 0610007C21RIK 0610007P08RIK 0610007P14RIK 0610007P22RIK 
    0.7071068     0.7071068     0.7071068     0.7071068     0.7071068 
18133 more rows ...

$Amean
0610006I08RIK 0610007C21RIK 0610007P08RIK 0610007P14RIK 0610007P22RIK 
     7.658307     10.507483      4.410644      9.867572      8.864430 
18133 more elements ...

$method
[1] "ls"

$design
  KO_0 KO_120 KO_15 KO_240 KO_60 WT_0 WT_120 WT_15 WT_240 WT_60
1    0      0     0      0     0    1      0     0      0     0
2    0      0     0      0     0    1      0     0      0     0
3    0      0     0      0     0    1      0     0      0     0
4    0      0     0      0     0    0      0     1      0     0
5    0      0     0      0     0    0      0     1      0     0
27 more rows ...

$contrasts
        Contrasts
Levels   KO_0-WT_0
  KO_0           1
  KO_120         0
  KO_15          0
  KO_240         0
  KO_60          0
  WT_0          -1
  WT_120         0
  WT_15          0
  WT_240         0
  WT_60          0

$df.prior
[1] 0.8676269

$s2.prior
[1] 0.03086284

$var.prior
[1] 217.1348

$proportion
[1] 0.01

$s2.post
0610006I08RIK 0610007C21RIK 0610007P08RIK 0610007P14RIK 0610007P22RIK 
  0.011981144   0.008840421   0.271378053   0.022166820   0.021971377 
18133 more elements ...

$t
0610006I08RIK 0610007C21RIK 0610007P08RIK 0610007P14RIK 0610007P22RIK 
   -7.8107330     4.9267042    -0.8292607    15.0704772    -3.9528017 
18133 more rows ...

$df.total
[1] 22.86763 22.86763 22.86763 22.86763 22.86763
18133 more elements ...

$p.value
0610006I08RIK 0610007C21RIK 0610007P08RIK 0610007P14RIK 0610007P22RIK 
 6.714101e-08  5.680062e-05  4.155296e-01  2.293052e-13  6.379779e-04 
18133 more rows ...

$lods
0610006I08RIK 0610007C21RIK 0610007P08RIK 0610007P14RIK 0610007P22RIK 
    7.8032489     0.9707584    -7.2803447    20.6395921    -1.4374648 
18133 more rows ...

$F
[1]  61.0075505  24.2724140   0.6876732 227.1192845  15.6246415
18133 more elements ...

$F.p.value
[1] 6.714101e-08 5.680062e-05 4.155296e-01 2.293052e-13 6.379779e-04
18133 more elements ...


After the fit, we want to see our significant results. DEA calls topTable so all keywoard arguments from the R function can be passed, though the defaults explicitly in get_results() are the most commonly used ones. If more than one contrast is supplied, pydiffexp will default to using the F statistic when selecting significant genes.


In [10]:
dea.get_results(p_value=0.01, n=10)


Out[10]:
logFC AveExpr t P.Value adj.P.Val B
PTN 9.494964 4.728120 153.893059 5.464559e-36 9.856972e-32 60.700807
IGFBP7 7.095179 8.116964 59.201657 1.579788e-26 1.424811e-22 48.897467
REPIN1 5.566773 2.708824 55.860168 5.912632e-26 3.555069e-22 47.866222
DCN -3.500038 11.313522 -54.491364 1.038499e-25 4.683109e-22 47.417177
BICC1 6.808397 3.364110 52.912578 2.023912e-25 7.301464e-22 46.878590
CXCL1 4.793578 10.249145 45.671766 5.681871e-24 1.708160e-20 44.086635
IGF2 -4.143800 11.253382 -44.791754 8.824021e-24 2.273824e-20 43.706567
ADAMTS2 -4.232043 7.541686 -44.328530 1.116296e-23 2.516969e-20 43.502538
SNCG 4.501207 7.663420 43.242174 1.956178e-23 3.920615e-20 43.012948
WNT7B -5.313021 6.723510 -42.433135 2.997517e-23 5.406921e-20 42.637883