Pydiffexp

The pydiffexp package is meant to provide an interface between R and Python to do differential expression analysis.

Imports



In [1]:

    
import pandas as pd
from pydiffexp import DEAnalysis

Load Data

Each DEAnalysis object (DEA) operates on a specific dataset. DEA uses a hierarchical dataframe (i.e. a dataframe with a multiindex) for analysis. One can either be supplied, or can be created from a dataframe with appropriate column or row labels. DEA expects the multiindex to be along the columns and will transform the data if necessary. DEA can also be initilized without data, but many methods will not work as expected.



In [2]:

    
test_path = "/Users/jfinkle/Documents/Northwestern/MoDyLS/Python/sprouty/data/raw_data/all_data_formatted.csv"
raw_data = pd.read_csv(test_path, index_col=0)

# Initialize analysis object with data. Data is retained

'''
The hierarchy provides the names for each label in the multiindex. 'condition' and 'time' are supplied as the reference
labels, which are used to make contrasts. 
''' 
hierarchy = ['condition', 'well', 'time', 'replicate']
dea = DEAnalysis(raw_data, index_names=hierarchy, reference_labels=['condition', 'time'] )

Let's look at the data that has been added to the object. Notice that the columns are a Multiindex in which the levels correspond to lists of the possible values and the names of each level come from the list supplied to index_names

Raw Data



In [3]:

    
raw_data.head()









    Out[3]:






  
    
      
      WT_1_0_A
      WT_2_0_B
      WT_3_0_C
      WT_4_15_A
      WT_5_15_B
      WT_6_15_C
      WT_7_60_A
      WT_8_60_B
      WT_9_60_C
      WT_10_120_A
      ...
      KO_22_60_A
      KO_23_60_B
      KO_24_60_C
      KO_25_120_A
      KO_26_120_B
      KO_27_120_C
      KO_28_240_A
      KO_29_240_B
      KO_30_240_C
      KO_32_0_D
    
    
      TargetID
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      0610006I08RIK
      244.6
      234.3
      272.2
      280.4
      236.0
      263.2
      224.0
      271.0
      246.6
      265.9
      ...
      142.9
      170.1
      143.9
      156.8
      169.1
      171.1
      174.1
      157.5
      162.1
      174.6
    
    
      0610007C21RIK
      1259.9
      1269.9
      1450.2
      1363.3
      1307.2
      1406.4
      1222.9
      1283.8
      1197.2
      1385.5
      ...
      1646.7
      1728.6
      1495.5
      1815.2
      1871.5
      1847.7
      1367.5
      1545.0
      1383.0
      1695.7
    
    
      0610007P08RIK
      29.0
      23.5
      26.1
      27.6
      23.5
      26.1
      22.5
      23.0
      17.7
      13.4
      ...
      18.0
      21.2
      18.7
      24.2
      53.1
      7.5
      25.0
      22.6
      16.5
      14.2
    
    
      0610007P14RIK
      607.5
      540.0
      589.6
      643.7
      724.6
      638.9
      437.6
      409.3
      545.6
      561.7
      ...
      1245.0
      1477.0
      1310.1
      1827.0
      1928.7
      1697.4
      1399.4
      1675.5
      1269.6
      1856.9
    
    
      0610007P22RIK
      529.4
      564.8
      646.0
      490.9
      553.9
      577.6
      481.7
      556.6
      449.6
      590.8
      ...
      356.2
      304.5
      319.8
      402.7
      439.3
      408.4
      450.5
      459.3
      409.4
      427.5
    
  

5 rows × 32 columns

Formatted data as Hierarchial Dataframe



In [4]:

    
dea.data.head()









    Out[4]:






  
    
      condition
      WT
      ...
      KO
    
    
      well
      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      ...
      22
      23
      24
      25
      26
      27
      28
      29
      30
      32
    
    
      time
      0
      0
      0
      15
      15
      15
      60
      60
      60
      120
      ...
      60
      60
      60
      120
      120
      120
      240
      240
      240
      0
    
    
      replicate
      A
      B
      C
      A
      B
      C
      A
      B
      C
      A
      ...
      A
      B
      C
      A
      B
      C
      A
      B
      C
      D
    
    
      TargetID
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      0610006I08RIK
      244.6
      234.3
      272.2
      280.4
      236.0
      263.2
      224.0
      271.0
      246.6
      265.9
      ...
      142.9
      170.1
      143.9
      156.8
      169.1
      171.1
      174.1
      157.5
      162.1
      174.6
    
    
      0610007C21RIK
      1259.9
      1269.9
      1450.2
      1363.3
      1307.2
      1406.4
      1222.9
      1283.8
      1197.2
      1385.5
      ...
      1646.7
      1728.6
      1495.5
      1815.2
      1871.5
      1847.7
      1367.5
      1545.0
      1383.0
      1695.7
    
    
      0610007P08RIK
      29.0
      23.5
      26.1
      27.6
      23.5
      26.1
      22.5
      23.0
      17.7
      13.4
      ...
      18.0
      21.2
      18.7
      24.2
      53.1
      7.5
      25.0
      22.6
      16.5
      14.2
    
    
      0610007P14RIK
      607.5
      540.0
      589.6
      643.7
      724.6
      638.9
      437.6
      409.3
      545.6
      561.7
      ...
      1245.0
      1477.0
      1310.1
      1827.0
      1928.7
      1697.4
      1399.4
      1675.5
      1269.6
      1856.9
    
    
      0610007P22RIK
      529.4
      564.8
      646.0
      490.9
      553.9
      577.6
      481.7
      556.6
      449.6
      590.8
      ...
      356.2
      304.5
      319.8
      402.7
      439.3
      408.4
      450.5
      459.3
      409.4
      427.5
    
  

5 rows × 32 columns



In [5]:

    
dea.data.columns









    Out[5]:





MultiIndex(levels=[['KO', 'WT'], ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '4', '5', '6', '7', '8', '9'], ['0', '120', '15', '240', '60'], ['A', 'B', 'C', 'D']],
           labels=[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 11, 22, 26, 27, 28, 29, 30, 31, 1, 2, 3, 4, 5, 6, 24, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23, 25], [0, 0, 0, 2, 2, 2, 4, 4, 4, 1, 1, 1, 3, 3, 3, 0, 0, 0, 0, 2, 2, 2, 4, 4, 4, 1, 1, 1, 3, 3, 3, 0], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 3]],
           names=['condition', 'well', 'time', 'replicate'])

When the data is added, DEA automatically saves a summary of the experiment, which can also be summarized with the print function.



In [6]:

    
dea.experiment_summary



In [7]:

    
dea.print_experiment_summary()









    



conditions: 2
wells: 32
times: 5
replicates: 4
sample_ids: 10

Model Fitting

Now we're ready to fit a model! All we need to do is supply contrasts that we want to compare. These are formatted in the R style and can either be a string, list, or dictionary. Here we'll just do one contrast, so we supply a string. When the fit is run, DEA gains several new attributes that store the data, design, contrast, and fit objects created by R.

All of the model information is kept as attributes so that the entire object can be saved and the analysis can be recapitulated.



In [8]:

    
# Types of contrasts
c_dict = {'Diff0': "(KO_15-KO_0)-(WT_15-WT_0)", 'Diff15': "(KO_60-KO_15)-(WT_60-WT_15)",
          'Diff60': "(KO_120-KO_60)-(WT_120-WT_60)", 'Diff120': "(KO_240-KO_120)-(WT_240-WT_120)"}
c_list = ["KO_15-KO_0", "KO_60-KO_15", "KO_120-KO_60", "KO_240-KO_120"]
c_string = "KO_0-WT_0"
dea.fit(c_string)









    



/Volumes/Hephaestus/jfinkle/Documents/Northwestern/MoDyLS/Python/pydiffexp/pydiffexp/diffexp.py:204: UserWarning: NaNs detected during log expression transformation. Setting to NaN values to zero.
  warnings.warn("NaNs detected during log expression transformation. Setting to NaN values to zero.")



In [9]:

    
print(dea.design, '', dea.contrast_robj, '', dea.de_fit)









    



   KO_0 KO_120 KO_15 KO_240 KO_60 WT_0 WT_120 WT_15 WT_240 WT_60
1     0      0     0      0     0    1      0     0      0     0
2     0      0     0      0     0    1      0     0      0     0
3     0      0     0      0     0    1      0     0      0     0
4     0      0     0      0     0    0      0     1      0     0
5     0      0     0      0     0    0      0     1      0     0
6     0      0     0      0     0    0      0     1      0     0
7     0      0     0      0     0    0      0     0      0     1
8     0      0     0      0     0    0      0     0      0     1
9     0      0     0      0     0    0      0     0      0     1
10    0      0     0      0     0    0      1     0      0     0
11    0      0     0      0     0    0      1     0      0     0
12    0      0     0      0     0    0      1     0      0     0
13    0      0     0      0     0    0      0     0      1     0
14    0      0     0      0     0    0      0     0      1     0
15    0      0     0      0     0    0      0     0      1     0
16    0      0     0      0     0    1      0     0      0     0
17    1      0     0      0     0    0      0     0      0     0
18    1      0     0      0     0    0      0     0      0     0
19    1      0     0      0     0    0      0     0      0     0
20    0      0     1      0     0    0      0     0      0     0
21    0      0     1      0     0    0      0     0      0     0
22    0      0     1      0     0    0      0     0      0     0
23    0      0     0      0     1    0      0     0      0     0
24    0      0     0      0     1    0      0     0      0     0
25    0      0     0      0     1    0      0     0      0     0
26    0      1     0      0     0    0      0     0      0     0
27    0      1     0      0     0    0      0     0      0     0
28    0      1     0      0     0    0      0     0      0     0
29    0      0     0      1     0    0      0     0      0     0
30    0      0     0      1     0    0      0     0      0     0
31    0      0     0      1     0    0      0     0      0     0
32    1      0     0      0     0    0      0     0      0     0
attr(,"assign")
 [1] 1 1 1 1 1 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$x
[1] "contr.treatment"

          Contrasts
Levels   KO_0-WT_0
  KO_0           1
  KO_120         0
  KO_15          0
  KO_240         0
  KO_60          0
  WT_0          -1
  WT_120         0
  WT_15          0
  WT_240         0
  WT_60          0
  An object of class "MArrayLM"
$coefficients
0610006I08RIK 0610007C21RIK 0610007P08RIK 0610007P14RIK 0610007P22RIK 
   -0.6045413     0.3275503    -0.3054664     1.5865863    -0.4143036 
18133 more rows ...

$rank
[1] 10

$assign
 [1] 1 1 1 1 1 1 1 1 1 1

$qr
$qr
  KO_0    KO_120     KO_15    KO_240     KO_60 WT_0 WT_120 WT_15 WT_240 WT_60
1   -2  0.000000  0.000000  0.000000  0.000000    0      0     0      0     0
2    0 -1.732051  0.000000  0.000000  0.000000    0      0     0      0     0
3    0  0.000000 -1.732051  0.000000  0.000000    0      0     0      0     0
4    0  0.000000  0.000000 -1.732051  0.000000    0      0     0      0     0
5    0  0.000000  0.000000  0.000000 -1.732051    0      0     0      0     0
27 more rows ...

$qraux
 [1] 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
 [9] 1.000000 1.333333

$pivot
 [1]  1  2  3  4  5  6  7  8  9 10

$tol
[1] 1e-07

$rank
[1] 10


$df.residual
[1] 22 22 22 22 22
18133 more elements ...

$sigma
0610006I08RIK 0610007C21RIK 0610007P08RIK 0610007P14RIK 0610007P22RIK 
   0.10600234    0.08928556    0.52996546    0.14772904    0.14703985 
18133 more elements ...

$cov.coefficients
           Contrasts
Contrasts   KO_0-WT_0
  KO_0-WT_0       0.5

$stdev.unscaled
0610006I08RIK 0610007C21RIK 0610007P08RIK 0610007P14RIK 0610007P22RIK 
    0.7071068     0.7071068     0.7071068     0.7071068     0.7071068 
18133 more rows ...

$Amean
0610006I08RIK 0610007C21RIK 0610007P08RIK 0610007P14RIK 0610007P22RIK 
     7.658307     10.507483      4.410644      9.867572      8.864430 
18133 more elements ...

$method
[1] "ls"

$design
  KO_0 KO_120 KO_15 KO_240 KO_60 WT_0 WT_120 WT_15 WT_240 WT_60
1    0      0     0      0     0    1      0     0      0     0
2    0      0     0      0     0    1      0     0      0     0
3    0      0     0      0     0    1      0     0      0     0
4    0      0     0      0     0    0      0     1      0     0
5    0      0     0      0     0    0      0     1      0     0
27 more rows ...

$contrasts
        Contrasts
Levels   KO_0-WT_0
  KO_0           1
  KO_120         0
  KO_15          0
  KO_240         0
  KO_60          0
  WT_0          -1
  WT_120         0
  WT_15          0
  WT_240         0
  WT_60          0

$df.prior
[1] 0.8676269

$s2.prior
[1] 0.03086284

$var.prior
[1] 217.1348

$proportion
[1] 0.01

$s2.post
0610006I08RIK 0610007C21RIK 0610007P08RIK 0610007P14RIK 0610007P22RIK 
  0.011981144   0.008840421   0.271378053   0.022166820   0.021971377 
18133 more elements ...

$t
0610006I08RIK 0610007C21RIK 0610007P08RIK 0610007P14RIK 0610007P22RIK 
   -7.8107330     4.9267042    -0.8292607    15.0704772    -3.9528017 
18133 more rows ...

$df.total
[1] 22.86763 22.86763 22.86763 22.86763 22.86763
18133 more elements ...

$p.value
0610006I08RIK 0610007C21RIK 0610007P08RIK 0610007P14RIK 0610007P22RIK 
 6.714101e-08  5.680062e-05  4.155296e-01  2.293052e-13  6.379779e-04 
18133 more rows ...

$lods
0610006I08RIK 0610007C21RIK 0610007P08RIK 0610007P14RIK 0610007P22RIK 
    7.8032489     0.9707584    -7.2803447    20.6395921    -1.4374648 
18133 more rows ...

$F
[1]  61.0075505  24.2724140   0.6876732 227.1192845  15.6246415
18133 more elements ...

$F.p.value
[1] 6.714101e-08 5.680062e-05 4.155296e-01 2.293052e-13 6.379779e-04
18133 more elements ...

After the fit, we want to see our significant results. DEA calls topTable so all keywoard arguments from the R function can be passed, though the defaults explicitly in get_results() are the most commonly used ones. If more than one contrast is supplied, pydiffexp will default to using the F statistic when selecting significant genes.



In [10]:

    
dea.get_results(p_value=0.01, n=10)









    Out[10]:






  
    
      
      logFC
      AveExpr
      t
      P.Value
      adj.P.Val
      B
    
  
  
    
      PTN
      9.494964
      4.728120
      153.893059
      5.464559e-36
      9.856972e-32
      60.700807
    
    
      IGFBP7
      7.095179
      8.116964
      59.201657
      1.579788e-26
      1.424811e-22
      48.897467
    
    
      REPIN1
      5.566773
      2.708824
      55.860168
      5.912632e-26
      3.555069e-22
      47.866222
    
    
      DCN
      -3.500038
      11.313522
      -54.491364
      1.038499e-25
      4.683109e-22
      47.417177
    
    
      BICC1
      6.808397
      3.364110
      52.912578
      2.023912e-25
      7.301464e-22
      46.878590
    
    
      CXCL1
      4.793578
      10.249145
      45.671766
      5.681871e-24
      1.708160e-20
      44.086635
    
    
      IGF2
      -4.143800
      11.253382
      -44.791754
      8.824021e-24
      2.273824e-20
      43.706567
    
    
      ADAMTS2
      -4.232043
      7.541686
      -44.328530
      1.116296e-23
      2.516969e-20
      43.502538
    
    
      SNCG
      4.501207
      7.663420
      43.242174
      1.956178e-23
      3.920615e-20
      43.012948
    
    
      WNT7B
      -5.313021
      6.723510
      -42.433135
      2.997517e-23
      5.406921e-20
      42.637883

	WT_1_0_A	WT_2_0_B	WT_3_0_C	WT_4_15_A	WT_5_15_B	WT_6_15_C	WT_7_60_A	WT_8_60_B	WT_9_60_C	WT_10_120_A	...	KO_22_60_A	KO_23_60_B	KO_24_60_C	KO_25_120_A	KO_26_120_B	KO_27_120_C	KO_28_240_A	KO_29_240_B	KO_30_240_C	KO_32_0_D
TargetID
0610006I08RIK	244.6	234.3	272.2	280.4	236.0	263.2	224.0	271.0	246.6	265.9	...	142.9	170.1	143.9	156.8	169.1	171.1	174.1	157.5	162.1	174.6
0610007C21RIK	1259.9	1269.9	1450.2	1363.3	1307.2	1406.4	1222.9	1283.8	1197.2	1385.5	...	1646.7	1728.6	1495.5	1815.2	1871.5	1847.7	1367.5	1545.0	1383.0	1695.7
0610007P08RIK	29.0	23.5	26.1	27.6	23.5	26.1	22.5	23.0	17.7	13.4	...	18.0	21.2	18.7	24.2	53.1	7.5	25.0	22.6	16.5	14.2
0610007P14RIK	607.5	540.0	589.6	643.7	724.6	638.9	437.6	409.3	545.6	561.7	...	1245.0	1477.0	1310.1	1827.0	1928.7	1697.4	1399.4	1675.5	1269.6	1856.9
0610007P22RIK	529.4	564.8	646.0	490.9	553.9	577.6	481.7	556.6	449.6	590.8	...	356.2	304.5	319.8	402.7	439.3	408.4	450.5	459.3	409.4	427.5

condition	WT										...	KO
well	1	2	3	4	5	6	7	8	9	10	...	22	23	24	25	26	27	28	29	30	32
time	0	0	0	15	15	15	60	60	60	120	...	60	60	60	120	120	120	240	240	240	0
replicate	A	B	C	A	B	C	A	B	C	A	...	A	B	C	A	B	C	A	B	C	D
TargetID
0610006I08RIK	244.6	234.3	272.2	280.4	236.0	263.2	224.0	271.0	246.6	265.9	...	142.9	170.1	143.9	156.8	169.1	171.1	174.1	157.5	162.1	174.6
0610007C21RIK	1259.9	1269.9	1450.2	1363.3	1307.2	1406.4	1222.9	1283.8	1197.2	1385.5	...	1646.7	1728.6	1495.5	1815.2	1871.5	1847.7	1367.5	1545.0	1383.0	1695.7
0610007P08RIK	29.0	23.5	26.1	27.6	23.5	26.1	22.5	23.0	17.7	13.4	...	18.0	21.2	18.7	24.2	53.1	7.5	25.0	22.6	16.5	14.2
0610007P14RIK	607.5	540.0	589.6	643.7	724.6	638.9	437.6	409.3	545.6	561.7	...	1245.0	1477.0	1310.1	1827.0	1928.7	1697.4	1399.4	1675.5	1269.6	1856.9
0610007P22RIK	529.4	564.8	646.0	490.9	553.9	577.6	481.7	556.6	449.6	590.8	...	356.2	304.5	319.8	402.7	439.3	408.4	450.5	459.3	409.4	427.5

	condition	well	time	replicate	sample_id
0	WT	1	0	A	5
1	WT	2	0	B	5
2	WT	3	0	C	5
3	WT	4	15	A	7
4	WT	5	15	B	7
5	WT	6	15	C	7
6	WT	7	60	A	9
7	WT	8	60	B	9
8	WT	9	60	C	9
9	WT	10	120	A	6
10	WT	11	120	B	6
11	WT	12	120	C	6
12	WT	13	240	A	8
13	WT	14	240	B	8
14	WT	15	240	C	8
15	WT	31	0	D	5
16	KO	16	0	A	0
17	KO	17	0	B	0
18	KO	18	0	C	0
19	KO	19	15	A	2
20	KO	20	15	B	2
21	KO	21	15	C	2
22	KO	22	60	A	4
23	KO	23	60	B	4
24	KO	24	60	C	4
25	KO	25	120	A	1
26	KO	26	120	B	1
27	KO	27	120	C	1
28	KO	28	240	A	3
29	KO	29	240	B	3
30	KO	30	240	C	3
31	KO	32	0	D	0

	logFC	AveExpr	t	P.Value	adj.P.Val	B
PTN	9.494964	4.728120	153.893059	5.464559e-36	9.856972e-32	60.700807
IGFBP7	7.095179	8.116964	59.201657	1.579788e-26	1.424811e-22	48.897467
REPIN1	5.566773	2.708824	55.860168	5.912632e-26	3.555069e-22	47.866222
DCN	-3.500038	11.313522	-54.491364	1.038499e-25	4.683109e-22	47.417177
BICC1	6.808397	3.364110	52.912578	2.023912e-25	7.301464e-22	46.878590
CXCL1	4.793578	10.249145	45.671766	5.681871e-24	1.708160e-20	44.086635
IGF2	-4.143800	11.253382	-44.791754	8.824021e-24	2.273824e-20	43.706567
ADAMTS2	-4.232043	7.541686	-44.328530	1.116296e-23	2.516969e-20	43.502538
SNCG	4.501207	7.663420	43.242174	1.956178e-23	3.920615e-20	43.012948
WNT7B	-5.313021	6.723510	-42.433135	2.997517e-23	5.406921e-20	42.637883

	condition	well	time	replicate	sample_id
0	WT	1	0	A	5
1	WT	2	0	B	5
2	WT	3	0	C	5
3	WT	4	15	A	7
4	WT	5	15	B	7
5	WT	6	15	C	7
6	WT	7	60	A	9
7	WT	8	60	B	9
8	WT	9	60	C	9
9	WT	10	120	A	6
10	WT	11	120	B	6
11	WT	12	120	C	6
12	WT	13	240	A	8
13	WT	14	240	B	8
14	WT	15	240	C	8
15	WT	31	0	D	5
16	KO	16	0	A	0
17	KO	17	0	B	0
18	KO	18	0	C	0
19	KO	19	15	A	2
20	KO	20	15	B	2
21	KO	21	15	C	2
22	KO	22	60	A	4
23	KO	23	60	B	4
24	KO	24	60	C	4
25	KO	25	120	A	1
26	KO	26	120	B	1
27	KO	27	120	C	1
28	KO	28	240	A	3
29	KO	29	240	B	3
30	KO	30	240	C	3
31	KO	32	0	D	0

	condition	well	time	replicate	sample_id
0	WT	1	0	A	5
1	WT	2	0	B	5
2	WT	3	0	C	5
3	WT	4	15	A	7
4	WT	5	15	B	7
5	WT	6	15	C	7
6	WT	7	60	A	9
7	WT	8	60	B	9
8	WT	9	60	C	9
9	WT	10	120	A	6
10	WT	11	120	B	6
11	WT	12	120	C	6
12	WT	13	240	A	8
13	WT	14	240	B	8
14	WT	15	240	C	8
15	WT	31	0	D	5
16	KO	16	0	A	0
17	KO	17	0	B	0
18	KO	18	0	C	0
19	KO	19	15	A	2
20	KO	20	15	B	2
21	KO	21	15	C	2
22	KO	22	60	A	4
23	KO	23	60	B	4
24	KO	24	60	C	4
25	KO	25	120	A	1
26	KO	26	120	B	1
27	KO	27	120	C	1
28	KO	28	240	A	3
29	KO	29	240	B	3
30	KO	30	240	C	3
31	KO	32	0	D	0