GSEAPY Example

Examples to use GSEApy inside python console


In [1]:
%matplotlib inline
%config InlineBackend.figure_format='retina' # mac
import pandas as pd
import gseapy as gp
import matplotlib.pyplot as plt

Check gseapy version


In [2]:
gp.__version__


Out[2]:
'0.9.13'

1. Convert IDs Using Biomart API

See validated fiters, attributes, datasets

>>> from gseapy.parser import Biomart 
    >>> bm = Biomart(verbose=False, host="asia.ensembl.org")
    >>> ## view validated marts
    >>> marts = bm.get_marts()
    >>> ## view validated dataset
    >>> datasets = bm.get_datasets(mart='ENSEMBL_MART_ENSEMBL')
    >>> ## view validated attributes
    >>> attrs = bm.get_attributes(dataset='hsapiens_gene_ensembl') 
    >>> ## view validated filters
    >>> filters = bm.get_filters(dataset='hsapiens_gene_ensembl')
    >>> ## query results
    >>> results = bm.query(dataset='hsapiens_gene_ensembl', attributes=['entrezgene', 'go_id'],
                          filters={'ensemble_gene_id': [your_input_list]})

In [3]:
# read in an example gene list
gene_list = pd.read_csv("./data/gene_list.txt",header=None, sep="\t")
gene_list.head()


Out[3]:
0
0 IGKV4-1
1 CD55
2 IGKC
3 PPFIBP1
4 ABHD4

In [4]:
# convert dataframe or series to list
glist = gene_list.squeeze().str.strip().tolist()
print(glist[:10])


['IGKV4-1', 'CD55', 'IGKC', 'PPFIBP1', 'ABHD4', 'PCSK6', 'PGD', 'ARHGDIB', 'ITGB2', 'CARD6']

In [5]:
from gseapy.parser import Biomart

In [6]:
bm = Biomart()

In [7]:
## BioMart is slow. Be patient
results = bm.query(dataset='hsapiens_gene_ensembl', 
                   attributes=['external_gene_name','entrezgene', 'go_id'],
                   filters={'hgnc_symbol': glist}, 
                   # save output file 
                   filename="query.results.txt")

In [8]:
# note: pandas read entrez ids as floats, you need to convert them to int 
results.head()


Out[8]:
external_gene_name entrezgene go_id
0 ABAT 18.0 GO:0003824
1 ABAT 18.0 GO:0030170
2 ABAT 18.0 GO:0008483
3 ABAT 18.0 GO:0003867
4 ABAT 18.0 GO:0009448

2. Enrichr Example

See all supported enrichr library names

Select database from { 'Human', 'Mouse', 'Yeast', 'Fly', 'Fish', 'Worm' }

Enrichr library could be used for gsea, ssgsea, and prerank, too


In [9]:
names = gp.get_library_name() # default: Human
names[:10]


Out[9]:
['ARCHS4_Cell-lines',
 'ARCHS4_IDG_Coexp',
 'ARCHS4_Kinases_Coexp',
 'ARCHS4_TFs_Coexp',
 'ARCHS4_Tissues',
 'Achilles_fitness_decrease',
 'Achilles_fitness_increase',
 'Aging_Perturbations_from_GEO_down',
 'Aging_Perturbations_from_GEO_up',
 'Allen_Brain_Atlas_down']

In [10]:
yeast = gp.get_library_name(database='Yeast') 
yeast[:10]


Out[10]:
['Cellular_Component_AutoRIF',
 'Cellular_Component_AutoRIF_Predicted_zscore',
 'GO_Biological_Process_2018',
 'GO_Biological_Process_AutoRIF',
 'GO_Biological_Process_AutoRIF_Predicted_zscore',
 'GO_Cellular_Component_2018',
 'GO_Cellular_Component_AutoRIF',
 'GO_Cellular_Component_AutoRIF_Predicted_zscore',
 'GO_Molecular_Function_2018',
 'GO_Molecular_Function_AutoRIF']

2.1 Assign enrichr with pd.Series, pd.DataFrame, or list object

2.1.1 gene_sets support list, str.

Multi-libraries names supported, separate each name by comma or input a list.

For example:

# gene_list
    gene_list="./data/gene_list.txt", 
    gene_list=glist
    # gene_sets
    gene_sets='KEGG_2016'  
    gene_sets='KEGG_2016,KEGG_2013'
    gene_sets=['KEGG_2016','KEGG_2013']

In [11]:
# run enrichr
# if you are only intrested in dataframe that enrichr returned, please set no_plot=True

# list, dataframe, series inputs are supported
enr = gp.enrichr(gene_list="./data/gene_list.txt", 
                 # or gene_list=glist
                 description='test_name', 
                 gene_sets=['KEGG_2016','KEGG_2013'], 
                 outdir='test/enrichr_kegg', 
                 cutoff=0.5 # test dataset, use lower value from range(0,1)
                )

In [12]:
# access results through res2d attr
# obj.res2d only stores the last enrichr result in your input libraries
# obj.results stores all results
enr.results.head(5)


Out[12]:
Adjusted P-value Combined Score Gene_set Genes Old Adjusted P-value Old P-value Overlap P-value Term Z-score
0 7.888667e-11 53.688178 KEGG_2016 LILRA6;ITGB3;LILRA2;LILRA5;PPP3R1;FCGR3B;SIRPA... 2.292723e-07 9.026468e-10 28/132 3.105774e-13 Osteoclast differentiation_Homo sapiens_hsa04380 -1.864151
1 5.448332e-10 46.588613 KEGG_2016 RAB5B;ITGB2;PPP3R1;HLA-DMA;HLA-DMB;FCGR3B;CASP... 9.274793e-07 7.302987e-09 31/178 4.290025e-12 Tuberculosis_Homo sapiens_hsa05152 -1.779908
2 1.366949e-09 41.214733 KEGG_2016 ATP6V1A;RAB5B;ITGB5;ITGB3;ITGB2;HLA-DMA;HLA-DM... 1.513502e-06 1.787601e-08 28/154 1.614506e-11 Phagosome_Homo sapiens_hsa04145 -1.658580
3 1.591672e-07 34.154775 KEGG_2016 TGFB1;PRKCB;IFNGR1;IFNGR2;ITGB2;FOS;MAPK14;HLA... 3.321891e-05 6.539155e-07 17/73 3.133213e-09 Leishmaniasis_Homo sapiens_hsa05140 -1.744263
4 1.395934e-07 33.667700 KEGG_2016 ATP6V1A;ATP6V1G1;ATP6V0B;TGFB1;ITGB2;FOS;ITGAL... 3.321891e-05 5.292880e-07 19/90 2.198322e-09 Rheumatoid arthritis_Homo sapiens_hsa05323 -1.688825

2.1.2 Local mode of GO analysis

If input a .gmt file or gene_set dict object, enrichr runs local.
You have to specify the background genes, if local mode used

For example:

gene_sets="./data/genes.gmt",
    gene_sets={'A':['gene1', 'gene2',...],
               'B':['gene2', 'gene4',...],
               ...}

In [13]:
enr2 = gp.enrichr(gene_list="./data/gene_list.txt", 
                 # or gene_list=glist
                 description='test_name', 
                 gene_sets="./data/genes.gmt",
                 background='hsapiens_gene_ensembl', # or the number of genes, e.g 20000
                 outdir='test/enrichr_kegg2', 
                 cutoff=0.5, # only used for testing.
                 verbose=True)


2019-05-26 12:12:34,822 User Defined gene sets is given: ./data/genes.gmt
2019-05-26 12:12:34,824 Connecting to Enrichr Server to get latest library names
2019-05-26 12:12:34,849 using all annotated genes with GO_ID as background genes
2019-05-26 12:12:34,858 Background: found 19041 genes
2019-05-26 12:12:34,866 Save file of enrichment results: Job Id:140275387977160
2019-05-26 12:12:35,104 Done.


In [14]:
enr2.results.head(5)


Out[14]:
Adjusted P-value Gene_set Genes Overlap P-value Term
0 0.581624 CUSTOM140275387977160 PADI2;HAL;MSRB2;IL1R1;MAP3K5;PCSK6;MBOAT2;IQGAP2 8/139 0.287984 BvA_UpIN_A
1 0.227727 CUSTOM140275387977160 DYSF;SUOX;ST3GAL6;KCTD12;HEBP1;ARHGDIB;SYK;MBN... 11/130 0.032532 BvA_UpIN_B
2 0.581624 CUSTOM140275387977160 MBOAT2 1/12 0.424279 CvA_UpIN_A
3 0.581624 CUSTOM140275387977160 PADI2;HAL;ATP6V1B2;IL1R1;IFNGR2;FXYD6;NMNAT1;A... 16/284 0.210293 DvA_UpIN_A
4 0.581624 CUSTOM140275387977160 DYSF;SIRPA;ST3GAL6;HEBP1;GNB4;FAM198B;TXNDC5;M... 12/236 0.372799 DvA_UpIN_D

2.1.3 Plotting


In [15]:
# simple plotting function
from gseapy.plot import barplot, dotplot

# to save your figure, make sure that ``ofname`` is not None
barplot(enr.res2d,title='KEGG_2013',)


Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9469b44358>

In [16]:
# to save your figure, make sure that ``ofname`` is not None
dotplot(enr.res2d, title='KEGG_2013',)


Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9468a95b00>

2.2 Command line usage

You may also want to use enrichr in command line

the option -v will print out the progress of your job


In [17]:
# !gseapy enrichr -i ./data/gene_list.txt \
#                 --ds BP2017 \
#                 -g GO_Biological_Process_2017 \
#                 -v -o test/enrichr_BP

3. Prerank example

3.1 Assign prerank() with a pd.DataFrame, pd.Series , or a txt file

Do not include header in your gene list !
GSEApy will skip any data after "#".
Only contains two columns, or one cloumn with gene_name indexed when assign a DataFrame to prerank


In [18]:
rnk = pd.read_csv("./data/edb/gsea_data.gsea_data.rnk", header=None, sep="\t")
rnk.head()


Out[18]:
0 1
0 CTLA2B 2.502482
1 SCARA3 2.095578
2 LOC100044683 1.116398
3 CMBL 0.877640
4 CLIC6 0.822181

In [19]:
# run prerank
# enrichr libraries are supported by prerank module. Just provide the name
# use 4 process to acceralate the permutation speed

# note: multiprocessing may not work on windows
pre_res = gp.prerank(rnk=rnk, gene_sets='KEGG_2016', 
                     processes=4,
                     permutation_num=100, # reduce number to speed up testing
                     outdir='test/prerank_report_kegg', format='png')

Leading edge genes save to the final output results


In [20]:
#access results through obj.res2d attribute or obj.results
pre_res.res2d.head()


Out[20]:
es nes pval fdr geneset_size matched_size genes ledge_genes
Term
Cytokine-cytokine receptor interaction_Homo sapiens_hsa04060 0.418234 2.588945 0.107143 0.210526 265 18 IL13RA1;CSF1;CCL2;TGFBR2;CD40;IL10RB;CXCL10;CX... IL13RA1;CSF1;CCL2;TGFBR2;CD40;IL10RB;CXCL10
Pathways in cancer_Homo sapiens_hsa05200 -0.188934 -3.526497 0.500000 0.284247 397 27 TGFBR2;LAMB3;RET;PDGFRB;ADCY6;ETS1;TGFA;WNT9A;... VEGFC;LPAR4;LEF1;RUNX1T1
Rap1 signaling pathway_Homo sapiens_hsa04015 0.221306 1.346911 0.615385 0.752193 211 19 CSF1;EFNA1;ARAP3;PDGFRB;ADCY6;PRKD1;MAPK13;FLT... CSF1;EFNA1
PI3K-Akt signaling pathway_Homo sapiens_hsa04151 0.191372 1.149349 0.757576 0.813910 341 22 CSF1;EFNA1;COL6A1;THBS4;LAMB3;PDGFRB;GNG13;FLT... CSF1;EFNA1;COL6A1;THBS4;LAMB3
HTLV-I infection_Homo sapiens_hsa05166 0.338286 1.842641 0.146667 0.848684 258 19 CRTC3;TGFBR2;CD40;PDGFRB;ADCY6;PPP3CC;ETS1;WNT... CRTC3;TGFBR2;CD40;PDGFRB;ADCY6;PPP3CC;ETS1;WNT...

3.2 How to generate your GSEA plot inside python console

Visualize it using gseaplot

Make sure that ofname is not None, if you want to save your figure to the disk


In [21]:
# extract geneset terms in res2d
terms = pre_res.res2d.index
terms


Out[21]:
Index(['Cytokine-cytokine receptor interaction_Homo sapiens_hsa04060',
       'Pathways in cancer_Homo sapiens_hsa05200',
       'Rap1 signaling pathway_Homo sapiens_hsa04015',
       'PI3K-Akt signaling pathway_Homo sapiens_hsa04151',
       'HTLV-I infection_Homo sapiens_hsa05166',
       'MAPK signaling pathway_Homo sapiens_hsa04010',
       'Focal adhesion_Homo sapiens_hsa04510',
       'Metabolic pathways_Homo sapiens_hsa01100',
       'Ras signaling pathway_Homo sapiens_hsa04014'],
      dtype='object', name='Term')

In [22]:
## easy way
from gseapy.plot import gseaplot

# to save your figure, make sure that ofname is not None
gseaplot(rank_metric=pre_res.ranking, term=terms[0], **pre_res.results[terms[0]])

# save figure
# gseaplot(rank_metric=pre_res.ranking, term=terms[0], ofname='your.plot.pdf', **pre_res.results[terms[0]])


3) Command line usage

You may also want to use prerank in command line


In [23]:
# ! gseapy prerank -r temp.rnk -g temp.gmt -o prerank_report_temp

4. GSEA Example

4.1 Assign gsea() with a pandas DataFrame, .gct format file, or a text file

and cls with a list object or just .cls format file


In [24]:
phenoA, phenoB, class_vector =  gp.parser.gsea_cls_parser("./data/P53.cls")

In [25]:
#class_vector used to indicate group attributes for each sample
print(class_vector)


['MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'WT', 'WT', 'WT', 'WT', 'WT', 'WT', 'WT', 'WT', 'WT', 'WT', 'WT', 'WT', 'WT', 'WT', 'WT', 'WT', 'WT']

In [26]:
gene_exp = pd.read_csv("./data/P53.txt", sep="\t")
gene_exp.head()


Out[26]:
NAME DESCRIPTION 786-0 BT-549 CCRF-CEM COLO 205 EKVX HCC-2998 HCT-15 HOP-62 ... MCF7 MOLT-4 NCI-H460 OVCAR-4 SF-539 SK-MEL-5 SR UACC-257 UACC-62 UO-31
0 TACC2 na 46.05 82.17 16.87 98.60 141.02 114.32 134.34 44.95 ... 68.14 32.21 105.89 64.99 53.52 85.47 18.69 32.16 45.70 48.13
1 C14orf132 na 108.34 59.04 25.61 33.11 42.53 9.12 9.36 310.96 ... 159.32 10.71 13.59 53.78 57.57 86.80 17.30 102.66 62.16 73.44
2 AGER na 42.20 25.75 76.01 40.41 32.17 48.28 58.27 42.40 ... 51.50 61.48 44.44 45.68 54.17 62.53 83.18 56.57 50.40 36.75
3 32385_at na 7.43 13.94 8.55 21.13 15.09 19.05 16.47 7.60 ... 30.77 21.27 13.36 16.19 12.07 17.62 22.60 4.50 14.59 11.33
4 RBM17 na 11.40 3.00 3.16 2.34 4.43 1.56 6.04 6.16 ... 1.62 2.77 4.42 8.91 12.28 3.04 10.13 8.32 8.23 3.91

5 rows × 52 columns


In [27]:
print("positively correlated: ", phenoA)


positively correlated:  MUT

In [28]:
print("negtively correlated: ", phenoB)


negtively correlated:  WT

In [29]:
# run gsea
# enrichr libraries are supported by gsea module. Just provide the name

gs_res = gp.gsea(data=gene_exp, # or data='./P53_resampling_data.txt'
                 gene_sets='KEGG_2016', # enrichr library names
                 cls= './data/P53.cls', # cls=class_vector
                 # set permutation_type to phenotype if samples >=15
                 permutation_type='phenotype', 
                 permutation_num=100, # reduce number to speed up test
                 outdir=None,  # do not write output to disk
                 no_plot=True, # Skip plotting
                 method='signal_to_noise',
                 processes=4,
                 format='png')

In [30]:
#access the dataframe results throught res2d attribute
gs_res.res2d.head()


Out[30]:
es nes pval fdr geneset_size matched_size genes ledge_genes
Term
N-Glycan biosynthesis_Homo sapiens_hsa00510 0.538433 3.600101 0.052174 0.223501 49 24 DPM2;B4GALT2;GANAB;FUT8;DPM1;MAN1A1;TUSC3;ALG8... DPM2;B4GALT2;GANAB;FUT8;DPM1;MAN1A1;TUSC3;ALG8...
Calcium signaling pathway_Homo sapiens_hsa04020 -0.376305 -4.227417 0.000000 0.333550 180 144 ADORA2B;PLCE1;PHKA2;GNA11;PPP3CB;PTGFR;ERBB2;C... GRIN2C;HTR2A;ITPR2;ADCY2;GRIN2A;TBXA2R;SLC8A1;...
Fat digestion and absorption_Homo sapiens_hsa04975 -0.521341 -4.005290 0.028302 0.391059 41 25 DGAT1;PPAP2A;PPAP2C;PPAP2B;ACAT2;AGPAT2;GOT2;F... APOA1;PLA2G1B;SCARB1;CEL;PNLIPRP1;APOA4;APOB;A...
Rheumatoid arthritis_Homo sapiens_hsa05323 -0.421741 -3.304221 0.044248 0.434478 90 67 IL18;MMP1;MMP3;FOS;CCL2;TCIRG1;TNFRSF11A;HLA-D... ATP6V0A2;ATP6V1E1;ATP6V1G1;ACP5;ATP6V0D1;HLA-D...
Neuroactive ligand-receptor interaction_Homo sapiens_hsa04080 -0.400530 -3.286234 0.060345 0.435423 277 185 ADORA2B;PRSS2;PTGFR;CNR1;DRD4;P2RY2;PRSS3;GRIA... GRIN2C;HTR2A;GRIN2A;GRM4;C3AR1;SSTR5;THRB;TBXA...

4.2 Show the gsea plots

The gsea module will generate heatmap for genes in each gene sets in the backgroud.
But if you need to do it yourself, use the code below


In [31]:
from gseapy.plot import gseaplot, heatmap
terms = gs_res.res2d.index
# Make sure that ``ofname`` is not None, if you want to save your figure to disk
gseaplot(gs_res.ranking, term=terms[0], **gs_res.results[terms[0]])



In [32]:
# plotting heatmap
genes = gs_res.res2d.genes[0].split(";")
# Make sure that ``ofname`` is not None, if you want to save your figure to disk
heatmap(df = gs_res.heatmat.loc[genes], z_score=0, title=terms[0], figsize=(18,6))


4.3 Command line usage

You may also want to use gsea in command line


In [33]:
# !gseapy gsea -d ./data/P53_resampling_data.txt \
#              -g KEGG_2016 -c ./data/P53.cls \
#              -o test/gsea_reprot_2 \
#              -v --no-plot \
#              -t phenotype

5. Single Sample GSEA example

Note: When you run ssGSEA, all genes names in your gene_sets file should be found in your expression table

5.1 Input format

Assign ssgsea() with a txt file, gct file, pd.DataFrame, or pd.Seires(gene name as index)


In [34]:
# txt, gct file input
ss = gp.ssgsea(data="./data/testSet_rand1200.gct",
               gene_sets="./data/randomSets.gmt", 
               outdir='test/ssgsea_report', 
               sample_norm_method='rank', # choose 'custom' for your own rank list
               permutation_num=0, # skip permutation procedure, because you don't need it
               no_plot=True, # skip plotting, because you don't need these figures
               processes=4, format='png')

In [35]:
ss.res2d


Out[35]:
AA488_A1.2 AA489_A2.2 AA490_A3 AA491_B1 AA492_B2 AA493_B3 AA494_C1.2 AA495_C2 AA496_C3 AA497_D1.2 AA498_D3.2 AA499_D2 AA500_x2 AA501_X3 AA502_X2.2 AA503_Y1 AA504_Y2 AA505_Y3
Term|NES
level10_RAND 0.409675 0.412178 0.402321 0.420735 0.423723 0.399593 0.417449 0.426562 0.414945 0.427697 0.428861 0.439795 0.439169 0.429886 0.421071 0.433796 0.436511 0.427593
level12_random 0.623442 0.637662 0.624727 0.636428 0.644896 0.628848 0.645327 0.633330 0.637862 0.647586 0.648293 0.638572 0.652058 0.650088 0.653153 0.660680 0.659886 0.648956
level2_rand -0.271914 -0.281744 -0.271358 -0.277807 -0.279207 -0.260003 -0.294843 -0.289941 -0.286516 -0.294068 -0.285716 -0.286194 -0.335032 -0.331248 -0.337686 -0.336926 -0.334273 -0.339320
level4_rand -0.061550 -0.090115 -0.063876 -0.094134 -0.094397 -0.053017 -0.096070 -0.081046 -0.080501 -0.092273 -0.092036 -0.078247 -0.119548 -0.120003 -0.129959 -0.143212 -0.138202 -0.126784
level6_rand -0.061075 -0.054566 -0.061283 -0.052871 -0.054915 -0.047231 -0.059001 -0.069959 -0.061611 -0.055635 -0.058775 -0.062907 -0.020677 -0.017325 -0.011440 -0.018986 -0.017288 -0.009031
level8_rand 0.145081 0.157404 0.165249 0.160468 0.141703 0.172145 0.139912 0.173172 0.153195 0.146414 0.127259 0.165137 0.175863 0.166732 0.175394 0.166237 0.169276 0.192095
random1 0.154009 0.151312 0.151478 0.147901 0.151960 0.140935 0.154359 0.157054 0.152341 0.150122 0.147418 0.157688 0.125304 0.117738 0.109891 0.113501 0.115591 0.110315
random2 0.114789 0.111621 0.101720 0.112612 0.113221 0.106218 0.120733 0.131228 0.113305 0.125728 0.118026 0.139158 0.090622 0.084933 0.083598 0.080412 0.087921 0.100607
random3 0.090674 0.103528 0.101386 0.099922 0.101013 0.098896 0.095853 0.080370 0.100862 0.089775 0.098816 0.088680 0.098413 0.107809 0.099561 0.103815 0.105186 0.098073
random4 0.085871 0.080253 0.092358 0.074533 0.071283 0.079997 0.077435 0.065426 0.081537 0.063962 0.066237 0.062175 0.063953 0.061089 0.055425 0.052380 0.053036 0.069909
random5 0.063539 0.063641 0.057671 0.064427 0.063732 0.047877 0.068045 0.080220 0.060985 0.072431 0.061967 0.075577 0.107136 0.094497 0.099291 0.096348 0.100260 0.098746

In [36]:
from gseapy.plot import heatmap
heatmap(ss.res2d)



In [37]:
# or assign a dataframe, or Series to ssgsea()
ssdf = pd.read_csv("./data/temp.txt", header=None, sep="\t")
ssdf.head()


Out[37]:
0 1
0 ATXN1 16.456753
1 UBQLN4 13.989493
2 CALM1 13.745533
3 DLG4 12.796588
4 MRE11A 12.787631

In [38]:
# dataframe with one column is also supported by ssGSEA or Prerank
# But you have to set gene_names as index
ssdf2 = ssdf.set_index(0)
ssdf2.head()


Out[38]:
1
0
ATXN1 16.456753
UBQLN4 13.989493
CALM1 13.745533
DLG4 12.796588
MRE11A 12.787631

In [39]:
type(ssdf2)


Out[39]:
pandas.core.frame.DataFrame

In [40]:
ssSeries = ssdf2.squeeze()
type(ssSeries)


Out[40]:
pandas.core.series.Series

In [41]:
# reuse data
df = pd.read_csv("./data/P53_resampling_data.txt", sep="\t")
df.head()


Out[41]:
NAME 786-0 BT-549 CCRF-CEM COLO 205 EKVX HCC-2998 HCT-15 HOP-62 HOP-92 ... MCF7 MOLT-4 NCI-H460 OVCAR-4 SF-539 SK-MEL-5 SR UACC-257 UACC-62 UO-31
0 CTLA2B 111.19 86.22 121.85 75.19 208.62 130.59 124.72 324.09 242.71 ... 163.76 59.50 134.12 152.09 197.46 137.79 81.53 123.37 81.41 180.78
1 SCARA3 460.30 558.34 183.55 37.29 158.00 43.61 80.83 300.08 1250.25 ... 109.91 120.42 73.06 115.03 95.12 37.56 76.16 41.10 77.51 519.17
2 LOC100044683 97.25 118.94 81.17 119.51 119.88 107.73 165.57 203.97 135.43 ... 222.84 124.98 114.75 141.66 170.19 147.70 157.48 152.18 98.89 118.06
3 CMBL 33.45 55.10 221.67 50.30 35.12 75.70 84.01 44.12 79.96 ... 51.32 117.11 59.46 78.46 45.55 49.07 96.69 33.09 10.38 52.89
4 CLIC6 35.75 41.26 63.04 219.86 42.53 54.19 86.98 71.20 53.89 ... 154.05 31.62 37.66 32.64 63.35 27.95 70.99 36.25 17.50 49.41

5 rows × 51 columns


In [42]:
# Series, DataFrame Example
# supports dataframe and series
ssgs = []
for i, dat in enumerate([ssdf, ssdf2, ssSeries, df]):
    sstemp = gp.ssgsea(data=dat, 
                       gene_sets="./data/genes.gmt", 
                       outdir='test/ssgsea_report_'+str(i),
                       scale=False, # set scale to False to get real original ES
                       permutation_num=0, # skip permutation procedure, because you don't need it
                       no_plot=True, # skip plotting, because you don't need these figures
                       processes=4, 
                       format='png')
    ssgs.append(sstemp)


2019-05-26 12:12:57,033 Warning: dropping duplicated gene names, only keep the first values

5.2 Access Enrichment Score (ES) and NES

results save to two attribute:

  1. obj.resultsOnSamples: ES
  2. obj.res2d: NES

In [43]:
# normalized es save to res2d attri
# one sample input
# NES
ssgs[0].res2d.head()


Out[43]:
1
Term|NES
BvA_UpIN_A 2.150114
BvA_UpIN_B 2.953848
DvA_UpIN_A 1.985451
DvA_UpIN_D 2.457489
YvX_UpIN_X 2.148816

Note: If you want to obtain the real original enrichment score,
you have to set scale=False


In [44]:
# ES
# convert dict to DataFrame
es = pd.DataFrame(ssgs[-1].resultsOnSamples)
es.head()


Out[44]:
786-0 BT-549 CCRF-CEM COLO 205 EKVX HCC-2998 HCT-15 HOP-62 HOP-92 HS 578T ... MCF7 MOLT-4 NCI-H460 OVCAR-4 SF-539 SK-MEL-5 SR UACC-257 UACC-62 UO-31
Term|ES
BvA_UpIN_B 19.826225 68.077601 127.025106 53.618132 51.205653 21.285201 79.380375 51.496277 38.784929 64.373380 ... 49.903306 100.444246 59.520973 59.090405 61.839421 37.801029 61.567375 24.239317 40.271419 46.221578
DvA_UpIN_A 37.106732 16.102699 10.056191 16.232475 32.951477 8.473519 -0.549080 21.782343 21.595246 40.755753 ... 8.977915 11.086783 42.961615 49.893513 22.867080 30.443848 11.434225 53.935011 60.430223 44.753762
DvA_UpIN_D 71.124111 84.634416 77.987009 55.918679 72.498150 60.698712 42.545929 59.307949 72.721958 81.539346 ... 68.626778 71.937189 89.658150 95.724196 76.479274 68.035031 59.802289 78.330877 73.898535 76.820769

3 rows × 50 columns


In [45]:
# if set scale to True, then 
# Scaled ES equal to es/gene_numbers
ses = es/df.shape[0]
ses


Out[45]:
786-0 BT-549 CCRF-CEM COLO 205 EKVX HCC-2998 HCT-15 HOP-62 HOP-92 HS 578T ... MCF7 MOLT-4 NCI-H460 OVCAR-4 SF-539 SK-MEL-5 SR UACC-257 UACC-62 UO-31
Term|ES
BvA_UpIN_B 0.028568 0.098095 0.183033 0.077260 0.073783 0.030670 0.114381 0.074202 0.055886 0.092757 ... 0.071907 0.144732 0.085765 0.085145 0.089106 0.054468 0.088714 0.034927 0.058028 0.066602
DvA_UpIN_A 0.053468 0.023203 0.014490 0.023390 0.047481 0.012210 -0.000791 0.031387 0.031117 0.058726 ... 0.012936 0.015975 0.061904 0.071893 0.032950 0.043867 0.016476 0.077716 0.087075 0.064487
DvA_UpIN_D 0.102484 0.121952 0.112373 0.080574 0.104464 0.087462 0.061305 0.085458 0.104787 0.117492 ... 0.098886 0.103656 0.129190 0.137931 0.110201 0.098033 0.086170 0.112869 0.106482 0.110693

3 rows × 50 columns


In [46]:
# NES 
# scale or no have no affects on final nes value
nes = ssgs[-1].res2d
nes.head()


Out[46]:
786-0 BT-549 CCRF-CEM COLO 205 EKVX HCC-2998 HCT-15 HOP-62 HOP-92 HS 578T ... MCF7 MOLT-4 NCI-H460 OVCAR-4 SF-539 SK-MEL-5 SR UACC-257 UACC-62 UO-31
Term|NES
BvA_UpIN_B 0.132580 0.455243 0.849433 0.358551 0.342419 0.142337 0.530826 0.344362 0.259360 0.430473 ... 0.333710 0.671683 0.398024 0.395145 0.413528 0.252780 0.411709 0.162091 0.269300 0.309089
DvA_UpIN_A 0.248137 0.107681 0.067247 0.108549 0.220351 0.056663 -0.003672 0.145661 0.144410 0.272539 ... 0.060036 0.074139 0.287290 0.333644 0.152915 0.203582 0.076462 0.360670 0.404104 0.299274
DvA_UpIN_D 0.475616 0.565961 0.521509 0.373935 0.484804 0.405900 0.284510 0.396600 0.486301 0.545264 ... 0.458916 0.481053 0.599555 0.640120 0.511426 0.454959 0.399905 0.523808 0.494169 0.513710

3 rows × 50 columns

3) command line usage of single sample gsea


In [47]:
# set --no-scale to obtain the real original enrichment score
# !gseapy ssgsea -d ./data/testSet_rand1200.gct \
#                -g data/temp.gmt \
#                -o test/ssgsea_report2  \
#                -p 4 --no-plot --no-scale

6. Replot Example

6.1 locate your directory

notes: replot module need to find edb folder to work properly. keep the file tree like this:

data
 |--- edb
 |    |--- C1OE.cls
 |    |--- gene_sets.gmt
 |    |--- gsea_data.gsea_data.rnk
 |    |--- results.edb

In [48]:
# run command inside python console
rep = gp.replot(indir="./data", outdir="test/replot_test")

6.2 command line usage of replot


In [49]:
# !gseapy replot -i data -o test/replot_test