In [1]:
# This cell contains default parameter values for execution by `papermill`.
filename = '../sample_data/postgap.20180817.asthma.txt.gz'

In [2]:
# Parameters
filename = "./sample_data/postgap.20180817.asthma.txt.gz"

POSTGAP Report

This notebook was automatically generated as a summary of POSTGAP output.

Setup

For command-line usage (python reporter.py <filename>), the import below works as is. To edit the template interactively, temporarily change it to: import helpers


In [3]:
from reports import helpers

In [4]:
helpers.calc_run_str()


Notebook generated at 2019-01-10T10:58:11.273601 by gpeat

In [5]:
# pg = pd.read_csv(filename, sep='\t', na_values=['None'])
pg = helpers.load_file(filename)


/Users/gpeat/ensembl/postgap/tests/venv/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2850: DtypeWarning: Columns (21,44) have mixed types. Specify dtype option on import or set low_memory=False.
  if self.run_code(code, result):
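
The DtypeWarning above is emitted by pandas when a column's inferred type differs between the chunks it reads. The following is a minimal sketch of a loader that avoids it, assuming `helpers.load_file` wraps `pd.read_csv` much like the commented-out call in the cell above (its real body is not shown in this report). The name `load_postgap` and the two-row sample are hypothetical.

```python
import io

import pandas as pd


def load_postgap(path_or_buffer):
    """Hypothetical stand-in for helpers.load_file; the real helper is not shown."""
    return pd.read_csv(
        path_or_buffer,
        sep="\t",
        na_values=["None"],  # taken from the commented-out call in the cell above
        low_memory=False,    # read the whole file before inferring column dtypes
    )


# Tiny in-memory sample; the column names are borrowed from later sections of this report.
sample = io.StringIO(
    "gene_id\tgwas_beta\n"
    "ENSG00000073605\t0.12\n"
    "ENSG00000186075\tNone\n"
)
df = load_postgap(sample)
```

Passing an explicit `dtype` mapping for the two affected columns would be the other option the warning mentions; which one suits the pipeline depends on what those columns are expected to hold.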

Headline

Q: How many rows and columns?


In [6]:
print(pg.shape)


(189351, 99)

Q: How many unique target-disease associations?


In [7]:
helpers.calc_g2d_pair_counts(pg)


field_pair unique_associations
[gene_id, disease_efo_id] 4377
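
A count like the one above can be reproduced by deduplicating on the two columns. This is only a sketch of the idea, not the body of `helpers.calc_g2d_pair_counts`; the function name `count_unique_pairs` is hypothetical, and dropping rows with a missing identifier is an assumption.

```python
import pandas as pd


def count_unique_pairs(df, left, right):
    """Count distinct (left, right) combinations, skipping rows with a missing value."""
    return len(df[[left, right]].dropna().drop_duplicates())


# Toy data: two genes, one disease, repeated rows.
toy = pd.DataFrame({
    "gene_id": ["G1", "G1", "G2", "G2"],
    "disease_efo_id": ["EFO_0000270"] * 4,
})
n_pairs = count_unique_pairs(toy, "gene_id", "disease_efo_id")
```

The same function applies unchanged to the other identifier pairs reported later, such as [gene_id, ld_snp_rsID].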

Q: What is the distribution of unique diseases per gene? And vice versa?


In [8]:
helpers.calc_pairwise_degree_dist(pg, 'gene_id', 'disease_efo_id', 'Gene', 'Disease')


/Users/gpeat/ensembl/postgap/tests/venv/lib/python3.6/site-packages/matplotlib/axes/_base.py:3285: UserWarning: Attempting to set identical bottom==top results
in singular transformations; automatically expanding.
bottom=1.0, top=1.0
  'bottom=%s, top=%s') % (bottom, top))
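
The matplotlib warning above is most likely a consequence of this file covering a single disease (disease_efo_id has one unique value): every gene maps to exactly one disease, so that distribution collapses to a single point. The degree distribution itself can be sketched as a group-by followed by a tally. The function name `degree_counts` is hypothetical, and the actual helper may compute or plot this differently.

```python
import pandas as pd


def degree_counts(df, key, other):
    """For each `key` value count distinct `other` values, then tally those counts."""
    per_key = df.groupby(key)[other].nunique()
    return per_key.value_counts().sort_index()


# Toy data: G1 is linked to two SNPs, rs1 is linked to three genes.
toy = pd.DataFrame({
    "gene_id": ["G1", "G1", "G2", "G3"],
    "ld_snp_rsID": ["rs1", "rs2", "rs1", "rs1"],
})
snps_per_gene = degree_counts(toy, "gene_id", "ld_snp_rsID")
genes_per_snp = degree_counts(toy, "ld_snp_rsID", "gene_id")
```

The index of each result is the degree (number of distinct partners) and the value is how many keys have that degree, which is the quantity a degree histogram displays.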

Identifiers

Q: How many unique values appear for each identifier?


In [9]:
helpers.calc_id_field_counts(pg)


field unique_values
gene_id 4377
ld_snp_rsID 8104
gwas_snp 458
disease_efo_id 1
gwas_pmid 39
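
A table of this shape can be built with `DataFrame.nunique`. The sketch below assumes the five identifier columns listed above; `unique_value_counts` is a hypothetical name, not the helper's implementation.

```python
import pandas as pd

# Identifier columns as they appear in the table above.
ID_FIELDS = ("gene_id", "ld_snp_rsID", "gwas_snp", "disease_efo_id", "gwas_pmid")


def unique_value_counts(df, fields=ID_FIELDS):
    """Return a two-column table: field name and number of distinct non-missing values."""
    counts = df[list(fields)].nunique()
    return counts.rename_axis("field").reset_index(name="unique_values")


toy = pd.DataFrame({
    "gene_id": ["G1", "G2", "G2"],
    "ld_snp_rsID": ["rs1", "rs2", "rs3"],
    "gwas_snp": ["rsA", "rsA", "rsB"],
    "disease_efo_id": ["EFO_0000270"] * 3,
    "gwas_pmid": ["PMID1"] * 3,
})
table = unique_value_counts(toy)
```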

Q: What is the maximum number of rows for a given fixed identifier?


In [10]:
helpers.calc_id_field_max_rows(pg)


gene_id row_occurrences
ENSG00000073605 1312
ENSG00000186075 1295
ENSG00000141741 1293

ld_snp_rsID row_occurrences
rs11669540 504
rs11557466 500
rs12939832 500

gwas_snp row_occurrences
rs16989837 8331
rs7216389 5796
rs2305480 5768

disease_efo_id row_occurrences
EFO_0000270 189351

gwas_pmid row_occurrences
PMID29273806 37141
PMID27611488 23188
PMID28461288 16932
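
The per-identifier row counts above show the three most frequent values of each field. A minimal way to obtain such a ranking is `Series.value_counts`; the name `top_values` and the cut-off of three are assumptions read off the output, not the helper's code.

```python
import pandas as pd


def top_values(df, field, n=3):
    """Return the n most frequent values of `field` with their row counts."""
    return df[field].value_counts().head(n)


toy = pd.DataFrame({"gwas_snp": ["rsA"] * 3 + ["rsB"] * 2 + ["rsC"]})
top = top_values(toy, "gwas_snp", n=2)
```

`value_counts` sorts in descending order of frequency, so `head(n)` is enough to keep the most heavily represented identifiers.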

Identifier pairs

Q: How many unique identifier pairs appear?


In [11]:
helpers.calc_id_field_pair_counts(pg)


field_pair unique_associations
[gene_id, ld_snp_rsID] 117424
[ld_snp_rsID, gwas_snp] 10466
[gwas_snp, disease_efo_id] 458
[disease_efo_id, gwas_pmid] 39

Gene-LD SNP associations

Q: What is the distribution of each association subscore (VEP, GTEx, etc.)?


In [12]:
helpers.calc_g2v_field_hists(pg)


Q: What is the distribution of unique LD SNPs per gene? And vice versa?


In [13]:
helpers.calc_pairwise_degree_dist(pg, 'gene_id', 'ld_snp_rsID', 'Gene', 'LD SNP')


Q: What is the overlap between presence of association subscores?


In [14]:
helpers.calc_g2v_field_overlap(pg)


['GTEx']                                          99950
['PCHiC', 'GTEx']                                  5125
['GTEx', 'Nearest']                                3807
['PCHiC']                                          2173
['VEP', 'GTEx', 'Nearest']                         2065
['VEP', 'GTEx']                                    1591
['Nearest']                                         585
['VEP', 'PCHiC', 'GTEx', 'Nearest']                 404
['VEP', 'Nearest']                                  366
['VEP']                                             282
['GTEx', 'DHS']                                     262
['PCHiC', 'GTEx', 'Nearest']                        227
['VEP', 'PCHiC', 'GTEx']                            135
['Regulome', 'GTEx']                                 84
['VEP', 'PCHiC', 'Nearest']                          52
['GTEx', 'Fantom5']                                  49
['DHS']                                              35
['PCHiC', 'GTEx', 'DHS']                             32
['VEP', 'GTEx', 'DHS', 'Nearest']                    28
['VEP', 'PCHiC']                                     27
['PCHiC', 'Nearest']                                 27
['GTEx', 'DHS', 'Nearest']                            9
['PCHiC', 'GTEx', 'Fantom5']                          9
['VEP', 'GTEx', 'DHS']                                9
['VEP', 'PCHiC', 'GTEx', 'Fantom5', 'Nearest']        8
['GTEx', 'Fantom5', 'Nearest']                        7
['Regulome', 'PCHiC', 'GTEx']                         7
['Regulome', 'PCHiC']                                 7
['VEP', 'GTEx', 'Fantom5', 'Nearest']                 7
['Fantom5']                                           5
['PCHiC', 'Fantom5']                                  5
['GTEx', 'Fantom5', 'DHS']                            4
['PCHiC', 'DHS']                                      4
['DHS', 'Nearest']                                    3
['Regulome', 'GTEx', 'Nearest']                       3
['VEP', 'Regulome', 'GTEx', 'Nearest']                3
['VEP', 'DHS', 'Nearest']                             3
['VEP', 'DHS']                                        3
['PCHiC', 'GTEx', 'Fantom5', 'Nearest']               2
['VEP', 'Regulome']                                   2
['VEP', 'Fantom5']                                    2
['VEP', 'PCHiC', 'GTEx', 'DHS', 'Nearest']            2
['VEP', 'PCHiC', 'GTEx', 'DHS']                       2
['Regulome', 'Nearest']                               2
['VEP', 'GTEx', 'Fantom5']                            2
['VEP', 'GTEx', 'Fantom5', 'DHS', 'Nearest']          1
['PCHiC', 'GTEx', 'Fantom5', 'DHS']                   1
['VEP', 'PCHiC', 'DHS']                               1
['VEP', 'PCHiC', 'Fantom5', 'Nearest']                1
['VEP', 'Regulome', 'PCHiC', 'Nearest']               1
['VEP', 'Regulome', 'GTEx']                           1
['VEP', 'PCHiC', 'GTEx', 'Fantom5']                   1
['VEP', 'Regulome', 'Nearest']                        1
dtype: int64
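
The counts above sum to 117424, the number of unique [gene_id, ld_snp_rsID] pairs reported earlier, which suggests the overlap is tallied once per gene-LD SNP pair. The sketch below follows that reading. The subscore column names, the rule that a subscore is "present" when it is greater than zero, and the function name `subscore_overlap` are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical column names, chosen to match the labels in the output above.
SUBSCORES = ("VEP", "GTEx", "PCHiC", "Nearest")


def subscore_overlap(df, pair=("gene_id", "ld_snp_rsID"), fields=SUBSCORES):
    """Tally which combination of subscores is present for each unique pair."""
    per_pair = df.drop_duplicates(subset=list(pair))
    # Assumption: a missing or zero score means the subscore is absent.
    present = per_pair[list(fields)].fillna(0) > 0
    labels = present.apply(lambda row: str([f for f in fields if row[f]]), axis=1)
    return labels.value_counts()


toy = pd.DataFrame({
    "gene_id": ["G1", "G1", "G2", "G3"],
    "ld_snp_rsID": ["rs1", "rs1", "rs1", "rs2"],
    "VEP": [0, 0, 1, 0],
    "GTEx": [0.5, 0.5, 0.3, 0.2],
    "PCHiC": [0.0, 0.0, 0.0, None],
    "Nearest": [0, 0, 1, 0],
})
overlap = subscore_overlap(toy)
```

In the toy data the duplicated (G1, rs1) row is counted once, so the GTEx-only combination has a count of two rather than three.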

Q: What is the joint distribution between association subscore pairs (i.e., how correlated are they)?


In [15]:
helpers.calc_g2v_field_cross_dists(pg)


LD SNP-GWAS SNP associations

Q: What is the distribution of r2?


In [16]:
helpers.calc_dist_r2(pg)
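
The plot for this cell is not preserved in the text export. As a rough sketch of how such a distribution could be summarised numerically, the snippet below bins values into equal-width intervals on [0, 1]. The function name `r2_histogram` and the choice of ten bins are assumptions; the helper's plotting details are not shown here.

```python
import numpy as np
import pandas as pd


def r2_histogram(r2, bins=10):
    """Count values per equal-width bin on [0, 1]; returns (counts, bin_edges)."""
    values = pd.Series(r2).dropna().to_numpy()
    return np.histogram(values, bins=bins, range=(0.0, 1.0))


# Toy values, deliberately placed away from bin edges.
counts, edges = r2_histogram([0.71, 0.75, 0.95, 1.0, 1.0])
```

`numpy.histogram` treats the last bin as closed on both sides, so values of exactly 1.0 are counted in the top bin.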


Q: What is the distribution of unique GWAS SNPs per LD SNP? And vice versa?


In [17]:
helpers.calc_pairwise_degree_dist(pg, 'ld_snp_rsID', 'gwas_snp', 'LD SNP', 'GWAS SNP')


GWAS SNP-Disease associations

Q: What are the distributions of (gwas_pvalue, gwas_beta, gwas_odds_ratio)?


In [18]:
helpers.calc_v2d_field_hists(pg)


Q: What is the distribution of unique diseases per GWAS SNP? And vice versa?


In [19]:
helpers.calc_pairwise_degree_dist(pg, 'gwas_snp', 'disease_efo_id', 'GWAS SNP', 'Disease')


/Users/gpeat/ensembl/postgap/tests/venv/lib/python3.6/site-packages/matplotlib/axes/_base.py:3285: UserWarning: Attempting to set identical bottom==top results
in singular transformations; automatically expanding.
bottom=1.0, top=1.0
  'bottom=%s, top=%s') % (bottom, top))