In [1]:

    
# This cell contains default parameter values for execution by `papermill`.
filename = '../sample_data/postgap.20180817.asthma.txt.gz'



In [2]:

    
# Parameters
filename = "./sample_data/postgap.20180817.asthma.txt.gz"

POSTGAP Report

This notebook was automatically generated as a summary of POSTGAP output.

Setup

For command-line usage (python reporter.py <filename>), the import below works as written. To edit the template interactively, temporarily change it to import helpers.



In [3]:

    
from reports import helpers



In [4]:

    
helpers.calc_run_str()









    



Notebook generated at 2019-01-10T10:58:11.273601 by gpeat



In [5]:

    
# pg = pd.read_csv(filename, sep='\t', na_values=['None'])
pg = helpers.load_file(filename)









    



/Users/gpeat/ensembl/postgap/tests/venv/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2850: DtypeWarning: Columns (21,44) have mixed types. Specify dtype option on import or set low_memory=False.
  if self.run_code(code, result):
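The DtypeWarning above comes from pandas inferring column types chunk by chunk. The internals of helpers.load_file are not shown in this report, so the following is only a minimal sketch, based on the commented-out read_csv call, of one way to load such a file without the warning. The tiny in-memory file and the name pg_example are made up for illustration.

```python
import gzip
import io

import pandas as pd

# Tiny in-memory stand-in for a gzip-compressed, tab-separated POSTGAP file
# (hypothetical content, not real POSTGAP output).
raw = "gene_id\tld_snp_rsID\tscore\nENSG1\trs1\t0.5\nENSG2\trs2\tNone\n"
buf = io.BytesIO(gzip.compress(raw.encode("utf-8")))

# low_memory=False makes pandas infer each column's dtype from the whole file
# instead of per chunk, which avoids the mixed-type DtypeWarning.
pg_example = pd.read_csv(
    buf,
    sep="\t",
    na_values=["None"],
    compression="gzip",
    low_memory=False,
)
print(pg_example.dtypes)
```

An alternative is to pass an explicit dtype mapping for the affected columns, which also documents the expected types.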

Headline

Q: How many rows and columns?



In [6]:

    
print(pg.shape)









    



(189351, 99)

Q: How many unique target-disease associations?



In [7]:

    
helpers.calc_g2d_pair_counts(pg)









    





  
    
field_pair                   unique_associations
[gene_id, disease_efo_id]    4377
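The count above can be reproduced in plain pandas by dropping duplicate (gene, disease) rows. This is a sketch on a made-up toy frame, not the actual body of helpers.calc_g2d_pair_counts, which is not shown here.

```python
import pandas as pd

# Toy data (hypothetical): two genes, one disease, with repeated rows.
toy = pd.DataFrame({
    "gene_id": ["G1", "G1", "G2", "G2"],
    "disease_efo_id": ["EFO_A", "EFO_A", "EFO_A", "EFO_A"],
})

# Each distinct (gene_id, disease_efo_id) combination counts once.
n_pairs = len(toy[["gene_id", "disease_efo_id"]].drop_duplicates())
print(n_pairs)
```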

Q: What is the distribution of unique diseases per gene? And vice versa?



In [8]:

    
helpers.calc_pairwise_degree_dist(pg, 'gene_id', 'disease_efo_id', 'Gene', 'Disease')









    



/Users/gpeat/ensembl/postgap/tests/venv/lib/python3.6/site-packages/matplotlib/axes/_base.py:3285: UserWarning: Attempting to set identical bottom==top results
in singular transformations; automatically expanding.
bottom=1.0, top=1.0
  'bottom=%s, top=%s') % (bottom, top))
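A degree distribution of this kind can be computed by counting distinct partners per identifier and then tallying how often each count occurs. The sketch below uses a made-up toy frame and is an assumption about the approach, not the code behind helpers.calc_pairwise_degree_dist (which also plots the result).

```python
import pandas as pd

# Toy data (hypothetical): G1 maps to two diseases, G2 and G3 to one each.
toy = pd.DataFrame({
    "gene_id": ["G1", "G1", "G2", "G3"],
    "disease_efo_id": ["EFO_A", "EFO_B", "EFO_A", "EFO_A"],
})

pairs = toy[["gene_id", "disease_efo_id"]].drop_duplicates()

# Degree of each gene: number of distinct diseases it is linked to.
diseases_per_gene = pairs.groupby("gene_id")["disease_efo_id"].nunique()
# And vice versa: number of distinct genes per disease.
genes_per_disease = pairs.groupby("disease_efo_id")["gene_id"].nunique()

# Degree distribution: how many genes have degree k.
degree_dist = diseases_per_gene.value_counts().sort_index()
print(degree_dist)
```

When every identifier has the same degree, as happens with a single disease in the input, the distribution collapses to one value.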

Identifiers

Q: How many unique values appear for each identifier?



In [9]:

    
helpers.calc_id_field_counts(pg)









    





  
    
field             unique_values
gene_id           4377
ld_snp_rsID       8104
gwas_snp          458
disease_efo_id    1
gwas_pmid         39
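Per-field unique counts like these map directly onto DataFrame.nunique. A minimal sketch on a made-up toy frame follows; the ID_FIELDS list mirrors the field names in the table, but whether the helper uses such a list is an assumption.

```python
import pandas as pd

# Identifier columns as they appear in the table above.
ID_FIELDS = ["gene_id", "ld_snp_rsID", "gwas_snp", "disease_efo_id", "gwas_pmid"]

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "gene_id":        ["G1", "G1", "G2", "G2"],
    "ld_snp_rsID":    ["rs1", "rs2", "rs1", "rs1"],
    "gwas_snp":       ["rs9", "rs9", "rs9", "rs9"],
    "disease_efo_id": ["EFO_A", "EFO_A", "EFO_A", "EFO_A"],
    "gwas_pmid":      ["PMID1", "PMID1", "PMID2", "PMID2"],
})

# Number of distinct values in each identifier column.
unique_values = toy[ID_FIELDS].nunique()
print(unique_values)
```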

Q: What is the maximum number of rows for a given fixed identifier?



In [10]:

    
helpers.calc_id_field_max_rows(pg)









    





  
    
gene_id            row_occurrences
ENSG00000073605    1312
ENSG00000186075    1295
ENSG00000141741    1293
    
  







    





  
    
ld_snp_rsID    row_occurrences
rs11669540     504
rs11557466     500
rs12939832     500
    
  







    





  
    
gwas_snp      row_occurrences
rs16989837    8331
rs7216389     5796
rs2305480     5768
    
  







    





  
    
disease_efo_id    row_occurrences
EFO_0000270       189351
    
  







    





  
    
gwas_pmid       row_occurrences
PMID29273806    37141
PMID27611488    23188
PMID28461288    16932
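The tables above list the most frequent values of each identifier. One way to get such rankings is Series.value_counts followed by head. The sketch below uses made-up data and keeps the top two values per field; it is not the implementation of helpers.calc_id_field_max_rows.

```python
import pandas as pd

# Toy data (hypothetical): row counts differ per identifier value.
toy = pd.DataFrame({
    "gene_id":  ["G1", "G1", "G1", "G2", "G2", "G3"],
    "gwas_snp": ["rs9", "rs9", "rs9", "rs9", "rs8", "rs8"],
})

# value_counts sorts by frequency (descending), so head(n) gives the top n.
top_rows = {field: toy[field].value_counts().head(2) for field in toy.columns}

for field, counts in top_rows.items():
    print(field)
    print(counts)
```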

Identifier pairs

Q: How many unique identifier pairs appear?



In [11]:

    
helpers.calc_id_field_pair_counts(pg)









    





  
    
field_pair                     unique_associations
[gene_id, ld_snp_rsID]         117424
[ld_snp_rsID, gwas_snp]        10466
[gwas_snp, disease_efo_id]     458
[disease_efo_id, gwas_pmid]    39
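The pairs in the table are consecutive identifiers along the chain gene, LD SNP, GWAS SNP, disease, publication. Assuming that ordering is what the helper iterates over (an assumption, since its source is not shown), the counts could be sketched like this on made-up data:

```python
import pandas as pd

# Identifier chain, in the order the pairs appear in the table above.
ID_FIELDS = ["gene_id", "ld_snp_rsID", "gwas_snp", "disease_efo_id", "gwas_pmid"]

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "gene_id":        ["G1", "G1", "G2", "G2"],
    "ld_snp_rsID":    ["rs1", "rs2", "rs1", "rs1"],
    "gwas_snp":       ["rs9", "rs9", "rs9", "rs9"],
    "disease_efo_id": ["EFO_A", "EFO_A", "EFO_A", "EFO_A"],
    "gwas_pmid":      ["PMID1", "PMID1", "PMID2", "PMID2"],
})

# For each adjacent pair of fields, count the distinct value combinations.
pair_counts = {
    (a, b): len(toy[[a, b]].drop_duplicates())
    for a, b in zip(ID_FIELDS, ID_FIELDS[1:])
}
print(pair_counts)
```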

Gene-LD SNP associations

Q: What is the distribution of each association subscore (VEP, GTEx, etc.)?



In [12]:

    
helpers.calc_g2v_field_hists(pg)

Q: What is the distribution of unique LD SNPs per gene? And vice versa?



In [13]:

    
helpers.calc_pairwise_degree_dist(pg, 'gene_id', 'ld_snp_rsID', 'Gene', 'LD SNP')

Q: What is the overlap between presence of association subscores?



In [14]:

    
helpers.calc_g2v_field_overlap(pg)









    



['GTEx']                                          99950
['PCHiC', 'GTEx']                                  5125
['GTEx', 'Nearest']                                3807
['PCHiC']                                          2173
['VEP', 'GTEx', 'Nearest']                         2065
['VEP', 'GTEx']                                    1591
['Nearest']                                         585
['VEP', 'PCHiC', 'GTEx', 'Nearest']                 404
['VEP', 'Nearest']                                  366
['VEP']                                             282
['GTEx', 'DHS']                                     262
['PCHiC', 'GTEx', 'Nearest']                        227
['VEP', 'PCHiC', 'GTEx']                            135
['Regulome', 'GTEx']                                 84
['VEP', 'PCHiC', 'Nearest']                          52
['GTEx', 'Fantom5']                                  49
['DHS']                                              35
['PCHiC', 'GTEx', 'DHS']                             32
['VEP', 'GTEx', 'DHS', 'Nearest']                    28
['VEP', 'PCHiC']                                     27
['PCHiC', 'Nearest']                                 27
['GTEx', 'DHS', 'Nearest']                            9
['PCHiC', 'GTEx', 'Fantom5']                          9
['VEP', 'GTEx', 'DHS']                                9
['VEP', 'PCHiC', 'GTEx', 'Fantom5', 'Nearest']        8
['GTEx', 'Fantom5', 'Nearest']                        7
['Regulome', 'PCHiC', 'GTEx']                         7
['Regulome', 'PCHiC']                                 7
['VEP', 'GTEx', 'Fantom5', 'Nearest']                 7
['Fantom5']                                           5
['PCHiC', 'Fantom5']                                  5
['GTEx', 'Fantom5', 'DHS']                            4
['PCHiC', 'DHS']                                      4
['DHS', 'Nearest']                                    3
['Regulome', 'GTEx', 'Nearest']                       3
['VEP', 'Regulome', 'GTEx', 'Nearest']                3
['VEP', 'DHS', 'Nearest']                             3
['VEP', 'DHS']                                        3
['PCHiC', 'GTEx', 'Fantom5', 'Nearest']               2
['VEP', 'Regulome']                                   2
['VEP', 'Fantom5']                                    2
['VEP', 'PCHiC', 'GTEx', 'DHS', 'Nearest']            2
['VEP', 'PCHiC', 'GTEx', 'DHS']                       2
['Regulome', 'Nearest']                               2
['VEP', 'GTEx', 'Fantom5']                            2
['VEP', 'GTEx', 'Fantom5', 'DHS', 'Nearest']          1
['PCHiC', 'GTEx', 'Fantom5', 'DHS']                   1
['VEP', 'PCHiC', 'DHS']                               1
['VEP', 'PCHiC', 'Fantom5', 'Nearest']                1
['VEP', 'Regulome', 'PCHiC', 'Nearest']               1
['VEP', 'Regulome', 'GTEx']                           1
['VEP', 'PCHiC', 'GTEx', 'Fantom5']                   1
['VEP', 'Regulome', 'Nearest']                        1
dtype: int64
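An overlap table like the one above can be built by labelling each row with the list of subscores that are present and counting the labels. The sketch below makes two assumptions that the report does not confirm: the column names (VEP, GTEx, Nearest) are used directly as subscore columns, and "present" means a non-null value greater than zero.

```python
import pandas as pd

# Hypothetical subscore columns; real POSTGAP column names may differ.
SUBSCORES = ["VEP", "GTEx", "Nearest"]

# Toy data (made up). None marks a missing subscore.
toy = pd.DataFrame({
    "VEP":     [0, 1, 0, None],
    "GTEx":    [0.5, 0.2, 0.1, 0.3],
    "Nearest": [0, 0, 0, 1],
})

# Assumption: a subscore is "present" when it is non-null and above zero.
present = toy[SUBSCORES].fillna(0) > 0

# Label each row with the list of subscores present, then count the labels.
labels = present.apply(lambda row: str([c for c in SUBSCORES if row[c]]), axis=1)
overlap_counts = labels.value_counts()
print(overlap_counts)
```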

Q: What is the joint distribution between association subscore pairs (i.e., how correlated are they)?



In [15]:

    
helpers.calc_g2v_field_cross_dists(pg)

LD SNP-GWAS SNP associations

Q: What is the distribution of r2?



In [16]:

    
helpers.calc_dist_r2(pg)

Q: What is the distribution of unique GWAS SNPs per LD SNP? And vice versa?



In [17]:

    
helpers.calc_pairwise_degree_dist(pg, 'ld_snp_rsID', 'gwas_snp', 'LD SNP', 'GWAS SNP')

GWAS SNP-Disease associations

Q: What are the distributions of (gwas_pvalue, gwas_beta, gwas_odds_ratio)?



In [18]:

    
helpers.calc_v2d_field_hists(pg)

Q: What is the distribution of unique diseases per GWAS SNP? And vice versa?



In [19]:

    
helpers.calc_pairwise_degree_dist(pg, 'gwas_snp', 'disease_efo_id', 'GWAS SNP', 'Disease')









    



/Users/gpeat/ensembl/postgap/tests/venv/lib/python3.6/site-packages/matplotlib/axes/_base.py:3285: UserWarning: Attempting to set identical bottom==top results
in singular transformations; automatically expanding.
bottom=1.0, top=1.0
  'bottom=%s, top=%s') % (bottom, top))
