tutorial: Data Camp Jupyter Notebook Tutorial

  • notebook focuses on analysis of data from following command set

Commands Run To Create DataSets

  • all commands run in linux using python 3.5.2
  • run times given are for i7-5820K w/ 64 GB RAM
  • processing is CPU not memory intensive
  • some code could be parallized to reduce run times

    cd /data640g1/data/documents/docs/projects/payton_venusar/VENUSAR_DEV/venusar

Compute Motif Thresholds

  • Motif = TF, language used interchangeably
  • 19:11:33 - 17:36:16 = 95 minutes 17 seconds run time

    python3 thresholds.py -fpr 0.001 -m ../../data/HOCOMOCOv10.JASPAR_FORMAT.TF_IDS.txt -o ../../data/HOCOMOCOv10.JASPAR_FORMAT.TF_IDS.fpr_0p001.txt

Filter Motifs by Gene Expression

  • time to run is inconsequential; under a second

    python3 tf_expression.py -i ../../data/FLDL_CCCB_RARE_VARIANTS.MERGED.RNA_DP10.RNA_NODUPS.CHIP_MULTIMARK.SORTED.vcf -o 1 -m ../../data/HOCOMOCOv10.JASPAR_FORMAT.TF_IDS.fpr_0p001.txt -e ../../data/ALL_ARRAYS_NORMALIZED_MAXPROBE_LOG2_COORDS.sorted.txt -mo ../../data/HOCOMOCOv10.JASPAR_FORMAT.TF_IDS.fpr_0p001.txt.bed_reduced.txt

Modify vcf File to Show Which Motif Matches Changed (reference vs variant)

  • runtime ~11 hours with homotypic (without 20170115.02:19:36-20170114.19:10:53 = 7 hours 8 minutes 43 seconds)

    python3 motifs.py -i ../../data/FLDL_CCCB_RARE_VARIANTS.MERGED.RNA_DP10.RNA_NODUPS.CHIP_MULTIMARK.SORTED.vcf -r ../../data/genome_reference/reference_genome_hg19.fa -m ../../data/HOCOMOCOv10.JASPAR_FORMAT.TF_IDS.fpr_0p001.txt.bed_reduced.txt -o ../../data/output.motif.20170114.vcf -fm -fp -ci ../../data/GM12878.ENCODE.ALL_TFS.bed -co ../../data/output.chip_peaks_output.20170114.bed &> ../../data/0_run_logs/20170114_motifs_run_stdout.txt

z-scores for Chip Peaks in samples with/without variant

  • 15:27:32 - 14:51:01 = 36 min 27 seconds run time
  • -i is either output of tf_expression if run after motifs, or motifs if it was run later

    python3 activity.py -i ../../data/output.motif.20170114.vcf -a ../../data/QN_FLDL_CCCB_K27AC_PEAKS_SIGNAL.bed -ov ../../data/output.activity.20170114.vcf -ob ../../data/output.activity.20170114.bed -th 2 &> ../../data/0_run_logs/20170114_activity_run_stdout.txt

z-scores for genes with specified distance from variant

  • 15:27:45 - 15:27:33 = 12 seconds run time

    python3 gene_expression.py -i ../../data/output.activity.20170114.vcf -e ../../data/ALL_ARRAYS_NORMALIZED_MAXPROBE_LOG2_COORDS.sorted.txt -ov ../../data/output.gene_expression.20170114.vcf -eth 5 -th 2 &> ../../data/0_run_logs/20170114_gene_expression_run_stdout.txt


In [2]:
import venusar
import motif
import thresholds
import motifs
import activity
import tf_expression
import gene_expression

In [54]:
# plotly by default wants to put everything in the cloud
#    ref: https://plot.ly/python/getting-started/
# must run this code before plotly command calls to avoid account setup request and allow local usage
#    note: plotly.plotly methods are cloud only (ridiculous should be able to use localhost server)
#    instead must use plotly.offline and plotly.iplot
#    iplot is jupyter notebook specific
# Plotly Offline allows you to create graphs offline and save them locally.
#   two methods for plotting offline: plotly.offline.plot() and plotly.offline.iplot().
# ref:https://plot.ly/python/offline/
#
# ref: generally plotly plot ref: https://plot.ly/python/user-guide/

#import plotly.plotly as py # cloud only
#import plotly.graph_objs as go
import plotly
#help(plotly.offline.iplot)
print( plotly.__version__ )

plotly.offline.init_notebook_mode(connected=True)


2.0.0

In [39]:
motif_f_base='../../data/HOCOMOCOv10.JASPAR_FORMAT.TF_IDS.txt'
pc = 0.1
th = 0
bp = [0.25, 0.25, 0.25, 0.25]
motif_set_base = motif.get_motifs(motif_f_base, pc, th, bp)

In [40]:
motif_f_expressed='../../data/HOCOMOCOv10.JASPAR_FORMAT.TF_IDS.fpr_0p001.txt.bed_reduced.txt'
motif_set_expressed = motif.get_motifs(motif_f_expressed, pc, th, bp)

In [41]:
print( "length " + format(len(motif_set_expressed.motifs)) )

print( "length " + format(motif_set_expressed.length()) )


length 603
length 603

In [58]:
motif_lengths_base = motif_set_base.element_positions_list(False)
motif_lengths_expressed = motif_set_expressed.element_positions_list(False)
print( "motif_lengths_base type: " + format(type(motif_lengths_base)) + " has " + format(len(motif_lengths_base)) + " elements.")
print( "motif_lengths_expressed type: " + format(type(motif_lengths_expressed)) + " has " + format(len(motif_lengths_expressed)) + " elements.")
print( "motif_lengths_expressed has " + format(len(motif_lengths_base) - len(motif_lengths_expressed)) + " fewer elements.")


motif_lengths_base type: <class 'list'> has 641 elements.
motif_lengths_expressed type: <class 'list'> has 603 elements.
motif_lengths_expressed has 38 fewer elements.

In [46]:
# ref: https://plot.ly/python/histograms/

data = [
    plotly.graph_objs.Histogram(
        x=motif_lengths_base
    )
]

layout = plotly.graph_objs.Layout(
    title='TF Length Histogram after running tf_expression.py',
    xaxis=dict(
        title='TF Length',
        ticklen=5,
        zeroline=False,
        gridwidth=2,
    ),
    yaxis={'title':'Number of TF with Length'}
    )
#plotly.offline.iplot(data)  # basic plot
go_figure = plotly.graph_objs.Figure(data=data, layout=layout)
plotly.offline.iplot(go_figure)



In [62]:
# repeating graph setup using reproducible code
# ref: https://plot.ly/python/histograms/

# -- setup the layout information to reuse
xaxis_template = {
        'title':'TF Length',
        'ticklen':5,
        'zeroline':False,
        'gridwidth':2,
    }
yaxis_template ={'title':'Number of TF with Length'}

# -- plot 1
hist_all_TF = plotly.graph_objs.Histogram(
        name='All TF',
        x=motif_lengths_base,
        opacity=0.75
    )
layout = plotly.graph_objs.Layout(
    title='TF Length Histogram (all TF)',
    xaxis=xaxis_template,
    yaxis=yaxis_template
    )

go_figure = plotly.graph_objs.Figure(data=[hist_all_TF], layout=layout)
plotly.offline.iplot(go_figure)

# -- plot 2
hist_expressed_TF = plotly.graph_objs.Histogram(
        name='Expressed TF',
        x=motif_lengths_expressed,
        opacity=0.75
    )
layout = plotly.graph_objs.Layout(
    title='TF Length Histogram after running tf_expression.py',
    xaxis=xaxis_template,
    yaxis=yaxis_template
    )

go_figure = plotly.graph_objs.Figure(data=[hist_expressed_TF], layout=layout)
plotly.offline.iplot(go_figure)

# -- plot overlay
layout = plotly.graph_objs.Layout(
    title='TF Length Histogram comparison tf_expression.py dropped',
    xaxis=xaxis_template,
    yaxis=yaxis_template,
    barmode='overlay'
    )
go_figure = plotly.graph_objs.Figure(data=[hist_all_TF, hist_expressed_TF], layout=layout)
plotly.offline.iplot(go_figure)



In [61]:
help(go_figure.layout)
go_figure.layout


Help on PlotlyDict in module plotly.graph_objs.graph_objs object:

class PlotlyDict(builtins.dict, PlotlyBase)
 |  Base class for dict-like Plotly objects.
 |  
 |  Method resolution order:
 |      PlotlyDict
 |      builtins.dict
 |      PlotlyBase
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __copy__(self)
 |  
 |  __deepcopy__(self, memodict={})
 |  
 |  __dir__(self)
 |      Dynamically return the existing and possible attributes.
 |  
 |  __getattr__(self, key)
 |      Python only calls this when key is missing!
 |  
 |  __getitem__(self, key)
 |      Calls __missing__ when key is not found. May mutate object.
 |  
 |  __init__(self, *args, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __missing__(self, key)
 |      Mimics defaultdict. This is called from __getitem__ when key DNE.
 |  
 |  __setattr__(self, key, value)
 |      Maps __setattr__ onto __setitem__
 |  
 |  __setitem__(self, key, value, _raise=True)
 |      Validates/Converts values which should be Graph Objects.
 |  
 |  force_clean(self, **kwargs)
 |      Recursively remove empty/None values.
 |  
 |  get_data(self, flatten=False)
 |      Returns the JSON for the plot with non-data elements stripped.
 |  
 |  get_ordered(self, **kwargs)
 |      Return a predictable, OrderedDict version of self.
 |  
 |  help(self, attribute=None, return_help=False)
 |      Print help string for this object or an attribute of this object.
 |      
 |      :param (str) attribute: A valid attribute string for this object.
 |      :param (bool) return_help: Return help_string instead of printing it?
 |      :return: (None|str)
 |  
 |  strip_style(self)
 |      Recursively strip style from the current representation.
 |      
 |      All PlotlyDicts and PlotlyLists are guaranteed to survive the
 |      stripping process, though they made be left empty. This is allowable.
 |      
 |      Keys that will be stripped in this process are tagged with
 |      `'type': 'style'` in graph_objs_meta.json. Note that a key tagged as
 |      style, but with an array as a value may still be considered data.
 |  
 |  to_string(self, level=0, indent=4, eol='\n', pretty=True, max_chars=80)
 |      Returns a formatted string showing graph_obj constructors.
 |      
 |      :param (int) level: The number of indentations to start with.
 |      :param (int) indent: The indentation amount.
 |      :param (str) eol: The end of line character(s).
 |      :param (bool) pretty: Curtail long list output with a '..' ?
 |      :param (int) max_chars: The max characters per line.
 |      
 |      Example:
 |      
 |          print(obj.to_string())
 |  
 |  update(self, dict1=None, **dict2)
 |      Update current dict with dict1 and then dict2.
 |      
 |      This recursively updates the structure of the original dictionary-like
 |      object with the new entries in the second and third objects. This
 |      allows users to update with large, nested structures.
 |      
 |      Note, because the dict2 packs up all the keyword arguments, you can
 |      specify the changes as a list of keyword agruments.
 |      
 |      Examples:
 |      # update with dict
 |      obj = Layout(title='my title', xaxis=XAxis(range=[0,1], domain=[0,1]))
 |      update_dict = dict(title='new title', xaxis=dict(domain=[0,.8]))
 |      obj.update(update_dict)
 |      obj
 |      {'title': 'new title', 'xaxis': {'range': [0,1], 'domain': [0,.8]}}
 |      
 |      # update with list of keyword arguments
 |      obj = Layout(title='my title', xaxis=XAxis(range=[0,1], domain=[0,1]))
 |      obj.update(title='new title', xaxis=dict(domain=[0,.8]))
 |      obj
 |      {'title': 'new title', 'xaxis': {'range': [0,1], 'domain': [0,.8]}}
 |      
 |      This 'fully' supports duck-typing in that the call signature is
 |      identical, however this differs slightly from the normal update
 |      method provided by Python's dictionaries.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from builtins.dict:
 |  
 |  __contains__(self, key, /)
 |      True if D has a key k, else False.
 |  
 |  __delitem__(self, key, /)
 |      Delete self[key].
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __iter__(self, /)
 |      Implement iter(self).
 |  
 |  __le__(self, value, /)
 |      Return self<=value.
 |  
 |  __len__(self, /)
 |      Return len(self).
 |  
 |  __lt__(self, value, /)
 |      Return self<value.
 |  
 |  __ne__(self, value, /)
 |      Return self!=value.
 |  
 |  __new__(*args, **kwargs) from builtins.type
 |      Create and return a new object.  See help(type) for accurate signature.
 |  
 |  __repr__(self, /)
 |      Return repr(self).
 |  
 |  __sizeof__(...)
 |      D.__sizeof__() -> size of D in memory, in bytes
 |  
 |  clear(...)
 |      D.clear() -> None.  Remove all items from D.
 |  
 |  copy(...)
 |      D.copy() -> a shallow copy of D
 |  
 |  fromkeys(iterable, value=None, /) from builtins.type
 |      Returns a new dict with keys from iterable and values equal to value.
 |  
 |  get(...)
 |      D.get(k[,d]) -> D[k] if k in D, else d.  d defaults to None.
 |  
 |  items(...)
 |      D.items() -> a set-like object providing a view on D's items
 |  
 |  keys(...)
 |      D.keys() -> a set-like object providing a view on D's keys
 |  
 |  pop(...)
 |      D.pop(k[,d]) -> v, remove specified key and return the corresponding value.
 |      If key is not found, d is returned if given, otherwise KeyError is raised
 |  
 |  popitem(...)
 |      D.popitem() -> (k, v), remove and return some (key, value) pair as a
 |      2-tuple; but raise KeyError if D is empty.
 |  
 |  setdefault(...)
 |      D.setdefault(k[,d]) -> D.get(k,d), also set D[k]=d if k not in D
 |  
 |  values(...)
 |      D.values() -> an object providing a view on D's values
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes inherited from builtins.dict:
 |  
 |  __hash__ = None
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from PlotlyBase:
 |  
 |  to_graph_objs(self, **kwargs)
 |      Everything is cast into graph_objs. Here for backwards compat.
 |  
 |  validate(self)
 |      Everything is *always* validated now. keep for backwards compat.

Out[61]:
{'barmode': 'overlay',
 'title': 'TF Length Histogram comparison tf_expression.py dropped',
 'xaxis': {'gridwidth': 2,
  'ticklen': 5,
  'title': 'TF Length',
  'zeroline': False},
 'yaxis': {'title': 'Number of TF with Length'}}

In [ ]:
# next create a reader that exports motifs=TF, samples, and variant information from the vcf file
#   as a python structure
#
#   get vcf reader for variant elements from motifs.py
#   get vcf reader for samples from gene_expression.py
#
# then do similar and expanded analysis looking at which motifs were selected/exported by variant
# could also group by variant type
#   a->t, etc
#

Expression Plots

z-scores for genes with specified distance from variant

  • run-time seconds
  • operates against gene_expression.py output vcf file

    python3 summarize.py -i ../../data/output.gene_expression.20170114.vcf -o ../../data/output.gene_expression.20170114.table


In [1]:
# create a pandas data frame of the data table
import pandas as pandas
vcf_table_file = '../../data/output.gene_expression.20170114.table_full.txt'
vcf_table = pandas.read_csv(vcf_table_file, sep='\t')

In [3]:
# confirm type
type(vcf_table)


Out[3]:
pandas.core.frame.DataFrame

In [10]:
vcf_table.sort_values('DISTANCE', ascending=False)


Out[10]:
CHR POS REF ALT MOTIF LOG2_VAR-REF_SCORE SAMPLE LOCIID ACT_ZSCORE GENE EXP_ZSCORE DISTANCE
40525 chr11 102217809 C CCT SCRT2 34.0930 CB021314 7230 2.4329 BIRC3 -2.5774 34.276736
40520 chr11 102217809 C CCT SCRT2 34.0930 FL313 7230 3.1783 TMEM123 -0.8938 34.252491
40521 chr11 102217809 C CCT SCRT2 34.0930 FL313 7230 3.1783 BIRC2 0.8854 34.252273
40519 chr11 102217809 C CCT SCRT2 34.0930 FL313 7230 3.1783 BIRC3 0.7117 34.248223
40527 chr11 102217809 C CCT SCRT2 34.0930 CB021314 7230 2.4329 BIRC2 -1.5596 34.215260
40526 chr11 102217809 C CCT SCRT2 34.0930 CB021314 7230 2.4329 TMEM123 0.9745 34.193586
40523 chr11 102217809 C CCT SCRT2 34.0930 DL551 7230 -1.1479 TMEM123 -1.5653 34.148214
40522 chr11 102217809 C CCT SCRT2 34.0930 DL551 7230 -1.1479 BIRC3 -1.2232 34.134243
40524 chr11 102217809 C CCT SCRT2 34.0930 DL551 7230 -1.1479 BIRC2 -0.8806 34.123684
40530 chr11 102217809 C CCT SCRT2 34.0930 DL3A538 7230 1.3188 BIRC2 0.3539 34.120333
40529 chr11 102217809 C CCT SCRT2 34.0930 DL3A538 7230 1.3188 TMEM123 -0.2827 34.119669
40528 chr11 102217809 C CCT SCRT2 34.0930 DL3A538 7230 1.3188 BIRC3 -0.1391 34.118781
113172 chr19 34882472 A G,T ATF7 33.3535 FL313 19270 -3.9604 GPI -1.5128 33.621857
113171 chr19 34882472 A G,T ATF7 33.3535 FL313 19270 -3.9604 UBA2 -1.2825 33.612283
113169 chr19 34882472 A G,T ATF7 33.3535 FL313 19270 -3.9604 PDCD2L -0.8958 33.599750
113170 chr19 34882472 A G,T ATF7 33.3535 FL313 19270 -3.9604 KIAA0355 -0.4956 33.591462
113162 chr19 34882472 A G,T ATF7 33.3535 FL3A145 19270 -2.8333 KIAA0355 -0.4204 33.476265
113161 chr19 34882472 A G,T ATF7 33.3535 FL3A145 19270 -2.8333 PDCD2L -0.4147 33.476193
113164 chr19 34882472 A G,T ATF7 33.3535 FL3A145 19270 -2.8333 GPI -0.3142 33.475099
113163 chr19 34882472 A G,T ATF7 33.3535 FL3A145 19270 -2.8333 UBA2 0.0373 33.473645
113186 chr19 34882472 A G,T ATF7 33.3535 DL551 19270 1.1417 KIAA0355 -2.4354 33.461778
113193 chr19 34882472 A G,T ATF7 33.3535 CB012214 19270 -2.3369 PDCD2L 1.1652 33.455564
113196 chr19 34882472 A G,T ATF7 33.3535 CB012214 19270 -2.3369 GPI 0.6272 33.441149
113194 chr19 34882472 A G,T ATF7 33.3535 CB012214 19270 -2.3369 KIAA0355 0.4093 33.437772
113195 chr19 34882472 A G,T ATF7 33.3535 CB012214 19270 -2.3369 UBA2 -0.3862 33.437497
113182 chr19 34882472 A G,T ATF7 33.3535 FL174 19270 -1.0408 KIAA0355 -1.9927 33.429180
113189 chr19 34882472 A G,T ATF7 33.3535 FL255 19270 1.4130 PDCD2L -1.6874 33.426036
113174 chr19 34882472 A G,T ATF7 33.3535 DL135 19270 1.4505 KIAA0355 -1.6321 33.424896
113200 chr19 34882472 A G,T ATF7 33.3535 DL237 19270 1.3315 GPI -1.6818 33.422407
113176 chr19 34882472 A G,T ATF7 33.3535 DL135 19270 1.4505 GPI 1.5720 33.422015
... ... ... ... ... ... ... ... ... ... ... ... ...
141325 chr22 50683032 ATTTTTT ATTTTTTT ARID5B 0.0000 FL174 26126 0.1491 PLXNB2 -0.0101 0.149442
141850 chr22 50683032 ATTTTTT ATTTTTTT ETV2 0.0000 FL174 26126 0.1491 PLXNB2 -0.0101 0.149442
141920 chr22 50683032 ATTTTTT ATTTTTTT SOX5 0.0000 FL174 26126 0.1491 PLXNB2 -0.0101 0.149442
142235 chr22 50683032 ATTTTTT ATTTTTTT SOX3 0.0000 FL174 26126 0.1491 PLXNB2 -0.0101 0.149442
141535 chr22 50683032 ATTTTTT ATTTTTTT GRHL1 0.0000 FL174 26126 0.1491 PLXNB2 -0.0101 0.149442
142480 chr22 50683032 ATTTTTT ATTTTTTT SOX7 0.0000 FL174 26126 0.1491 PLXNB2 -0.0101 0.149442
141640 chr22 50683032 ATTTTTT ATTTTTTT HOXD10 0.0000 FL174 26126 0.1491 PLXNB2 -0.0101 0.149442
141395 chr22 50683032 ATTTTTT ATTTTTTT ZNF652 0.0000 FL174 26126 0.1491 PLXNB2 -0.0101 0.149442
141360 chr22 50683032 ATTTTTT ATTTTTTT HMBOX1 0.0000 FL174 26126 0.1491 PLXNB2 -0.0101 0.149442
142095 chr22 50683032 ATTTTTT ATTTTTTT HMGA1 0.0000 FL174 26126 0.1491 PLXNB2 -0.0101 0.149442
142410 chr22 50683032 ATTTTTT ATTTTTTT SOX11 0.0000 FL174 26126 0.1491 PLXNB2 -0.0101 0.149442
26633 chr11 47586721 TGGG TGGGG HEY2 0.0000 DL252 6414 -0.0734 PTPMT1 0.1216 0.142036
26549 chr11 47586721 TGGG TGGGG NHLH1 0.0000 DL252 6414 -0.0734 PTPMT1 0.1216 0.142036
168695 chr6 26198064 ACTCT ACTCTCT TCF7 0.0000 FL238 33676 0.0030 HIST1H2AD -0.1406 0.140632
77906 chr16 23519606 T C ESR1 -0.0126 DL551 14480 -0.0372 EARS2 0.1336 0.139254
77936 chr16 23519606 T C NR3C2 0.0000 DL551 14480 -0.0372 EARS2 0.1336 0.138682
146415 chr3 81810702 AGG AGGG ZFX 0.0000 DL237 27480 0.1289 GBE1 -0.0165 0.129952
177228 chr6 43445414 A G,C GSC2 0.0436 FL174 34264 0.0934 DLK2 0.0786 0.129624
149563 chr3 114160530 A G HIVEP2 0.0000 FL301 27815 -0.0684 ZBTB20 0.0690 0.097157
149683 chr3 114160530 A G RFX1 0.0000 FL301 27815 -0.0684 ZBTB20 0.0690 0.097157
163130 chr5 177547247 T C TP53 0.0556 FL153 33012 -0.0756 RMND5B 0.0121 0.094621
69806 chr15 66081098 T C NR5A2 0.0820 FL174 13285 -0.0083 DENND4A 0.0430 0.092962
88132 chr17 4851072 A G NR2F2 -0.0874 FL202 15627 0.0171 GP1BA -0.0243 0.092313
168698 chr6 26198064 ACTCT ACTCTCT TCF7 0.0000 FL238 33676 0.0030 HIST1H1E 0.0896 0.089650
73259 chr15 79235161 C G FOXA1 0.0468 FL255 13588 0.0001 CTSH 0.0591 0.075386
80934 chr16 81849039 C T PLAG1 0.0000 FL153 15256 -0.0186 PLCG2 0.0422 0.046117
81105 chr16 81849039 C T INSM1 0.0000 FL153 15256 -0.0186 PLCG2 0.0422 0.046117
191724 chr9 131444420 T C ZSCAN16 0.0000 FL174 41337 -0.0037 SPTAN1 -0.0418 0.041963
32499 chr11 65407855 A G NR2F2 0.0154 CB021314 6705 -0.0042 KCNK7 0.0080 0.017855
32503 chr11 65407855 A G NR2F2 0.0154 CB021314 6705 -0.0042 PCNXL3 -0.0067 0.017312

195808 rows × 12 columns


In [21]:
vcf_table.size


Out[21]:
2349696

In [22]:
vcf_table[vcf_table.DISTANCE > 16].size


Out[22]:
46080

In [16]:
vcf_table.plot.hist()


/usr/lib/python3/dist-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')
/usr/lib/python3/dist-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9d6d284160>

In [52]:
# XXX: This doesn't work yet; fails creating new column
# ref: https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf
# ref: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
import numpy
motif_score_scale = max(vcf_table['LOG2_VAR-REF_SCORE'] ** 2)
act_score_scale = max(vcf_table.ACT_ZSCORE ** 2)
gene_score_scale = max(vcf_table.EXP_ZSCORE ** 2)

print('Score Scaling Factors:\n\tMotif = \t', motif_score_scale, '\n\tActivity = \t', act_score_scale, '\n\tGene = \t\t', gene_score_scale ) 
#vcf_table['distance_w'] = vcf_table.apply( numpy.sqrt, ((vcf_table['LOG2_VAR-REF_SCORE'] ** 2) / motif_score_scale) + 
#                               ((vcf_table.ACT_ZSCORE ** 2) / act_score_scale ) + 
#                               ((vcf_table.EXP_ZSCORE ** 2) / gene_score_scale ) )
vcf_table['distance_w'] = numpy.sqrt( ((vcf_table['LOG2_VAR-REF_SCORE'] ** 2) / motif_score_scale) + ((vcf_table.ACT_ZSCORE ** 2) / act_score_scale ) + ((vcf_table.EXP_ZSCORE ** 2) / gene_score_scale ) )
vcf_table.sort_values('distance_w', ascending=False)


Score Scaling Factors:
	Motif = 	 1162.332649 
	Activity = 	 198.51964609 
	Gene = 		 67.24328004
Out[52]:
CHR POS REF ALT MOTIF LOG2_VAR-REF_SCORE SAMPLE LOCIID ACT_ZSCORE GENE EXP_ZSCORE DISTANCE distance_w
167811 chr6 24990939 GAAAAAA GAAAAAAA,GAAAAAAAA FOXL1 31.3370 DL551 33638 -1.1346 FAM65B -7.1462 32.161515 1.269172
168111 chr6 24990939 GAAAAAA GAAAAAAA,GAAAAAAAA CPEB1 27.6798 DL551 33638 -1.1346 FAM65B -7.1462 28.609908 1.193778
10824 chr1 178698151 A G KLF14 14.4409 DL551 2884 -1.1481 RALGPS2 -8.2002 16.646351 1.089061
135786 chr2 198093455 G A FOXD2 13.2244 DL551 22922 0.3094 ANKRD44 -8.1858 15.555957 1.071183
149807 chr3 130614653 C A CEBPA 12.2880 DL135 28070 14.0897 ATP2C1 -1.0391 18.724164 1.070497
10692 chr1 178698151 A G ELF1 -12.2764 DL551 2884 -1.1481 RALGPS2 -8.2002 14.807816 1.065974
40525 chr11 102217809 C CCT SCRT2 34.0930 CB021314 7230 2.4329 BIRC3 -2.5774 34.276736 1.062359
10776 chr1 178698151 A G ETV1 -11.6755 DL551 2884 -1.1481 RALGPS2 -8.2002 14.313585 1.060150
149957 chr3 130614653 C A BCL6B 9.3465 DL135 28070 14.0897 ATP2C1 -1.0391 16.939789 1.044612
10800 chr1 178698151 A G ETV6 -8.9482 DL551 2884 -1.1481 RALGPS2 -8.2002 12.191460 1.037076
10728 chr1 178698151 A G ETV4 -8.6110 DL551 2884 -1.1481 RALGPS2 -8.2002 11.946160 1.034617
113172 chr19 34882472 A G,T ATF7 33.3535 FL313 19270 -3.9604 GPI -1.5128 33.621857 1.034472
149927 chr3 130614653 C A CEBPG 7.7879 DL135 28070 14.0897 ATP2C1 -1.0391 16.132289 1.033556
40527 chr11 102217809 C CCT SCRT2 34.0930 CB021314 7230 2.4329 BIRC2 -1.5596 34.215260 1.032467
40520 chr11 102217809 C CCT SCRT2 34.0930 FL313 7230 3.1783 TMEM123 -0.8938 34.252491 1.030905
40521 chr11 102217809 C CCT SCRT2 34.0930 FL313 7230 3.1783 BIRC2 0.8854 34.252273 1.030797
10836 chr1 178698151 A G ELF5 -8.0279 DL551 2884 -1.1481 RALGPS2 -8.2002 11.532935 1.030576
113171 chr19 34882472 A G,T ATF7 33.3535 FL313 19270 -3.9604 UBA2 -1.2825 33.612283 1.029834
10656 chr1 178698151 A G ZFX 7.8248 DL551 2884 -1.1481 RALGPS2 -8.2002 11.392494 1.029231
10716 chr1 178698151 A G FLI1 -7.7879 DL551 2884 -1.1481 RALGPS2 -8.2002 11.367181 1.028990
10668 chr1 178698151 A G RFX1 -7.7879 DL551 2884 -1.1481 RALGPS2 -8.2002 11.367181 1.028990
149777 chr3 130614653 C A ZNF350 -7.0335 DL135 28070 14.0897 ATP2C1 -1.0391 15.781936 1.028892
40519 chr11 102217809 C CCT SCRT2 34.0930 FL313 7230 3.1783 BIRC3 0.7117 34.248223 1.028794
135676 chr2 198093455 G A PITX3 8.4365 DL551 22922 0.3094 ANKRD44 -8.1858 11.759149 1.028692
149977 chr3 130614653 C A STAT2 6.6025 DL135 28070 14.0897 ATP2C1 -1.0391 15.594627 1.026432
113186 chr19 34882472 A G,T ATF7 33.3535 DL551 19270 1.1417 KIAA0355 -2.4354 33.461778 1.025602
113169 chr19 34882472 A G,T ATF7 33.3535 FL313 19270 -3.9604 PDCD2L -0.8958 33.599750 1.023734
10848 chr1 178698151 A G EHF -6.8805 DL551 2884 -1.1481 RALGPS2 -8.2002 10.765811 1.023411
149917 chr3 130614653 C A MAF -5.7718 DL135 28070 14.0897 ATP2C1 -1.0391 15.261489 1.022115
40526 chr11 102217809 C CCT SCRT2 34.0930 CB021314 7230 2.4329 TMEM123 0.9745 34.193586 1.021733
... ... ... ... ... ... ... ... ... ... ... ... ... ...
90169 chr17 5390697 A G TGIF1 -0.3025 FL153 15648 -0.0154 DHX33 -0.0331 0.304695 0.009809
146505 chr3 81810702 AGG AGGG SP4 0.0938 DL237 27480 0.1289 GBE1 -0.0165 0.160268 0.009763
149683 chr3 114160530 A G RFX1 0.0000 FL301 27815 -0.0684 ZBTB20 0.0690 0.097157 0.009714
149563 chr3 114160530 A G HIVEP2 0.0000 FL301 27815 -0.0684 ZBTB20 0.0690 0.097157 0.009714
2938 chr1 112051020 A C MYC -0.2573 FL255 1880 -0.0317 RAP1A -0.0440 0.262953 0.009529
146415 chr3 81810702 AGG AGGG ZFX 0.0000 DL237 27480 0.1289 GBE1 -0.0165 0.129952 0.009367
29643 chr11 65407855 A G NR2C2 -0.3052 CB021314 6705 -0.0042 KCNK7 0.0080 0.305334 0.009010
29647 chr11 65407855 A G NR2C2 -0.3052 CB021314 6705 -0.0042 PCNXL3 -0.0067 0.305302 0.008994
87472 chr17 4851072 A G CTCFL 0.2864 FL202 15627 0.0171 GP1BA -0.0243 0.287937 0.008990
18028 chr10 70095443 G A SPDEF 0.1046 FL301 4872 0.1167 PBLD -0.0121 0.157183 0.008955
159102 chr5 150558749 C T,G PBX2 0.1496 DL135 32607 -0.0275 CCDC69 -0.0593 0.163257 0.008681
79595 chr16 81846034 C G POU2F2 0.2234 FL153 15256 -0.0563 PLCG2 0.0318 0.232569 0.008599
163004 chr5 177547247 T C TFCP2 0.2190 FL153 33012 -0.0756 RMND5B 0.0121 0.231997 0.008499
79559 chr16 81846034 C G POU6F2 -0.1916 FL153 15256 -0.0563 PLCG2 0.0318 0.202216 0.007911
6943 chr1 157775634 T C NR1I2 0.2190 FL301 2504 -0.0301 CD5L -0.0246 0.222423 0.007404
2866 chr1 112051020 A C HES5 0.1537 FL255 1880 -0.0317 RAP1A -0.0440 0.162986 0.007361
73259 chr15 79235161 C G FOXA1 0.0468 FL255 13588 0.0001 CTSH 0.0591 0.075386 0.007337
6781 chr1 157775634 T C ZNF350 0.2151 FL301 2504 -0.0301 CD5L -0.0246 0.218584 0.007305
2650 chr1 112051020 A C HEY1 0.1415 FL255 1880 -0.0317 RAP1A -0.0440 0.151536 0.007147
87252 chr17 4851072 A G ZKSCAN1 0.2138 FL202 15627 0.0171 GP1BA -0.0243 0.215855 0.007041
141057 chr22 46677607 T C,G PRDM14 0.1986 FL153 26056 0.0041 PKDREJ 0.0036 0.198675 0.005849
69806 chr15 66081098 T C NR5A2 0.0820 FL174 13285 -0.0083 DENND4A 0.0430 0.092962 0.005799
163130 chr5 177547247 T C TP53 0.0556 FL153 33012 -0.0756 RMND5B 0.0121 0.094621 0.005799
81105 chr16 81849039 C T INSM1 0.0000 FL153 15256 -0.0186 PLCG2 0.0422 0.046117 0.005313
80934 chr16 81849039 C T PLAG1 0.0000 FL153 15256 -0.0186 PLCG2 0.0422 0.046117 0.005313
158506 chr5 142734742 T C NR3C1 -0.1659 FL202 32503 0.0216 NR3C1 -0.0064 0.167423 0.005161
191724 chr9 131444420 T C ZSCAN16 0.0000 FL174 41337 -0.0037 SPTAN1 -0.0418 0.041963 0.005104
88132 chr17 4851072 A G NR2F2 -0.0874 FL202 15627 0.0171 GP1BA -0.0243 0.092313 0.004102
32499 chr11 65407855 A G NR2F2 0.0154 CB021314 6705 -0.0042 KCNK7 0.0080 0.017855 0.001116
32503 chr11 65407855 A G NR2F2 0.0154 CB021314 6705 -0.0042 PCNXL3 -0.0067 0.017312 0.000980

195808 rows × 13 columns


In [55]:
data = [
    plotly.graph_objs.Histogram(
        x=vcf_table.distance_w
    )
]

layout = plotly.graph_objs.Layout(
    title='Scaled Score Distance Histogram',
    xaxis=dict(
        title='Scaled Score',
        ticklen=5,
        zeroline=False,
        gridwidth=2,
    ),
    yaxis={'title':'Number of Samples with Distance Score'}
    )
#plotly.offline.iplot(data)  # basic plot
go_figure = plotly.graph_objs.Figure(data=data, layout=layout)
plotly.offline.iplot(go_figure)



In [65]:
vcf_tabletopx = vcf_table[vcf_table.distance_w > .8]
vcf_tabletopx.size


Out[65]:
8710

In [66]:
trace1 = plotly.graph_objs.Scatter3d(
    name="Distances",
    x=vcf_tabletopx['LOG2_VAR-REF_SCORE'],
    y=vcf_tabletopx.ACT_ZSCORE,
    z=vcf_tabletopx.EXP_ZSCORE,
    mode='markers',
    marker=dict(
        size=4,                 # Using gene expression as color scale values for now.
        color=vcf_tabletopx.EXP_ZSCORE,                # set color to an array/list of desired values
        colorscale='Viridis',   # choose a colorscale
        opacity=0.8
    )
)

data = [trace1]
layout = plotly.graph_objs.Layout(
    margin=dict(
        l=0,
        r=0,
        b=0,
        t=0
    ),
    title='Distances from Average for Individual Variant Events',
    scene=dict(
        xaxis=dict(
            title='log2 Var/Ref Motif Log-Odds Ratio Difference',
            titlefont=dict(
                family='Courier New, monospace',
                size=18,
                color='#7f7f7f'
            )
        ),
        yaxis=dict(
            title='Enhancer Activity z-Score',
            titlefont=dict(
                family='Courier New, monospace',
                size=18,
                color='#7f7f7f'
            )
        ),
        zaxis=dict(
            title='Gene Expression z-Score',
            titlefont=dict(
                family='Courier New, monospace',
                size=18,
                color='#7f7f7f'
            )
        )
    )
)

go_figure = plotly.graph_objs.Figure(data=data, layout=layout)
plotly.offline.iplot(go_figure)



In [ ]: