Introduction

This notebook serves as a tutorial for using `gUtils` functions.

gUtils is an R package developed to extend the functionality of GenomicRanges, the R package which stores genomic data and represents genomic locations in the Bioconductor project

Load the necessary packages and data

Although it is not a required dependency, we recommmend that users download a BSGenome, the structure used to represent full genomes within Bioconductor packages.

Please install with `biocLite()`

### try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("BSgenome.Hsapiens.UCSC.hg19")

See the BSgenome documentation for more details: https://bioconductor.org/packages/release/data/annotation/html/BSgenome.Hsapiens.UCSC.hg19.html

Load gUtils



In [1]:

    
suppressWarnings(suppressMessages(library(gUtils)))

Load `BSgenome.Hsapiens.UCSC.hg19` and set with the variable `DEFAULT_BSGENOME` in your environment:



In [2]:

    
library(BSgenome.Hsapiens.UCSC.hg19)
Sys.setenv(DEFAULT_BSGENOME = 'BSgenome.Hsapiens.UCSC.hg19::Hsapiens')









    



Loading required package: BSgenome
Loading required package: Biostrings
Loading required package: XVector

Attaching package: ‘Biostrings’

The following object is masked from ‘package:base’:

    strsplit

Loading required package: rtracklayer

Create GRanges objects for examples



In [3]:

    
## instantiate an example GRanges objects, gr

gr = GRanges(1, IRanges(c(3, 9, 13), c(9, 13, 20)), strand=c('+','-','-'), name=c("A","B","C"))

print(gr)









    



GRanges object with 3 ranges and 1 metadata column:
      seqnames    ranges strand |        name
         <Rle> <IRanges>  <Rle> | <character>
  [1]        1  [ 3,  9]      + |           A
  [2]        1  [ 9, 13]      - |           B
  [3]        1  [13, 20]      - |           C
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths



In [4]:

    
## instantiate multiple GRanges objects (to be used in examples below)

gr1 = suppressWarnings(GRanges(
    seqnames = Rle(c(1, 2, 1, 3), c(1, 3, 2, 4)),
    ranges = IRanges(101:110, end = 111:120, names = head(letters, 10)),
    strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),
    score = 1:10,
    GC = seq(1, 0, length=10)))

gr2 = GRanges(1, IRanges(c(103, 107, 113), c(105, 109, 116)), 
    strand=c('+','-','-'), 
    name=c("A","B","C"))

print(gr1)


print(gr2)









    



GRanges object with 10 ranges and 2 metadata columns:
    seqnames     ranges strand |     score                GC
       <Rle>  <IRanges>  <Rle> | <integer>         <numeric>
  a        1 [101, 111]      - |         1                 1
  b        2 [102, 112]      + |         2 0.888888888888889
  c        2 [103, 113]      + |         3 0.777777777777778
  d        2 [104, 114]      * |         4 0.666666666666667
  e        1 [105, 115]      * |         5 0.555555555555556
  f        1 [106, 116]      + |         6 0.444444444444444
  g        3 [107, 117]      + |         7 0.333333333333333
  h        3 [108, 118]      + |         8 0.222222222222222
  i        3 [109, 119]      - |         9 0.111111111111111
  j        3 [110, 120]      - |        10                 0
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
GRanges object with 3 ranges and 1 metadata column:
      seqnames     ranges strand |        name
         <Rle>  <IRanges>  <Rle> | <character>
  [1]        1 [103, 105]      + |           A
  [2]        1 [107, 109]      - |           B
  [3]        1 [113, 116]      - |           C
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Functions %-% and %+%



In [5]:

    
## Motivation:
## Adding integer N to a GRanges object with IRanges(start, end) 
## will result in a GRanges with IRanges(start - N, end + N)
gr + 5

## Subtracting integer N to a GRanges object with IRanges(start, end) 
## will result in a GRanges with IRanges(start + N, end - N)
gr - 2

## The functions '%+%' and '%-%' are designed to easily manipulate a GRanges object to add/substract bases 
## to 'start' and 'end'









    





GRanges object with 3 ranges and 1 metadata column:
      seqnames    ranges strand |        name
         <Rle> <IRanges>  <Rle> | <character>
  [1]        1  [-2, 14]      + |           A
  [2]        1  [ 4, 18]      - |           B
  [3]        1  [ 8, 25]      - |           C
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths






    





GRanges object with 3 ranges and 1 metadata column:
      seqnames    ranges strand |        name
         <Rle> <IRanges>  <Rle> | <character>
  [1]        1  [ 5,  7]      + |           A
  [2]        1  [11, 11]      - |           B
  [3]        1  [15, 18]      - |           C
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

%+% : shifts GRanges by N bases to the right



In [6]:

    
## Function: 
## '%+%' is an operator which shifts GRanges by N bases to the right, i.e. IRanges(start + N, end + N)
## Use:
## gr %+% N = original GRanges 'gr' shifted N bases to the right.



In [7]:

    
gr %+% 5









    





GRanges object with 3 ranges and 1 metadata column:
      seqnames    ranges strand |        name
         <Rle> <IRanges>  <Rle> | <character>
  [1]        1  [ 8, 14]      + |           A
  [2]        1  [14, 18]      - |           B
  [3]        1  [18, 25]      - |           C
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

%-% : shifts GRanges by N bases to the left



In [8]:

    
## Function: 
## '%-%' is an operator which shifts GRanges by N bases to the left, , i.e. IRanges(start - N, end - N)
## Use:
## gr %-% N = original GRanges 'gr' shifted N bases to the right.



In [9]:

    
gr %-% 5









    





GRanges object with 3 ranges and 1 metadata column:
      seqnames    ranges strand |        name
         <Rle> <IRanges>  <Rle> | <character>
  [1]        1  [-2,  4]      + |           A
  [2]        1  [ 4,  8]      - |           B
  [3]        1  [ 8, 15]      - |           C
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

%Q% : subsets the GRanges based on conditionals



In [10]:

    
## Function: 
## '%Q%' subsets the GRanges 'gr' based on conditionals of the metadata columns in 'gr'
## Use:
## gr %Q% (gene_names == 'KRAS')
## gr %Q% ((mapq >= 40) && (GC < 0.5))
## gr %Q% (width < 100)



In [11]:

    
gr %Q% (name != 'B')









    





GRanges object with 2 ranges and 1 metadata column:
      seqnames    ranges strand |        name
         <Rle> <IRanges>  <Rle> | <character>
  [1]        1  [ 3,  9]      + |           A
  [2]        1  [13, 20]      - |           C
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

%*% : Performs "natural join" or merge of metadata columns of two GRanges



In [12]:

    
## Function: 
## '%*%' Performs "natural join" or merge of metadata columns of grA and grB 
## using interval overlap as a "primary key", outputs a new GRanges whose maximum length is length(grA)*length(grB). 
## (See please see documentation for function 'gr.findoverlaps()' for more complex queries)
## Use:
## grA %*% grB   # strand-agnostic merging
## grA %**% grB  # strand-specific merging



In [13]:

    
## grA %*% grB  # strand-agnostic merging

gr1 %*% gr2   ## gr2 %*% gr1 will have metadata columns in a different order









    



Warning message:
“"ranges" method for Hits objects is deprecated. Please use
  overlapsRanges() instead.”





    





GRanges object with 7 ranges and 5 metadata columns:
      seqnames     ranges strand |  query.id subject.id     score        GC
         <Rle>  <IRanges>  <Rle> | <integer>  <integer> <integer> <numeric>
  [1]        1 [103, 105]      * |         1          1         1 1.0000000
  [2]        1 [105, 105]      * |         5          1         5 0.5555556
  [3]        1 [107, 109]      * |         1          2         1 1.0000000
  [4]        1 [107, 109]      * |         5          2         5 0.5555556
  [5]        1 [107, 109]      * |         6          2         6 0.4444444
  [6]        1 [113, 115]      * |         5          3         5 0.5555556
  [7]        1 [113, 116]      * |         6          3         6 0.4444444
             name
      <character>
  [1]           A
  [2]           A
  [3]           B
  [4]           B
  [5]           B
  [6]           C
  [7]           C
  -------
  seqinfo: 25 sequences from an unspecified genome



In [14]:

    
## grA %**% grB # strand-specific merging

gr1 %**% gr2









    



Warning message:
“"ranges" method for Hits objects is deprecated. Please use
  overlapsRanges() instead.”





    





GRanges object with 4 ranges and 5 metadata columns:
      seqnames     ranges strand |  query.id subject.id     score        GC
         <Rle>  <IRanges>  <Rle> | <integer>  <integer> <integer> <numeric>
  [1]        1 [107, 109]      - |         1          2         1 1.0000000
  [2]        1 [105, 105]      * |         5          1         5 0.5555556
  [3]        1 [107, 109]      * |         5          2         5 0.5555556
  [4]        1 [113, 115]      * |         5          3         5 0.5555556
             name
      <character>
  [1]           B
  [2]           A
  [3]           B
  [4]           C
  -------
  seqinfo: 25 sequences from an unspecified genome

%\$% : aggreagates metadata across GRanges

grA %\$% grB aggregates the metadata in grB across the territory of each range in grA



In [15]:

    
## Function: 
## '%$%' aggregates the metadata in grB across the territory of each range in grA  
## Aggregates the metadata in grB across the territory of each range in grA. 
## This returns grA appended with additional metadata columns of grA with values aggregated over where grA and grB overlap. 
## For character or factor-valued metadata columns of grB, aggregation will return a comma collapsed character value of all grB values that overlap grA[i].
## For numeric columns of grB it will return the width-weighted mean value of that column across the grA[i] and grB overlap. 
## Use:
## grA %$% grB   # strand-agnostic aggregation
## grA %$$% grB  # strand-specific aggregation



In [16]:

    
## grA %$% grB  # strand-agnostic aggregation

gr1 %$% gr2









    



Warning message:
“"ranges" method for Hits objects is deprecated. Please use
  overlapsRanges() instead.”





    





GRanges object with 10 ranges and 3 metadata columns:
    seqnames     ranges strand |     score        GC        name
       <Rle>  <IRanges>  <Rle> | <integer> <numeric> <character>
  a        1 [101, 111]      - |         1 1.0000000        A, B
  b        2 [102, 112]      + |         2 0.8888889            
  c        2 [103, 113]      + |         3 0.7777778            
  d        2 [104, 114]      * |         4 0.6666667            
  e        1 [105, 115]      * |         5 0.5555556     A, B, C
  f        1 [106, 116]      + |         6 0.4444444        B, C
  g        3 [107, 117]      + |         7 0.3333333            
  h        3 [108, 118]      + |         8 0.2222222            
  i        3 [109, 119]      - |         9 0.1111111            
  j        3 [110, 120]      - |        10 0.0000000            
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths



In [17]:

    
gr2 %$% gr1









    



Warning message:
“"ranges" method for Hits objects is deprecated. Please use
  overlapsRanges() instead.”





    





GRanges object with 3 ranges and 3 metadata columns:
      seqnames     ranges strand |        name     score        GC
         <Rle>  <IRanges>  <Rle> | <character> <numeric> <numeric>
  [1]        1 [103, 105]      + |           A  2.000000 0.8888889
  [2]        1 [107, 109]      - |           B  4.000000 0.6666667
  [3]        1 [113, 116]      - |           C  5.571429 0.4920635
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths



In [18]:

    
## grA %$$% grB  # strand-specific aggregation

gr1 %$$% gr2









    



Warning message:
“"ranges" method for Hits objects is deprecated. Please use
  overlapsRanges() instead.”





    





GRanges object with 10 ranges and 3 metadata columns:
    seqnames     ranges strand |     score        GC        name
       <Rle>  <IRanges>  <Rle> | <integer> <numeric> <character>
  a        1 [101, 111]      - |         1 1.0000000           B
  b        2 [102, 112]      + |         2 0.8888889            
  c        2 [103, 113]      + |         3 0.7777778            
  d        2 [104, 114]      * |         4 0.6666667            
  e        1 [105, 115]      * |         5 0.5555556     A, B, C
  f        1 [106, 116]      + |         6 0.4444444            
  g        3 [107, 117]      + |         7 0.3333333            
  h        3 [108, 118]      + |         8 0.2222222            
  i        3 [109, 119]      - |         9 0.1111111            
  j        3 [110, 120]      - |        10 0.0000000            
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths



In [19]:

    
gr2 %$$% gr1









    



Warning message:
“"ranges" method for Hits objects is deprecated. Please use
  overlapsRanges() instead.”





    





GRanges object with 3 ranges and 3 metadata columns:
      seqnames     ranges strand |        name     score        GC
         <Rle>  <IRanges>  <Rle> | <character> <numeric> <numeric>
  [1]        1 [103, 105]      + |           A         5 0.5555556
  [2]        1 [107, 109]      - |           B         3 0.7777778
  [3]        1 [113, 116]      - |           C         5 0.5555556
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

%&% : returns subset of overlapping GRanges

grA %&% grB returns the subset of ranges in grA that overlap with at least one range in grB



In [20]:

    
## Function: 
## '%&%' return the subset of ranges in grA that overlap with at least one range in grB.
## Use:
## grA %&% grB   # strand-agnostic 
## grA %&&% grB  # strand-specific



In [21]:

    
## grA %&% grB  # strand-agnostic 

gr1 %&% gr2









    



Warning message:
“"ranges" method for Hits objects is deprecated. Please use
  overlapsRanges() instead.”





    





GRanges object with 3 ranges and 2 metadata columns:
    seqnames     ranges strand |     score                GC
       <Rle>  <IRanges>  <Rle> | <integer>         <numeric>
  a        1 [101, 111]      - |         1                 1
  e        1 [105, 115]      * |         5 0.555555555555556
  f        1 [106, 116]      + |         6 0.444444444444444
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths



In [22]:

    
gr2 %&% gr1









    



Warning message:
“"ranges" method for Hits objects is deprecated. Please use
  overlapsRanges() instead.”





    





GRanges object with 3 ranges and 1 metadata column:
      seqnames     ranges strand |        name
         <Rle>  <IRanges>  <Rle> | <character>
  [1]        1 [103, 105]      + |           A
  [2]        1 [107, 109]      - |           B
  [3]        1 [113, 116]      - |           C
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths



In [23]:

    
## grA %&&% grB  # strand-specific 

gr1 %&&% gr2









    



Warning message:
“"ranges" method for Hits objects is deprecated. Please use
  overlapsRanges() instead.”





    





GRanges object with 2 ranges and 2 metadata columns:
    seqnames     ranges strand |     score                GC
       <Rle>  <IRanges>  <Rle> | <integer>         <numeric>
  a        1 [101, 111]      - |         1                 1
  e        1 [105, 115]      * |         5 0.555555555555556
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths



In [24]:

    
gr2 %&&% gr1









    



Warning message:
“"ranges" method for Hits objects is deprecated. Please use
  overlapsRanges() instead.”





    





GRanges object with 3 ranges and 1 metadata column:
      seqnames     ranges strand |        name
         <Rle>  <IRanges>  <Rle> | <character>
  [1]        1 [103, 105]      + |           A
  [2]        1 [107, 109]      - |           B
  [3]        1 [113, 116]      - |           C
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

%O% : returns number of bases in overlapping GRanges

grA %O% grB returns a length(grA) numeric vector whose item i is the fraction of the width of grA[i] which overlaps at least one range in grB.



In [25]:

    
## Function: 
## '%O%' 
## Returns a length(grA) numeric vector whose item i is the *fraction of the width* of grA[i] 
## which overlaps at least one range in grB.
## Use:
## grA %O% grB   # strand-agnostic
## grA %OO% grB  # strand-specific



In [26]:

    
## grA %O% grB   # strand-agnostic

print('gr1 %O% gr2 outputs')
print(gr1 %O% gr2)
print('')
print('gr2 %O% gr1 outputs')
print(gr2 %O% gr1)









    



[1] "gr1 %O% gr2 outputs"






    



Warning message:
“"ranges" method for Hits objects is deprecated. Please use
  overlapsRanges() instead.”





    



 [1] 0.5454545 0.0000000 0.0000000 0.0000000 0.6363636 0.6363636 0.0000000
 [8] 0.0000000 0.0000000 0.0000000
[1] ""
[1] "gr2 %O% gr1 outputs"






    



Warning message:
“"ranges" method for Hits objects is deprecated. Please use
  overlapsRanges() instead.”





    



[1] 1.333333 3.000000 1.750000



In [27]:

    
## grA %OO% grB   # strand-specific

print('gr1 %OO% gr2 outputs')
print(gr1 %OO% gr2)
print('')
print('gr2 %OO% gr1 outputs')
print(gr2 %OO% gr1)









    



[1] "gr1 %OO% gr2 outputs"






    



Warning message:
“"ranges" method for Hits objects is deprecated. Please use
  overlapsRanges() instead.”





    



 [1] 0.2727273 0.0000000 0.0000000 0.0000000 0.6363636 0.0000000 0.0000000
 [8] 0.0000000 0.0000000 0.0000000
[1] ""
[1] "gr2 %OO% gr1 outputs"






    



Warning message:
“"ranges" method for Hits objects is deprecated. Please use
  overlapsRanges() instead.”





    



[1] 0.3333333 2.0000000 0.7500000

%o% : returns width fraction in overlapping GRanges

grA %o% grB returns a length(grA) numeric vector whose item i is the number of bases in grA[i]



In [28]:

    
# %o%
## Function: 
## '%o%' 
## Returns a length(grA) numeric vector whose item i is the *number of bases* in grA[i] 
## which overlaps at least one range in grB. 
## Use:
## grA %o% grB   # strand-agnostic
## grA %oo% grB  # strand-specific



In [29]:

    
## grA %o% grB   # strand-agnostic

print('gr1 %o% gr2 outputs')
print(gr1 %o% gr2)
print('')
print('gr2 %o% gr1 outputs')
print(gr2 %o% gr1)









    



[1] "gr1 %o% gr2 outputs"






    



Warning message:
“"ranges" method for Hits objects is deprecated. Please use
  overlapsRanges() instead.”





    



 [1] 6 0 0 0 7 7 0 0 0 0
[1] ""
[1] "gr2 %o% gr1 outputs"






    



Warning message:
“"ranges" method for Hits objects is deprecated. Please use
  overlapsRanges() instead.”





    



[1] 4 9 7



In [30]:

    
## grA %oo% grB  # strand-specific

print('gr1 %oo% gr2 outputs')
print(gr1 %oo% gr2)
print('')
print('gr2 %oo% gr1 outputs')
print(gr2 %oo% gr1)









    



[1] "gr1 %oo% gr2 outputs"






    



Warning message:
“"ranges" method for Hits objects is deprecated. Please use
  overlapsRanges() instead.”





    



 [1] 3 0 0 0 7 0 0 0 0 0
[1] ""
[1] "gr2 %oo% gr1 outputs"






    



Warning message:
“"ranges" method for Hits objects is deprecated. Please use
  overlapsRanges() instead.”





    



[1] 1 6 3

%N% : returns total number of ranges in overlapping GRanges

grA %o% grB returns a length(grA) numeric vector whose item i is the total number of ranges in grB that overlap with grA[i]



In [31]:

    
## Function: 
## '%N%' 
## Returns a length(grA) numeric vector whose item i is the total number of ranges in grB that overlap with grA[i]
## Use:
## grA %N%  grB   # strand-agnostic
## grA %NN% grB  # strand-specific



In [32]:

    
## grA %N% grB   # strand-agnostic

print('gr1 %N% gr2 outputs')
print(gr1 %N% gr2)
print('')
print('gr2 %N% gr1 outputs')
print(gr2 %N% gr1)









    



[1] "gr1 %N% gr2 outputs"






    



Warning message:
“"ranges" method for Hits objects is deprecated. Please use
  overlapsRanges() instead.”





    



 [1] 2 0 0 0 3 2 0 0 0 0
[1] ""
[1] "gr2 %N% gr1 outputs"






    



Warning message:
“"ranges" method for Hits objects is deprecated. Please use
  overlapsRanges() instead.”





    



[1] 2 3 2



In [33]:

    
## grA %NN% grB   # strand-agnostic

print('gr1 %NN% gr2 outputs')
print(gr1 %NN% gr2)
print('')
print('gr2 %N% gr1 outputs')
print(gr2 %NN% gr1)









    



[1] "gr1 %NN% gr2 outputs"






    



Warning message:
“"ranges" method for Hits objects is deprecated. Please use
  overlapsRanges() instead.”





    



 [1] 1 0 0 0 3 0 0 0 0 0
[1] ""
[1] "gr2 %N% gr1 outputs"






    



Warning message:
“"ranges" method for Hits objects is deprecated. Please use
  overlapsRanges() instead.”





    



[1] 1 2 1

%^% : returns boolean vector TRUE in overlapping GRanges

grA %^% grB returns a length(grA) logical vector whose item i TRUE if the grA[i] overlap at least on range in grB



In [34]:

    
## Function: 
## '%^%' 
## Returns a length(grA) logical vector whose item i TRUE if the grA[i] overlap at least on range in grB 
## Use:
## grA %^%  grB  # strand-agnostic
## grA %^^% grB  # strand-specific



In [35]:

    
## grA %^%  grB  # strand-agnostic

print('gr1 %^% gr2 outputs')
print(gr1 %^% gr2)
print('')
print('gr2 %^% gr1 outputs')
print(gr2 %^% gr1)









    



[1] "gr1 %^% gr2 outputs"






    



Warning message:
“"ranges" method for Hits objects is deprecated. Please use
  overlapsRanges() instead.”





    



 [1]  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE
[1] ""
[1] "gr2 %^% gr1 outputs"






    



Warning message:
“"ranges" method for Hits objects is deprecated. Please use
  overlapsRanges() instead.”





    



[1] TRUE TRUE TRUE



In [36]:

    
## grA %^^% grB  # strand-specific

print('gr1 %^^% gr2 outputs')
print(gr1 %^^% gr2)
print('')
print('gr2 %^^% gr1 outputs')
print(gr2 %^^% gr1)









    



[1] "gr1 %^^% gr2 outputs"






    



Warning message:
“"ranges" method for Hits objects is deprecated. Please use
  overlapsRanges() instead.”





    



 [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
[1] ""
[1] "gr2 %^^% gr1 outputs"






    



Warning message:
“"ranges" method for Hits objects is deprecated. Please use
  overlapsRanges() instead.”





    



[1] TRUE TRUE TRUE

%_% : returns GRanges representing setdiff in input GRanges



In [37]:

    
## Function: 
## '%_%' 
## Similar to the function BiocGenerics::setdiff(), but strand-agnostic (cf. https://github.com/Bioconductor/BiocGenerics)
## Use:
## grA %_% grB



In [38]:

    
print('gr1 %_% gr2 outputs')
print(suppressWarnings(gr1 %_% gr2))
print('')
print('gr2 %_% gr1 outputs')
print(suppressWarnings(gr2 %_% gr1))









    



[1] "gr1 %_% gr2 outputs"
GRanges object with 5 ranges and 0 metadata columns:
      seqnames     ranges strand
         <Rle>  <IRanges>  <Rle>
  [1]        1 [101, 102]      *
  [2]        1 [106, 106]      *
  [3]        1 [110, 112]      *
  [4]        2 [102, 114]      *
  [5]        3 [107, 120]      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
[1] ""
[1] "gr2 %_% gr1 outputs"
GRanges object with 0 ranges and 0 metadata columns:
   seqnames    ranges strand
      <Rle> <IRanges>  <Rle>
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

gUtils internal data and functions to load internal data

gUtils contains several internal data variable which load

hg_seqlengths()

returns standard hg19 seqlengths as an integer vector



In [39]:

    
# hg_seqlengths()
## Function: 'hg_seqlengths()'
## Details:
## Outputs standard hg19 seqlengths as an integer vector



In [40]:

    
hg_seqlengths()[1:6]

example_genes

returns GRanges of RefSeq genes with exon count and names, using hg19



In [41]:

    
# example_genes
## Data: 'example_genes', GRanges
## Details:
## GRanges of RefSeq genes with exon count and names, using hg19



In [42]:

    
head(example_genes)

length(example_genes)









    





GRanges object with 6 ranges and 2 metadata columns:
      seqnames           ranges strand |        name exonCount
         <Rle>        <IRanges>  <Rle> | <character> <integer>
  [1]        1 [ 69090,  70008]      + |       OR4F5         1
  [2]        1 [367658, 368597]      + |      OR4F16         1
  [3]        1 [367658, 368597]      + |       OR4F3         1
  [4]        1 [367658, 368597]      + |      OR4F29         1
  [5]        1 [861321, 879533]      + |      SAMD11        14
  [6]        1 [896073, 900571]      + |      KLHL17        12
  -------
  seqinfo: 25 sequences from hg19 genome






    




18812

example_dnase

returns DNAaseI hypersensitivity sites from UCSC Table Browser hg19, subsampled to 10,000 sites



In [43]:

    
# example_dnase
## Data: 'example_dnase', GRanges
## Details:
## DNAaseI hypersensitivity sites from UCSC Table Browser hg19,
## subsampled to 10,000 sites



In [44]:

    
head(example_dnase)

length(example_dnase)









    





GRanges object with 6 ranges and 2 metadata columns:
      seqnames             ranges strand | signalValue    pValue
         <Rle>          <IRanges>  <Rle> |   <numeric> <numeric>
  [1]        1 [ 567461,  567893]      * |     50.4687   282.177
  [2]        1 [ 848076,  848474]      * |     9.15261    12.211
  [3]        1 [ 860836,  861798]      * |     23.4447   53.1665
  [4]        1 [1172453, 1172732]      * |     2.74806   2.10225
  [5]        1 [1210156, 1210484]      * |     2.20956   1.60014
  [6]        1 [1221000, 1221366]      * |     5.70984   5.91024
  -------
  seqinfo: 25 sequences from hg19 genome






    




10000

grl1

returns SV rearrangements, first dataset



In [45]:

    
# grl1
## Data: 'grl1', GRangesList
## Details:
## simulated SV rearrangements, first dataset



In [46]:

    
head(grl1, 3)









    





GRangesList object of length 3:
$10 
GRanges object with 2 ranges and 1 metadata column:
      seqnames                 ranges strand |       bin
         <Rle>              <IRanges>  <Rle> | <integer>
  [1]        5 [146981323, 146981323]      - |        10
  [2]       14 [104605059, 104605059]      + |        10

$26 
GRanges object with 2 ranges and 1 metadata column:
      seqnames                 ranges strand | bin
  [1]        9 [140100229, 140100229]      - |  26
  [2]       19 [ 24309057,  24309057]      - |  26

$49 
GRanges object with 2 ranges and 1 metadata column:
      seqnames                 ranges strand | bin
  [1]        5 [132216750, 132216750]      - |  49
  [2]       17 [  4803057,   4803057]      + |  49

-------
seqinfo: 25 sequences from an unspecified genome

grl2

returns SV rearrangements, second dataset



In [47]:

    
# grl2
## Data: 'grl2', GRangesList
## Details:
## simulated SV rearrangements, second dataset



In [48]:

    
head(grl2, 3)









    





GRangesList object of length 3:
$4678 
GRanges object with 2 ranges and 1 metadata column:
      seqnames                 ranges strand |       bin
         <Rle>              <IRanges>  <Rle> | <integer>
  [1]        8 [110374809, 110374809]      - |      4678
  [2]       17 [ 72874459,  72874459]      + |      4678

$4713 
GRanges object with 2 ranges and 1 metadata column:
      seqnames                 ranges strand |  bin
  [1]        7 [116166641, 116166641]      + | 4713
  [2]       22 [ 21738147,  21738147]      - | 4713

$4723 
GRanges object with 2 ranges and 1 metadata column:
      seqnames                 ranges strand |  bin
  [1]        6 [ 32822006,  32822006]      + | 4723
  [2]        9 [136529010, 136529010]      - | 4723

-------
seqinfo: 25 sequences from an unspecified genome

si

returns Seqinfo object for hg19



In [49]:

    
# si
## Data: si, Seqinfo object
## Details:
## Seqinfo object for hg19

si









    





Seqinfo object with 25 sequences from hg19 genome:
  seqnames seqlengths isCircular genome
  1         249250621      FALSE   hg19
  2         243199373      FALSE   hg19
  3         198022430      FALSE   hg19
  4         191154276      FALSE   hg19
  5         180915260      FALSE   hg19
  ...             ...        ...    ...
  21         48129895      FALSE   hg19
  22         51304566      FALSE   hg19
  X         155270560      FALSE   hg19
  Y          59373566      FALSE   hg19
  M             16571      FALSE   hg19

grl.hiC

HiC data for chr14 from Lieberman-Aiden 2009 (in hg19), subsampled to 10,000 interactions



In [50]:

    
# grl.hiC
## Data: 'grl.hiC', GRangesList
## Details:
## HiC data for chr14 from Lieberman-Aiden 2009 (in hg19), subsampled
## to 10,000 interactions

head(grl.hiC)









    





GRangesList object of length 6:
$462760 
GRanges object with 2 ranges and 1 metadata column:
      seqnames               ranges strand |        id
         <Rle>            <IRanges>  <Rle> | <integer>
  [1]       14 [72176266, 72176266]      - |    462760
  [2]       14 [72422406, 72422406]      - |    462760

$40284 
GRanges object with 2 ranges and 1 metadata column:
      seqnames               ranges strand |    id
  [1]       14 [74904243, 74904243]      + | 40284
  [2]       14 [74905303, 74905303]      - | 40284

$240518 
GRanges object with 2 ranges and 1 metadata column:
      seqnames               ranges strand |     id
  [1]       14 [23569934, 23569934]      + | 240518
  [2]       14 [96453342, 96453342]      + | 240518

...
<3 more elements>
-------
seqinfo: 25 sequences from an unspecified genome



In [ ]:



In [ ]:

Introduction

This notebook serves as a tutorial for using gUtils functions.

gUtils is an R package developed to extend the functionality of GenomicRanges, the R package which stores genomic data and represents genomic locations in the Bioconductor project

Load the necessary packages and data

Although it is not a required dependency, we recommmend that users download a BSGenome, the structure used to represent full genomes within Bioconductor packages.

Please install with biocLite()

See the BSgenome documentation for more details: https://bioconductor.org/packages/release/data/annotation/html/BSgenome.Hsapiens.UCSC.hg19.html

Load gUtils

Load BSgenome.Hsapiens.UCSC.hg19 and set with the variable DEFAULT_BSGENOME in your environment:

Create GRanges objects for examples

Functions %-% and %+%

%+% : shifts GRanges by N bases to the right

%-% : shifts GRanges by N bases to the left

%Q% : subsets the GRanges based on conditionals

%*% : Performs "natural join" or merge of metadata columns of two GRanges

%\$% : aggreagates metadata across GRanges

grA %\$% grB aggregates the metadata in grB across the territory of each range in grA

%&% : returns subset of overlapping GRanges

grA %&% grB returns the subset of ranges in grA that overlap with at least one range in grB

%O% : returns number of bases in overlapping GRanges

grA %O% grB returns a length(grA) numeric vector whose item i is the fraction of the width of grA[i] which overlaps at least one range in grB.

%o% : returns width fraction in overlapping GRanges

grA %o% grB returns a length(grA) numeric vector whose item i is the number of bases in grA[i]

%N% : returns total number of ranges in overlapping GRanges

grA %o% grB returns a length(grA) numeric vector whose item i is the total number of ranges in grB that overlap with grA[i]

%^% : returns boolean vector TRUE in overlapping GRanges

grA %^% grB returns a length(grA) logical vector whose item i TRUE if the grA[i] overlap at least on range in grB

%_% : returns GRanges representing setdiff in input GRanges

gUtils internal data and functions to load internal data

gUtils contains several internal data variable which load

hg_seqlengths()

returns standard hg19 seqlengths as an integer vector

example_genes

returns GRanges of RefSeq genes with exon count and names, using hg19

example_dnase

returns DNAaseI hypersensitivity sites from UCSC Table Browser hg19, subsampled to 10,000 sites

grl1

returns SV rearrangements, first dataset

grl2

returns SV rearrangements, second dataset

si

returns Seqinfo object for hg19

grl.hiC

HiC data for chr14 from Lieberman-Aiden 2009 (in hg19), subsampled to 10,000 interactions

This notebook serves as a tutorial for using `gUtils` functions.

Please install with `biocLite()`

Load `BSgenome.Hsapiens.UCSC.hg19` and set with the variable `DEFAULT_BSGENOME` in your environment: