Julia tutorial with SeqArray files



In [1]:

    
# load the JSeqArray package
using JSeqArray

Open an existing SeqArray file and display its structure. The dimensions of sample.id and variant.id tell you the total numbers of samples and variants, i.e., 1092 samples and 19,773 variants.



In [2]:

    
# get the file name of example data
fn = seqExample(:kg)
f = seqOpen(fn)









    Out[2]:





SeqArray File: /home/juser/.julia/v0.5/JSeqArray/demo/data/1KG_phase1_release_v3_chr22.gds (0B)
+    [  ] *
|--+ description   [  ] *
|--+ sample.id   { Str8 1092 LZMA_ra(10.5%), 914B } *
|--+ variant.id   { Int32 19773 LZMA_ra(8.39%), 6.6K } *
|--+ position   { Int32 19773 LZMA_ra(52.0%), 41.1K } *
|--+ chromosome   { Str8 19773 LZMA_ra(0.28%), 166B } *
|--+ allele   { Str8 19773 LZMA_ra(22.7%), 111.9K } *
|--+ genotype   [  ] *
|  |--+ data   { Bit2 2x1092x19773 LZMA_ra(8.17%), 882.5K } *
|  |--+ extra.index   { Int32 3x0 LZMA_ra, 19B } *
|  \--+ extra   { Int16 0 LZMA_ra, 19B }
|--+ phase   [  ]
|  |--+ data   { Bit1 1092x19773 LZMA_ra(0.02%), 550B } *
|  |--+ extra.index   { Int32 3x0 LZMA_ra, 19B } *
|  \--+ extra   { Bit1 0 LZMA_ra, 19B }
|--+ annotation   [  ]
|  |--+ id   { Str8 19773 LZMA_ra(35.2%), 77.0K } *
|  |--+ qual   { Float32 19773 LZMA_ra(3.62%), 2.9K } *
|  |--+ filter   { Int32,factor 19773 LZMA_ra(0.21%), 170B } *
|  |--+ info   [  ]
|  \--+ format   [  ]
\--+ sample.annotation   [  ]
   |--+ Family.ID   { Str8 1092 LZMA_ra(15.3%), 1.1K }
   |--+ Population   { Str8 1092 LZMA_ra(5.08%), 222B }
   |--+ Gender   { Str8 1092 LZMA_ra(5.85%), 386B }
   \--+ Ancestry   { Str8 1092 LZMA_ra(2.43%), 233B }

Genotypic data and annotations are stored in an array-oriented manner, providing efficient data access using the Julia programming language. seqSetFilter() and seqGetData() can be used together to retrieve data for a selected set of samples from a defined genomic region. seqApply() applies a user-defined function to array margins of genotypes and annotations.



In [3]:

    
sampid = seqGetData(f, "sample.id")  # a list of sample IDs









    Out[3]:





1092-element Array{String,1}:
 "HG00096"
 "HG00097"
 "HG00099"
 "HG00100"
 "HG00101"
 "HG00102"
 "HG00103"
 "HG00104"
 "HG00106"
 "HG00108"
 "HG00109"
 "HG00110"
 "HG00111"
 ⋮        
 "NA20809"
 "NA20810"
 "NA20811"
 "NA20812"
 "NA20813"
 "NA20814"
 "NA20815"
 "NA20816"
 "NA20818"
 "NA20819"
 "NA20826"
 "NA20828"



In [4]:

    
varid = seqGetData(f, "variant.id")  # a list of variant IDs









    Out[4]:





19773-element Array{Int32,1}:
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
     ⋮
 19762
 19763
 19764
 19765
 19766
 19767
 19768
 19769
 19770
 19771
 19772
 19773



In [5]:

    
seqGetData(f, "annotation/qual")









    Out[5]:





19773-element Array{Float32,1}:
 100.0
 100.0
 100.0
 100.0
 100.0
 100.0
 100.0
 247.0
  49.0
 100.0
 100.0
 100.0
 100.0
   ⋮  
 280.0
 100.0
 100.0
 100.0
 100.0
 100.0
 100.0
 100.0
 100.0
 100.0
 100.0
 100.0



In [6]:

    
seqGetData(f, "annotation/filter")









    Out[6]:





19773-element Array{String,1}:
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 ⋮     
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"

Define a subset of samples and variants



In [7]:

    
seqFilterSet(f, sample_id=sampid[1:4], variant_id=varid[1:6])









    



# of selected samples: 4
# of selected variants: 6

Get data from the subset of variant data.



In [8]:

    
seqGetData(f, "chromosome")









    Out[8]:





6-element Array{String,1}:
 "22"
 "22"
 "22"
 "22"
 "22"
 "22"



In [9]:

    
seqGetData(f, "allele")









    Out[9]:





6-element Array{String,1}:
 "A,G"
 "G,A"
 "G,T"
 "C,T"
 "G,A"
 "G,A"



In [10]:

    
# 0: the reference allele, 1: the first alternative allele, 0xFF: missing value
geno = seqGetData(f, "genotype")









    Out[10]:





2×4×6 Array{UInt8,3}:
[:, :, 1] =
 0x00  0x01  0x01  0x00
 0x00  0x00  0x00  0x01

[:, :, 2] =
 0x00  0x00  0x00  0x00
 0x00  0x01  0x01  0x00

[:, :, 3] =
 0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00

[:, :, 4] =
 0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00

[:, :, 5] =
 0x00  0x01  0x00  0x00
 0x00  0x00  0x01  0x00

[:, :, 6] =
 0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00



In [11]:

    
dosage = seqGetData(f, "#dosage")  # 0xFF: missing value









    Out[11]:





4×6 Array{UInt8,2}:
 0x02  0x02  0x02  0x02  0x02  0x02
 0x01  0x01  0x02  0x02  0x01  0x02
 0x01  0x01  0x02  0x02  0x01  0x02
 0x01  0x02  0x02  0x02  0x02  0x02

Calculation of Reference Allele Frequencies



In [12]:

    
seqFilterReset(f)  # reset the filter









    



# of selected samples: 1,092
# of selected variants: 19,773



In [13]:

    
af = seqApply(f, "genotype", asis=:unlist) do geno::Array{UInt8,3}
    N = size(geno, 3)
    rv = Vector{Float64}(N)
    for k in 1:N
        sum = 0; n = 0
        for g in geno[:,:,k]
            if g != 0xFF
                sum += g == 0  # 0 is the reference allele
                n += 1
            end
        end
        rv[k] = sum / n
    end
    return rv
end









    



[==================================================] 100%, completed in 0s






    Out[13]:





19773-element Array{Float64,1}:
 0.695055   
 0.943223   
 0.999542   
 0.999542   
 0.938645   
 0.999084   
 0.000457875
 0.988095   
 0.969322   
 0.989011   
 0.990385   
 0.947802   
 0.998168   
 ⋮          
 0.91163    
 0.999084   
 0.999542   
 0.992216   
 0.981685   
 0.981685   
 0.952839   
 0.994048   
 0.999542   
 0.996795   
 0.658425   
 0.913462

Principal Component Analysis



In [14]:

    
seqFilterReset(f)  # reset the filter









    



# of selected samples: 1,092
# of selected variants: 19,773



In [15]:

    
# initialize the covariance matrix
ss = Any[0.0]

seqApply(f, "#dosage", ss) do geno::Matrix{UInt8}, ss::Vector{Any}
    # calculate allele frequencies
    N = size(geno, 2); af = Vector{Float64}(N)
    for i in 1:N
        g = geno[:,i]; af[i] = mean(g[g .!= 0xFF])
    end
    af *= 0.5
    # normalized by allele frequencies
    g = Matrix{Float64}(size(geno))
    for i in 1:N
        g[:,i] = (geno[:,i] - 2*af[i]) / sqrt(af[i] * (1 - af[i]))
    end
    # correct missing genotypes
    g[isnan(g)] = 0.0; g[geno .== 0xFF] = 0.0
    # update the cov matrix
    ss[1] += g * g'
end

# scaled
ss[1] *= size(ss[1], 1) / sum(diag(ss[1]))
ss[1]









    



[==================================================] 100%, completed in 0s






    Out[15]:





1092×1092 Array{Float64,2}:
  0.7017      0.00830316   0.0511373   …  0.0197863   0.0513473   0.012317  
  0.00830316  0.622809     0.033266       0.0271662   0.0278968   0.0235471 
  0.0511373   0.033266     0.773402       0.0101346   0.0304234   0.0244011 
  0.0330731   0.0481347    0.0231505      0.0238398   0.0237653   0.0206459 
  0.0158933   0.0377345    0.0336195      0.00450558  0.0250328   0.022167  
  0.043194    0.0221012    0.0484443   …  0.0144089   0.0330417   0.0256502 
  0.0466153   0.0219024    0.030456       0.0242625   0.00740767  0.0242048 
  0.0108151   0.0258773    0.035232       0.0158535   0.0290021   0.0233435 
  0.0331725   0.0301866    0.0309742      0.0184776   0.0142575   0.0302456 
  0.0272684   0.00494007   0.00776384     0.0196036   0.0113146   0.0267043 
  0.0339346   0.020939     0.0347542   …  0.0162531   0.0106022   0.028762  
  0.0248709   0.0224371    0.0384413      0.0232937   0.0151097   0.0117061 
  0.0265906   0.0344899    0.0139562      0.017142    0.0255764   0.0154684 
  ⋮                                    ⋱              ⋮                     
  0.0189681   0.0193366    0.00727019  …  0.014465    0.0263022   0.020402  
  0.0190682   0.000678911  0.0322396      0.0236176   0.0233977   0.0201988 
 -0.00441231  0.0105124    0.0108247      0.0312933   0.0182148   0.00521477
  0.0373295   0.0112624    0.0174922      0.0209972   0.0507611   0.0142456 
  0.00217496  0.0315914    0.0222844      0.0143468   0.0281275   0.00932025
  0.0313247   0.0282472    0.00793672  …  0.0130704   0.0200056   0.0152904 
  0.0418121   0.0261307    0.0114143      0.0068133   0.0244468   0.0211374 
  0.0249877   0.0205697    0.0247884      0.0248403   0.0171583   0.0355916 
  0.0419023   0.0293111    0.0112945      0.0293121   0.0461871   0.0148807 
  0.0197863   0.0271662    0.0101346      0.922018    0.0224457   0.0109136 
  0.0513473   0.0278968    0.0304234   …  0.0224457   0.631799    0.00977053
  0.012317    0.0235471    0.0244011      0.0109136   0.00977053  0.979549



In [16]:

    
# eigen-decomposition
(w, v) = eig(-ss[1])

-w  # eigenvalues









    Out[16]:





1092-element Array{Float64,1}:
 39.6052     
 16.5196     
  5.18983    
  4.78401    
  4.59256    
  4.48968    
  4.07638    
  4.00637    
  3.89435    
  3.6689     
  3.55935    
  3.37597    
  3.2553     
  ⋮          
  0.240203   
  0.236452   
  0.23431    
  0.231839   
  0.218114   
  0.215627   
  0.198077   
  0.18193    
  0.129535   
  0.0526874  
  0.0512418  
 -3.06768e-14



In [17]:

    
v[:,1]  # the first eigenvector









    Out[17]:





1092-element Array{Float64,1}:
 -0.0147124
 -0.0154688
 -0.0135991
 -0.0151747
 -0.0146118
 -0.0122881
 -0.0136388
 -0.0135484
 -0.0150387
 -0.0134909
 -0.0138333
 -0.0133675
 -0.0135584
  ⋮        
 -0.0111869
 -0.0132867
 -0.0128998
 -0.0133382
 -0.0121674
 -0.0130977
 -0.0139683
 -0.0140026
 -0.0163832
 -0.0139423
 -0.0127312
 -0.0121628



In [18]:

    
# load the Gadfly package for plotting
using Gadfly









    



WARNING: Method definition describe(AbstractArray) in module StatsBase at /opt/julia_packages/.julia/v0.5/StatsBase/src/scalarstats.jl:573 overwritten in module DataFrames at /opt/julia_packages/.julia/v0.5/DataFrames/src/abstractdataframe/abstractdataframe.jl:407.



In [19]:

    
using StatsBase

# get population information
pop = seqGetData(f, "sample.annotation/Population")
countmap(pop)









    Out[19]:





Dict{String,Int64} with 14 entries:
  "CHS" => 100
  "PUR" => 55
  "CLM" => 60
  "FIN" => 93
  "TSI" => 98
  "LWK" => 97
  "JPT" => 89
  "MXL" => 66
  "CEU" => 85
  "YRI" => 88
  "ASW" => 61
  "CHB" => 97
  "IBS" => 14
  "GBR" => 89



In [20]:

    
# get ancestry information
ancestry = seqGetData(f, "sample.annotation/Ancestry")
countmap(ancestry)









    Out[20]:





Dict{String,Int64} with 4 entries:
  "EastAsia"     => 286
  "Europe"       => 379
  "SouthAmerica" => 181
  "Africa"       => 246



In [21]:

    
# add colors
plot(x=v[:,1], y=v[:,2], color=ancestry, Geom.point, Guide.xlabel("PC 1"), Guide.ylabel("PC 2"))









    Out[21]: