Julia tutorial with SeqArray files


In [1]:
# load the JSeqArray package
using JSeqArray

Open an existing SeqArray file and display its structure. The dimensions of sample.id and variant.id tell you the total numbers of samples and variants, i.e., 1092 samples and 19,773 variants.


In [2]:
# get the file name of example data
fn = seqExample(:kg)
f = seqOpen(fn)


Out[2]:
SeqArray File: /home/juser/.julia/v0.5/JSeqArray/demo/data/1KG_phase1_release_v3_chr22.gds (0B)
+    [  ] *
|--+ description   [  ] *
|--+ sample.id   { Str8 1092 LZMA_ra(10.5%), 914B } *
|--+ variant.id   { Int32 19773 LZMA_ra(8.39%), 6.6K } *
|--+ position   { Int32 19773 LZMA_ra(52.0%), 41.1K } *
|--+ chromosome   { Str8 19773 LZMA_ra(0.28%), 166B } *
|--+ allele   { Str8 19773 LZMA_ra(22.7%), 111.9K } *
|--+ genotype   [  ] *
|  |--+ data   { Bit2 2x1092x19773 LZMA_ra(8.17%), 882.5K } *
|  |--+ extra.index   { Int32 3x0 LZMA_ra, 19B } *
|  \--+ extra   { Int16 0 LZMA_ra, 19B }
|--+ phase   [  ]
|  |--+ data   { Bit1 1092x19773 LZMA_ra(0.02%), 550B } *
|  |--+ extra.index   { Int32 3x0 LZMA_ra, 19B } *
|  \--+ extra   { Bit1 0 LZMA_ra, 19B }
|--+ annotation   [  ]
|  |--+ id   { Str8 19773 LZMA_ra(35.2%), 77.0K } *
|  |--+ qual   { Float32 19773 LZMA_ra(3.62%), 2.9K } *
|  |--+ filter   { Int32,factor 19773 LZMA_ra(0.21%), 170B } *
|  |--+ info   [  ]
|  \--+ format   [  ]
\--+ sample.annotation   [  ]
   |--+ Family.ID   { Str8 1092 LZMA_ra(15.3%), 1.1K }
   |--+ Population   { Str8 1092 LZMA_ra(5.08%), 222B }
   |--+ Gender   { Str8 1092 LZMA_ra(5.85%), 386B }
   \--+ Ancestry   { Str8 1092 LZMA_ra(2.43%), 233B }

Genotypic data and annotations are stored in an array-oriented manner, providing efficient data access using the Julia programming language. seqSetFilter() and seqGetData() can be used together to retrieve data for a selected set of samples from a defined genomic region. seqApply() applies a user-defined function to array margins of genotypes and annotations.


In [3]:
sampid = seqGetData(f, "sample.id")  # a list of sample IDs


Out[3]:
1092-element Array{String,1}:
 "HG00096"
 "HG00097"
 "HG00099"
 "HG00100"
 "HG00101"
 "HG00102"
 "HG00103"
 "HG00104"
 "HG00106"
 "HG00108"
 "HG00109"
 "HG00110"
 "HG00111"
 ⋮        
 "NA20809"
 "NA20810"
 "NA20811"
 "NA20812"
 "NA20813"
 "NA20814"
 "NA20815"
 "NA20816"
 "NA20818"
 "NA20819"
 "NA20826"
 "NA20828"

In [4]:
varid = seqGetData(f, "variant.id")  # a list of variant IDs


Out[4]:
19773-element Array{Int32,1}:
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
     ⋮
 19762
 19763
 19764
 19765
 19766
 19767
 19768
 19769
 19770
 19771
 19772
 19773

In [5]:
seqGetData(f, "annotation/qual")


Out[5]:
19773-element Array{Float32,1}:
 100.0
 100.0
 100.0
 100.0
 100.0
 100.0
 100.0
 247.0
  49.0
 100.0
 100.0
 100.0
 100.0
   ⋮  
 280.0
 100.0
 100.0
 100.0
 100.0
 100.0
 100.0
 100.0
 100.0
 100.0
 100.0
 100.0

In [6]:
seqGetData(f, "annotation/filter")


Out[6]:
19773-element Array{String,1}:
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 ⋮     
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"
 "PASS"

Define a subset of samples and variants


In [7]:
seqFilterSet(f, sample_id=sampid[1:4], variant_id=varid[1:6])


# of selected samples: 4
# of selected variants: 6

Get data from the subset of variant data.


In [8]:
seqGetData(f, "chromosome")


Out[8]:
6-element Array{String,1}:
 "22"
 "22"
 "22"
 "22"
 "22"
 "22"

In [9]:
seqGetData(f, "allele")


Out[9]:
6-element Array{String,1}:
 "A,G"
 "G,A"
 "G,T"
 "C,T"
 "G,A"
 "G,A"

In [10]:
# 0: the reference allele, 1: the first alternative allele, 0xFF: missing value
geno = seqGetData(f, "genotype")


Out[10]:
2×4×6 Array{UInt8,3}:
[:, :, 1] =
 0x00  0x01  0x01  0x00
 0x00  0x00  0x00  0x01

[:, :, 2] =
 0x00  0x00  0x00  0x00
 0x00  0x01  0x01  0x00

[:, :, 3] =
 0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00

[:, :, 4] =
 0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00

[:, :, 5] =
 0x00  0x01  0x00  0x00
 0x00  0x00  0x01  0x00

[:, :, 6] =
 0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00

In [11]:
dosage = seqGetData(f, "#dosage")  # 0xFF: missing value


Out[11]:
4×6 Array{UInt8,2}:
 0x02  0x02  0x02  0x02  0x02  0x02
 0x01  0x01  0x02  0x02  0x01  0x02
 0x01  0x01  0x02  0x02  0x01  0x02
 0x01  0x02  0x02  0x02  0x02  0x02

Calculation of Reference Allele Frequencies


In [12]:
seqFilterReset(f)  # reset the filter


# of selected samples: 1,092
# of selected variants: 19,773

In [13]:
af = seqApply(f, "genotype", asis=:unlist) do geno::Array{UInt8,3}
    N = size(geno, 3)
    rv = Vector{Float64}(N)
    for k in 1:N
        sum = 0; n = 0
        for g in geno[:,:,k]
            if g != 0xFF
                sum += g == 0  # 0 is the reference allele
                n += 1
            end
        end
        rv[k] = sum / n
    end
    return rv
end


[==================================================] 100%, completed in 0s
Out[13]:
19773-element Array{Float64,1}:
 0.695055   
 0.943223   
 0.999542   
 0.999542   
 0.938645   
 0.999084   
 0.000457875
 0.988095   
 0.969322   
 0.989011   
 0.990385   
 0.947802   
 0.998168   
 ⋮          
 0.91163    
 0.999084   
 0.999542   
 0.992216   
 0.981685   
 0.981685   
 0.952839   
 0.994048   
 0.999542   
 0.996795   
 0.658425   
 0.913462   

Principal Component Analysis


In [14]:
seqFilterReset(f)  # reset the filter


# of selected samples: 1,092
# of selected variants: 19,773

In [15]:
# initialize the covariance matrix
ss = Any[0.0]

seqApply(f, "#dosage", ss) do geno::Matrix{UInt8}, ss::Vector{Any}
    # calculate allele frequencies
    N = size(geno, 2); af = Vector{Float64}(N)
    for i in 1:N
        g = geno[:,i]; af[i] = mean(g[g .!= 0xFF])
    end
    af *= 0.5
    # normalized by allele frequencies
    g = Matrix{Float64}(size(geno))
    for i in 1:N
        g[:,i] = (geno[:,i] - 2*af[i]) / sqrt(af[i] * (1 - af[i]))
    end
    # correct missing genotypes
    g[isnan(g)] = 0.0; g[geno .== 0xFF] = 0.0
    # update the cov matrix
    ss[1] += g * g'
end

# scaled
ss[1] *= size(ss[1], 1) / sum(diag(ss[1]))
ss[1]


[==================================================] 100%, completed in 0s
Out[15]:
1092×1092 Array{Float64,2}:
  0.7017      0.00830316   0.0511373   …  0.0197863   0.0513473   0.012317  
  0.00830316  0.622809     0.033266       0.0271662   0.0278968   0.0235471 
  0.0511373   0.033266     0.773402       0.0101346   0.0304234   0.0244011 
  0.0330731   0.0481347    0.0231505      0.0238398   0.0237653   0.0206459 
  0.0158933   0.0377345    0.0336195      0.00450558  0.0250328   0.022167  
  0.043194    0.0221012    0.0484443   …  0.0144089   0.0330417   0.0256502 
  0.0466153   0.0219024    0.030456       0.0242625   0.00740767  0.0242048 
  0.0108151   0.0258773    0.035232       0.0158535   0.0290021   0.0233435 
  0.0331725   0.0301866    0.0309742      0.0184776   0.0142575   0.0302456 
  0.0272684   0.00494007   0.00776384     0.0196036   0.0113146   0.0267043 
  0.0339346   0.020939     0.0347542   …  0.0162531   0.0106022   0.028762  
  0.0248709   0.0224371    0.0384413      0.0232937   0.0151097   0.0117061 
  0.0265906   0.0344899    0.0139562      0.017142    0.0255764   0.0154684 
  ⋮                                    ⋱              ⋮                     
  0.0189681   0.0193366    0.00727019  …  0.014465    0.0263022   0.020402  
  0.0190682   0.000678911  0.0322396      0.0236176   0.0233977   0.0201988 
 -0.00441231  0.0105124    0.0108247      0.0312933   0.0182148   0.00521477
  0.0373295   0.0112624    0.0174922      0.0209972   0.0507611   0.0142456 
  0.00217496  0.0315914    0.0222844      0.0143468   0.0281275   0.00932025
  0.0313247   0.0282472    0.00793672  …  0.0130704   0.0200056   0.0152904 
  0.0418121   0.0261307    0.0114143      0.0068133   0.0244468   0.0211374 
  0.0249877   0.0205697    0.0247884      0.0248403   0.0171583   0.0355916 
  0.0419023   0.0293111    0.0112945      0.0293121   0.0461871   0.0148807 
  0.0197863   0.0271662    0.0101346      0.922018    0.0224457   0.0109136 
  0.0513473   0.0278968    0.0304234   …  0.0224457   0.631799    0.00977053
  0.012317    0.0235471    0.0244011      0.0109136   0.00977053  0.979549  

In [16]:
# eigen-decomposition
(w, v) = eig(-ss[1])

-w  # eigenvalues


Out[16]:
1092-element Array{Float64,1}:
 39.6052     
 16.5196     
  5.18983    
  4.78401    
  4.59256    
  4.48968    
  4.07638    
  4.00637    
  3.89435    
  3.6689     
  3.55935    
  3.37597    
  3.2553     
  ⋮          
  0.240203   
  0.236452   
  0.23431    
  0.231839   
  0.218114   
  0.215627   
  0.198077   
  0.18193    
  0.129535   
  0.0526874  
  0.0512418  
 -3.06768e-14

In [17]:
v[:,1]  # the first eigenvector


Out[17]:
1092-element Array{Float64,1}:
 -0.0147124
 -0.0154688
 -0.0135991
 -0.0151747
 -0.0146118
 -0.0122881
 -0.0136388
 -0.0135484
 -0.0150387
 -0.0134909
 -0.0138333
 -0.0133675
 -0.0135584
  ⋮        
 -0.0111869
 -0.0132867
 -0.0128998
 -0.0133382
 -0.0121674
 -0.0130977
 -0.0139683
 -0.0140026
 -0.0163832
 -0.0139423
 -0.0127312
 -0.0121628

In [18]:
# load the Gadfly package for plotting
using Gadfly


WARNING: Method definition describe(AbstractArray) in module StatsBase at /opt/julia_packages/.julia/v0.5/StatsBase/src/scalarstats.jl:573 overwritten in module DataFrames at /opt/julia_packages/.julia/v0.5/DataFrames/src/abstractdataframe/abstractdataframe.jl:407.

In [19]:
using StatsBase

# get population information
pop = seqGetData(f, "sample.annotation/Population")
countmap(pop)


Out[19]:
Dict{String,Int64} with 14 entries:
  "CHS" => 100
  "PUR" => 55
  "CLM" => 60
  "FIN" => 93
  "TSI" => 98
  "LWK" => 97
  "JPT" => 89
  "MXL" => 66
  "CEU" => 85
  "YRI" => 88
  "ASW" => 61
  "CHB" => 97
  "IBS" => 14
  "GBR" => 89

In [20]:
# get ancestry information
ancestry = seqGetData(f, "sample.annotation/Ancestry")
countmap(ancestry)


Out[20]:
Dict{String,Int64} with 4 entries:
  "EastAsia"     => 286
  "Europe"       => 379
  "SouthAmerica" => 181
  "Africa"       => 246

In [21]:
# add colors
plot(x=v[:,1], y=v[:,2], color=ancestry, Geom.point, Guide.xlabel("PC 1"), Guide.ylabel("PC 2"))


Out[21]:
PC 1 -0.150 -0.125 -0.100 -0.075 -0.050 -0.025 0.000 0.025 0.050 0.075 0.100 0.125 0.150 0.175 0.200 -0.125 -0.120 -0.115 -0.110 -0.105 -0.100 -0.095 -0.090 -0.085 -0.080 -0.075 -0.070 -0.065 -0.060 -0.055 -0.050 -0.045 -0.040 -0.035 -0.030 -0.025 -0.020 -0.015 -0.010 -0.005 0.000 0.005 0.010 0.015 0.020 0.025 0.030 0.035 0.040 0.045 0.050 0.055 0.060 0.065 0.070 0.075 0.080 0.085 0.090 0.095 0.100 0.105 0.110 0.115 0.120 0.125 0.130 0.135 0.140 0.145 0.150 0.155 0.160 0.165 0.170 0.175 0.180 -0.2 -0.1 0.0 0.1 0.2 -0.13 -0.12 -0.11 -0.10 -0.09 -0.08 -0.07 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 Europe EastAsia SouthAmerica Africa Color -0.25 -0.20 -0.15 -0.10 -0.05 0.00 0.05 0.10 0.15 0.20 0.25 0.30 -0.200 -0.195 -0.190 -0.185 -0.180 -0.175 -0.170 -0.165 -0.160 -0.155 -0.150 -0.145 -0.140 -0.135 -0.130 -0.125 -0.120 -0.115 -0.110 -0.105 -0.100 -0.095 -0.090 -0.085 -0.080 -0.075 -0.070 -0.065 -0.060 -0.055 -0.050 -0.045 -0.040 -0.035 -0.030 -0.025 -0.020 -0.015 -0.010 -0.005 0.000 0.005 0.010 0.015 0.020 0.025 0.030 0.035 0.040 0.045 0.050 0.055 0.060 0.065 0.070 0.075 0.080 0.085 0.090 0.095 0.100 0.105 0.110 0.115 0.120 0.125 0.130 0.135 0.140 0.145 0.150 0.155 0.160 0.165 0.170 0.175 0.180 0.185 0.190 0.195 0.200 0.205 0.210 0.215 0.220 0.225 0.230 0.235 0.240 0.245 0.250 -0.2 0.0 0.2 0.4 -0.20 -0.19 -0.18 -0.17 -0.16 -0.15 -0.14 -0.13 -0.12 -0.11 -0.10 -0.09 -0.08 -0.07 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 PC 2