Julia tutorial with parallel programming


In [1]:
# load the JSeqArray package
using JSeqArray

Open an existing SeqArray file and display its structure. The dimensions of sample.id and variant.id tell you the total numbers of samples and variants, i.e., 1092 samples and 19,773 variants.


In [2]:
# get the file name of example data
fn = seqExample(:kg)
f = seqOpen(fn)


Out[2]:
SeqArray File: /home/juser/.julia/v0.5/JSeqArray/demo/data/1KG_phase1_release_v3_chr22.gds (0B)
+    [  ] *
|--+ description   [  ] *
|--+ sample.id   { Str8 1092 LZMA_ra(10.5%), 914B } *
|--+ variant.id   { Int32 19773 LZMA_ra(8.39%), 6.6K } *
|--+ position   { Int32 19773 LZMA_ra(52.0%), 41.1K } *
|--+ chromosome   { Str8 19773 LZMA_ra(0.28%), 166B } *
|--+ allele   { Str8 19773 LZMA_ra(22.7%), 111.9K } *
|--+ genotype   [  ] *
|  |--+ data   { Bit2 2x1092x19773 LZMA_ra(8.17%), 882.5K } *
|  |--+ extra.index   { Int32 3x0 LZMA_ra, 19B } *
|  \--+ extra   { Int16 0 LZMA_ra, 19B }
|--+ phase   [  ]
|  |--+ data   { Bit1 1092x19773 LZMA_ra(0.02%), 550B } *
|  |--+ extra.index   { Int32 3x0 LZMA_ra, 19B } *
|  \--+ extra   { Bit1 0 LZMA_ra, 19B }
|--+ annotation   [  ]
|  |--+ id   { Str8 19773 LZMA_ra(35.2%), 77.0K } *
|  |--+ qual   { Float32 19773 LZMA_ra(3.62%), 2.9K } *
|  |--+ filter   { Int32,factor 19773 LZMA_ra(0.21%), 170B } *
|  |--+ info   [  ]
|  \--+ format   [  ]
\--+ sample.annotation   [  ]
   |--+ Family.ID   { Str8 1092 LZMA_ra(15.3%), 1.1K }
   |--+ Population   { Str8 1092 LZMA_ra(5.08%), 222B }
   |--+ Gender   { Str8 1092 LZMA_ra(5.85%), 386B }
   \--+ Ancestry   { Str8 1092 LZMA_ra(2.43%), 233B }

Genotypic data and annotations are stored in an array-oriented manner, providing efficient data access using the Julia programming language. seqFilterSet() and seqGetData() can be used together to retrieve data for a selected set of samples from a defined genomic region. seqApply() applies a user-defined function to array margins of genotypes and annotations.


In [3]:
# launches the built-in ClusterManager with 4 worker processes and return worker IDs
addprocs(4)


Out[3]:
4-element Array{Int64,1}:
 2
 3
 4
 5

Calculation of Allele Frequencies


In [4]:
seqFilterReset(f)  # reset the filter


# of selected samples: 1,092
# of selected variants: 19,773

Using 1 core


In [5]:
af = seqApply(f, "genotype", asis=:unlist) do geno::Array{UInt8,3}
    N = size(geno, 3)
    rv = Vector{Float64}(N)
    for k in 1:N
        sum = 0; n = 0
        for g in geno[:,:,k]
            if g != 0xFF
                sum += g == 0  # 0 is the reference allele
                n += 1
            end
        end
        rv[k] = sum / n
    end
    return rv
end


[==================================================] 100%, completed in 0s
Out[5]:
19773-element Array{Float64,1}:
 0.695055   
 0.943223   
 0.999542   
 0.999542   
 0.938645   
 0.999084   
 0.000457875
 0.988095   
 0.969322   
 0.989011   
 0.990385   
 0.947802   
 0.998168   
 ⋮          
 0.91163    
 0.999084   
 0.999542   
 0.992216   
 0.981685   
 0.981685   
 0.952839   
 0.994048   
 0.999542   
 0.996795   
 0.658425   
 0.913462   

Using 4 cores from the built-in ClusterManager


In [6]:
af = seqParallel(f) do gdsfile
    v = seqApply(gdsfile, "genotype", asis=:unlist) do geno::Array{UInt8,3}
        N = size(geno, 3)
        rv = Vector{Float64}(N)
        for k in 1:N
            sum = 0; n = 0
            for g in geno[:,:,k]
                if g != 0xFF
                    sum += g == 0  # 0 is the reference allele
                    n += 1
                end
            end
            rv[k] = sum / n
        end
        return rv
    end
    return v
end


[==================================================] 100%, completed in 0s
[==================================================] 100%, completed in 0s
[==================================================] 100%, completed in 0s
[==================================================] 100%, completed in 0s
Out[6]:
19773-element Array{Float64,1}:
 0.695055   
 0.943223   
 0.999542   
 0.999542   
 0.938645   
 0.999084   
 0.000457875
 0.988095   
 0.969322   
 0.989011   
 0.990385   
 0.947802   
 0.998168   
 ⋮          
 0.91163    
 0.999084   
 0.999542   
 0.992216   
 0.981685   
 0.981685   
 0.952839   
 0.994048   
 0.999542   
 0.996795   
 0.658425   
 0.913462   

Principal Component Analysis

Using 1 core


In [7]:
# initialize the covariance matrix
ss = Any[0.0]

seqApply(f, "#dosage", ss) do geno::Matrix{UInt8}, ss::Vector{Any}
    # calculate allele frequencies
    N = size(geno, 2); af = Vector{Float64}(N)
    for i in 1:N
        g = geno[:,i]; af[i] = mean(g[g .!= 0xFF])
    end
    af *= 0.5
    # normalized by allele frequencies
    g = Matrix{Float64}(size(geno))
    for i in 1:N
        g[:,i] = (geno[:,i] - 2*af[i]) / sqrt(af[i] * (1 - af[i]))
    end
    # correct missing genotypes
    g[isnan(g)] = 0.0; g[geno .== 0xFF] = 0.0
    # update the cov matrix
    ss[1] += g * g'
end

# scaled
ss[1] *= size(ss[1], 1) / sum(diag(ss[1]))
ss[1]


[==================================================] 100%, completed in 0s
Out[7]:
1092×1092 Array{Float64,2}:
  0.7017      0.00830316   0.0511373   …  0.0197863   0.0513473   0.012317  
  0.00830316  0.622809     0.033266       0.0271662   0.0278968   0.0235471 
  0.0511373   0.033266     0.773402       0.0101346   0.0304234   0.0244011 
  0.0330731   0.0481347    0.0231505      0.0238398   0.0237653   0.0206459 
  0.0158933   0.0377345    0.0336195      0.00450558  0.0250328   0.022167  
  0.043194    0.0221012    0.0484443   …  0.0144089   0.0330417   0.0256502 
  0.0466153   0.0219024    0.030456       0.0242625   0.00740767  0.0242048 
  0.0108151   0.0258773    0.035232       0.0158535   0.0290021   0.0233435 
  0.0331725   0.0301866    0.0309742      0.0184776   0.0142575   0.0302456 
  0.0272684   0.00494007   0.00776384     0.0196036   0.0113146   0.0267043 
  0.0339346   0.020939     0.0347542   …  0.0162531   0.0106022   0.028762  
  0.0248709   0.0224371    0.0384413      0.0232937   0.0151097   0.0117061 
  0.0265906   0.0344899    0.0139562      0.017142    0.0255764   0.0154684 
  ⋮                                    ⋱              ⋮                     
  0.0189681   0.0193366    0.00727019  …  0.014465    0.0263022   0.020402  
  0.0190682   0.000678911  0.0322396      0.0236176   0.0233977   0.0201988 
 -0.00441231  0.0105124    0.0108247      0.0312933   0.0182148   0.00521477
  0.0373295   0.0112624    0.0174922      0.0209972   0.0507611   0.0142456 
  0.00217496  0.0315914    0.0222844      0.0143468   0.0281275   0.00932025
  0.0313247   0.0282472    0.00793672  …  0.0130704   0.0200056   0.0152904 
  0.0418121   0.0261307    0.0114143      0.0068133   0.0244468   0.0211374 
  0.0249877   0.0205697    0.0247884      0.0248403   0.0171583   0.0355916 
  0.0419023   0.0293111    0.0112945      0.0293121   0.0461871   0.0148807 
  0.0197863   0.0271662    0.0101346      0.922018    0.0224457   0.0109136 
  0.0513473   0.0278968    0.0304234   …  0.0224457   0.631799    0.00977053
  0.012317    0.0235471    0.0244011      0.0109136   0.00977053  0.979549  

Using 4 cores from the built-in ClusterManager


In [8]:
# initialize the covariance matrix
cov2 = seqParallel(f, combine=+) do gdsfile
    ss = Any[0.0]
    seqApply(gdsfile, "#dosage", ss) do geno::Matrix{UInt8}, ss::Vector{Any}
        # calculate allele frequencies
        N = size(geno, 2); af = Vector{Float64}(N)
        for i in 1:N
            g = geno[:,i]; af[i] = mean(g[g .!= 0xFF])
        end
        af *= 0.5
        # normalized by allele frequencies
        g = Matrix{Float64}(size(geno))
        for i in 1:N
            g[:,i] = (geno[:,i] - 2*af[i]) / sqrt(af[i] * (1 - af[i]))
        end
        # correct missing genotypes
        g[isnan(g)] = 0.0; g[geno .== 0xFF] = 0.0
        # update the cov matrix
        ss[1] += g * transpose(g)
    end
    return ss[1]
end

cov2 *= size(cov2, 1) / sum(diag(cov2))
cov2


[==================================================] 100%, completed in 0s
[==================================================] 100%, completed in 0s
[==================================================] 100%, completed in 0s
[==================================================] 100%, completed in 0s
Out[8]:
1092×1092 Array{Float64,2}:
  0.7017      0.00830316   0.0511373   …  0.0197863   0.0513473   0.012317  
  0.00830316  0.622809     0.033266       0.0271662   0.0278968   0.0235471 
  0.0511373   0.033266     0.773402       0.0101346   0.0304234   0.0244011 
  0.0330731   0.0481347    0.0231505      0.0238398   0.0237653   0.0206459 
  0.0158933   0.0377345    0.0336195      0.00450558  0.0250328   0.022167  
  0.043194    0.0221012    0.0484443   …  0.0144089   0.0330417   0.0256502 
  0.0466153   0.0219024    0.030456       0.0242625   0.00740767  0.0242048 
  0.0108151   0.0258773    0.035232       0.0158535   0.0290021   0.0233435 
  0.0331725   0.0301866    0.0309742      0.0184776   0.0142575   0.0302456 
  0.0272684   0.00494007   0.00776384     0.0196036   0.0113146   0.0267043 
  0.0339346   0.020939     0.0347542   …  0.0162531   0.0106022   0.028762  
  0.0248709   0.0224371    0.0384413      0.0232937   0.0151097   0.0117061 
  0.0265906   0.0344899    0.0139562      0.017142    0.0255764   0.0154684 
  ⋮                                    ⋱              ⋮                     
  0.0189681   0.0193366    0.00727019  …  0.014465    0.0263022   0.020402  
  0.0190682   0.000678911  0.0322396      0.0236176   0.0233977   0.0201988 
 -0.00441231  0.0105124    0.0108247      0.0312933   0.0182148   0.00521477
  0.0373295   0.0112624    0.0174922      0.0209972   0.0507611   0.0142456 
  0.00217496  0.0315914    0.0222844      0.0143468   0.0281275   0.00932025
  0.0313247   0.0282472    0.00793672  …  0.0130704   0.0200056   0.0152904 
  0.0418121   0.0261307    0.0114143      0.0068133   0.0244468   0.0211374 
  0.0249877   0.0205697    0.0247884      0.0248403   0.0171583   0.0355916 
  0.0419023   0.0293111    0.0112945      0.0293121   0.0461871   0.0148807 
  0.0197863   0.0271662    0.0101346      0.922018    0.0224457   0.0109136 
  0.0513473   0.0278968    0.0304234   …  0.0224457   0.631799    0.00977053
  0.012317    0.0235471    0.0244011      0.0109136   0.00977053  0.979549  

In [9]:
# eigen-decomposition
(w, v) = eig(-cov2)

-w  # eigenvalues


Out[9]:
1092-element Array{Float64,1}:
 39.6052     
 16.5196     
  5.18983    
  4.78401    
  4.59256    
  4.48968    
  4.07638    
  4.00637    
  3.89435    
  3.6689     
  3.55935    
  3.37597    
  3.2553     
  ⋮          
  0.240203   
  0.236452   
  0.23431    
  0.231839   
  0.218114   
  0.215627   
  0.198077   
  0.18193    
  0.129535   
  0.0526874  
  0.0512418  
 -3.07798e-14

In [10]:
v[:,1]  # the first eigenvector


Out[10]:
1092-element Array{Float64,1}:
 -0.0147124
 -0.0154688
 -0.0135991
 -0.0151747
 -0.0146118
 -0.0122881
 -0.0136388
 -0.0135484
 -0.0150387
 -0.0134909
 -0.0138333
 -0.0133675
 -0.0135584
  ⋮        
 -0.0111869
 -0.0132867
 -0.0128998
 -0.0133382
 -0.0121674
 -0.0130977
 -0.0139683
 -0.0140026
 -0.0163832
 -0.0139423
 -0.0127312
 -0.0121628