In [1]:
# load the JSeqArray package
using JSeqArray
Open an existing SeqArray file and display its structure. The dimensions of sample.id
and variant.id
tell you the total numbers of samples and variants, i.e., 1092 samples and 19,773 variants.
In [2]:
# get the file name of example data
fn = seqExample(:kg)
f = seqOpen(fn)
Out[2]:
Genotypic data and annotations are stored in an array-oriented manner, providing efficient data access using the Julia programming language. seqSetFilter()
and seqGetData()
can be used together to retrieve data for a selected set of samples from a defined genomic region. seqApply()
applies a user-defined function to array margins of genotypes and annotations.
In [3]:
sampid = seqGetData(f, "sample.id") # a list of sample IDs
Out[3]:
In [4]:
varid = seqGetData(f, "variant.id") # a list of variant IDs
Out[4]:
In [5]:
seqGetData(f, "annotation/qual")
Out[5]:
In [6]:
seqGetData(f, "annotation/filter")
Out[6]:
In [7]:
seqFilterSet(f, sample_id=sampid[1:4], variant_id=varid[1:6])
Get data from the subset of variant data.
In [8]:
seqGetData(f, "chromosome")
Out[8]:
In [9]:
seqGetData(f, "allele")
Out[9]:
In [10]:
# 0: the reference allele, 1: the first alternative allele, 0xFF: missing value
geno = seqGetData(f, "genotype")
Out[10]:
In [11]:
dosage = seqGetData(f, "#dosage") # 0xFF: missing value
Out[11]:
In [12]:
seqFilterReset(f) # reset the filter
In [13]:
af = seqApply(f, "genotype", asis=:unlist) do geno::Array{UInt8,3}
N = size(geno, 3)
rv = Vector{Float64}(N)
for k in 1:N
sum = 0; n = 0
for g in geno[:,:,k]
if g != 0xFF
sum += g == 0 # 0 is the reference allele
n += 1
end
end
rv[k] = sum / n
end
return rv
end
Out[13]:
In [14]:
seqFilterReset(f) # reset the filter
In [15]:
# initialize the covariance matrix
ss = Any[0.0]
seqApply(f, "#dosage", ss) do geno::Matrix{UInt8}, ss::Vector{Any}
# calculate allele frequencies
N = size(geno, 2); af = Vector{Float64}(N)
for i in 1:N
g = geno[:,i]; af[i] = mean(g[g .!= 0xFF])
end
af *= 0.5
# normalized by allele frequencies
g = Matrix{Float64}(size(geno))
for i in 1:N
g[:,i] = (geno[:,i] - 2*af[i]) / sqrt(af[i] * (1 - af[i]))
end
# correct missing genotypes
g[isnan(g)] = 0.0; g[geno .== 0xFF] = 0.0
# update the cov matrix
ss[1] += g * g'
end
# scaled
ss[1] *= size(ss[1], 1) / sum(diag(ss[1]))
ss[1]
Out[15]:
In [16]:
# eigen-decomposition
(w, v) = eig(-ss[1])
-w # eigenvalues
Out[16]:
In [17]:
v[:,1] # the first eigenvector
Out[17]:
In [18]:
# load the Gadfly package for plotting
using Gadfly
In [19]:
using StatsBase
# get population information
pop = seqGetData(f, "sample.annotation/Population")
countmap(pop)
Out[19]:
In [20]:
# get ancestry information
ancestry = seqGetData(f, "sample.annotation/Ancestry")
countmap(ancestry)
Out[20]:
In [21]:
# add colors
plot(x=v[:,1], y=v[:,2], color=ancestry, Geom.point, Guide.xlabel("PC 1"), Guide.ylabel("PC 2"))
Out[21]: