Many genome assembly tools will write variant SNP calls to the VCF format (variant call format). This is a plain text file that stores variant calls relative to a reference genome in tabular format. It includes a lot of additional information about the quality of SNP calls, etc., but is not very easy to read or efficient to parse. To make analyses run a bit faster ipyrad uses a simplified format to store this information in the form of an HDF5 database. You can easily convert any VCF file to this HDF5 format using the
This tool includes an added benefit of allowing you to enter an (optional)
ld_block_size argument when creating the file which will store information that can be used downstream by many other tools to subsample SNPs and perform bootstrap resampling in a way that reduces the effects of linkage among SNPs. If your data are assembled RAD data then the ld_block_size is not required, since we can simply use RAD loci as the linkage blocks. But if you want to combine reference-mapped RAD loci located nearby in the genome as being on the same linkage block then you can enter a value such as 50,000 to create 50Kb linkage block that will join many RAD loci together and sample only 1 SNP per block in each bootstrap replicate. If your data are not RAD data, e.g., whole genome data, then the ld_block_size argument will be required in order to encode linkage information as discrete blocks into your database.
In :# conda install ipyrad -c bioconda # conda install htslib -c bioconda # conda install bcftools -c bioconda
In :import ipyrad.analysis as ipa import pandas as pd
You can use the program
bcftools to pre-filter your data to exclude indels and low quality SNPs. If you ran the
conda install commands above then you will have all of the required tools installed. To achieve the format that ipyrad expects you will need to exclude indel containing SNPs (this may change in the future). Further quality filtering is optional.
The example below reduced the size of a VCF data file from 29Gb to 80Mb! VCF contains a lot of information that you do not need to retain through all of your analyses. We will keep only the final genotype calls.
Note that the code below is bash script. You can run this from a terminal, or in a jupyter notebook by appending the (%%bash) header like below.
In [ ]:%%bash # compress the VCF file if not already done (creates .vcf.gz) bgzip data.vcf # tabix index the compressed VCF (creates .vcf.gz.tbi) tabix data.vcf.gz # remove multi-allelic SNPs and INDELs and PIPE to next command bcftools view -m2 -M2 -i'CIGAR="1X" & QUAL>30' data.vcf.gz -Ou | # remove extra annotations/formatting info and save to new .vcf bcftools annotate -x FORMAT,INFO > data.cleaned.vcf # recompress the final file (create .vcf.gz) bgzip data.cleaned.vcf
In :# load the VCF as an datafram dfchunks = pd.read_csv( "/home/deren/Documents/ipyrad/sandbox/Macaque-Chr1.clean.vcf.gz", sep="\t", skiprows=1000, chunksize=1000, ) # show first few rows of first dataframe chunk next(dfchunks).head()
NC_018152.2 51273 . G A 280.482 ..1 ..2 GT 0/0 ... 0/0.9 0/0.10 0/0.11 0/0.12 0/0.13 0/0.14 0/0.15 0/0.16 0/0.17 0/1.1 0 NC_018152.2 51292 . A G 16750.300 . . GT 1/1 ... 1/1 . 1/1 1/1 1/1 1/1 0/0 1/1 1/1 1/1 1 NC_018152.2 51349 . A G 628.563 . . GT 0/0 ... 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 2 NC_018152.2 51351 . C T 943.353 . . GT 0/0 ... 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 3 NC_018152.2 51352 . G A 607.681 . . GT 0/0 ... 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 4 NC_018152.2 51398 . C T 510.120 . . GT 0/0 ... 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0
5 rows × 29 columns
Here I using a VCF file from whole geome data for 20 monkey's from an unpublished study (in progress). It contains >6M SNPs all from chromosome 1. Because many SNPs are close together and thus tightly linked we will likely wish to take linkage into account in our downstream analyses.
The ipyrad analysis tools can do this by encoding linkage block information into the HDF5 file. Here we encode
ld_block_size of 20K bp. This breaks the 1 scaffold (chromosome) into about 10K linkage blocks. See the example below of this information being used in an ipyrad PCA analysis.
In :# init a conversion tool converter = ipa.vcf_to_hdf5( name="Macaque_LD20K", data="/home/deren/Documents/ipyrad/sandbox/Macaque-Chr1.clean.vcf.gz", ld_block_size=20000, ) # run the converter converter.run()
Indexing VCF to HDF5 database file VCF: 6094152 SNPs; 1 scaffolds [####################] 100% 0:02:22 | converting VCF to HDF5 HDF5: 6094152 SNPs; 10845 linkage group SNP database written to ./analysis-vcf2hdf5/Macaque_LD20K.snps.hdf5
In :# init a PCA tool and filter to allow no missing data pca = ipa.pca( data="./analysis-vcf2hdf5/Macaque_LD20K.snps.hdf5", mincov=1.0, )
Samples: 20 Sites before filtering: 6094152 Filtered (indels): 0 Filtered (bi-allel): 0 Filtered (mincov): 794597 Filtered (minmap): 0 Filtered (combined): 794597 Sites after filtering: 5299555 Sites containing missing values: 0 (0.00%) Missing values in SNP matrix: 0 (0.00%)
In :pca.run_and_plot_2D(0, 1, seed=123);
Subsampling SNPs: 10841/5299555
Here you can see the results for a different 10K SNPs that are sampled in each replicate iteration. If the signal in the data is robust then we should expect to see the points clustering at a similar place across replicates. Internally ipyrad will rotate axes to ensure the replicate plots align despite axes swapping (which is arbitrary in PCA space). You can see this provides a better view of uncertainty in our estimates than the plot above (and it looks cool!)
In :pca.run_and_plot_2D(0, 1, seed=123, nreplicates=25);
Subsampling SNPs: 10841/5299555