This demo would do the following:
pop
in their metadata file) based on characters present at specific positions. chr21:33,177,777-33,708,621
Targetting this region of the human genome (a basically random choice): chr21:33,177,777-33,708,621
This URL looks promising for auto-generating, relevant variables are defined below. http://browser.1000genomes.org/Homo_sapiens/Export/Output/Location?db=core;flank3_display=0;flank5_display=0;output=fasta;r=21:33177777-33708621;strand=feature;coding=yes;cdna=yes;peptide=yes;utr3=yes;exon=yes;intron=yes;genomic=hard_masked;utr5=yes;_format=TextGz
In [1]:
from __future__ import division
# variables defining the targeted region
chromosome = 21
start_pos = 33177777
end_pos = 33708621
In [2]:
# The file comes down as ``ensembl.txt`` - we can sniff that this is fasta, and then load into a sequence.
## The Ns in this sequence, I believe, are masked repeats or low-complexity regions.
## Confirm that there are not lowercase characters in here...
reference_fasta_fp = "ensembl.txt"
!head $reference_fasta_fp
In [3]:
import skbio
ref = skbio.io.read(reference_fasta_fp, into=skbio.DNA)
In [4]:
import pandas as pd
import numpy as np
# this file isn't public b/c I'm not sure that I want it to be. Greg will share the link...
greg_data_fp = "genome_Gregory_Caporaso_Full_20150702133442.txt"
df = pd.read_table(greg_data_fp,
comment='#',
names=["rsid", "chromosome", "position", "genotype"])
chromosome_in_range = df['chromosome'] == chromosome
position_in_range = np.logical_and(start_pos <= df['position'], end_pos >= df['position'])
gregs_data = df[np.logical_and(chromosome_in_range, position_in_range)]
gregs_data[:10]
Out[4]:
In [5]:
ref
Out[5]:
In [6]:
# Find out what base is in the reference for one of my SNPs. In this case,
# I'm heterozygous for A/G at this position, and the reference has a G.
ref[33179371 - start_pos]
Out[6]:
In [59]:
# Find out how many times I'm the same as the reference.
count_same = 0
count = 0
for e in gregs_data.iterrows():
count += 1
if str(ref[e[1].position - start_pos]) in e[1].genotype:
count_same += 1
print count_same / count, count
Building fasta from vcf files
In [75]:
# First we need to create the index and dict files, as described here:
# https://www.broadinstitute.org/gatk/guide/article?id=2798
reference_dict_fp = '%s.dict' % reference_fasta_fp
!java -jar /Applications/picard-tools-1.135/picard.jar CreateSequenceDictionary R= $reference_fasta_fp O= $reference_dict_fp
reference_idx_fp = '%s.fai' % reference_fasta_fp
!samtools faidx $reference_fasta_fp
In [67]:
# Then, get the vcf file corresponding to this region - I don't think we can change this URL,
# it points to a file that I initiated creation of on the web server.
!wget http://browser.1000genomes.org/tmp/slicer/filtered_21.33177777-33708621.ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
!gunzip filtered_21.33177777-33708621.ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
!head filtered_21.33177777-33708621.ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf
In [76]:
!java -Xmx2g -jar /Applications/GenomeAnalysisTK.jar -R ensembl.txt -T FastaAlternateReferenceMaker -o MY_REFERENCE_WITH_SNPS_FROM_VCF.fa --variant filtered_21.33177777-33708621.ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf
In [9]:
## For some reason, the result is the same as my reference sequence... Is this a problem with how I'm
## using GATK, or with my VCF file (i.e., there really is no variation from the reference). Also,
## this is only generating one sequence. I want one per subject...
## Looking at the vcf file, I see a lot of "0|0" in the first column of subject variant data,
## so it may really be that there is no variation in this region and I should pick another. Puzzling though
## that I have so much variation if that's the case - am I doing something wrong there??
vcp_snped_ref = skbio.io.read('MY_REFERENCE_WITH_SNPS_FROM_VCF.fa', into=skbio.DNA)
len(vcp_snped_ref), len(ref)
vcp_snped_ref.fraction_diff(ref)
Out[9]:
In [8]:
# This is the 1000 genomes sample (i.e., host) metadata (John, do you know where the lookup is for the "pop" field?)
!wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel
!head -n 50 integrated_call_samples_v3.20130502.ALL.panel
In [ ]: