This demo would do the following:

  1. Get some human genome reference data for a specific region of the human genome. This will show off I/O, sniffing, and loading remote gzipped files.
  2. Load Greg's 23andme SNP profile data for the same region.
  3. Determine how Greg's SNP profile compares to the reference genome. This can show off some skbio.sequence API stuff maybe?
  4. Load the 1000 genomes variant data for this region into sequence objects. Build a sklearn classifier to classify population (pop in their metadata file) based on characters present at specific positions.
  5. Apply the classifier to classify Greg's population using his SNP data, which will be a subset of the 1000 genomes data.
  • [x] reference human assembly build 37 (also known as Annotation Release 104) exported from 1000 Genomes; working below with chr21:33,177,777-33,708,621
  • [x] Greg's 23andme data relatable to the reference genome
  • [x] variant data from 1000 genomes as vcf
  • [x] host metadata from 1000 genomes
  • [ ] vcf to fasta (in progress, just not working right)
  • [ ] classifiers

Step 1

Targetting this region of the human genome (a basically random choice): chr21:33,177,777-33,708,621

This URL looks promising for auto-generating, relevant variables are defined below. http://browser.1000genomes.org/Homo_sapiens/Export/Output/Location?db=core;flank3_display=0;flank5_display=0;output=fasta;r=21:33177777-33708621;strand=feature;coding=yes;cdna=yes;peptide=yes;utr3=yes;exon=yes;intron=yes;genomic=hard_masked;utr5=yes;_format=TextGz


In [1]:
from __future__ import division

# variables defining the targeted region
chromosome = 21
start_pos = 33177777
end_pos = 33708621

In [2]:
# The file comes down as ``ensembl.txt`` - we can sniff that this is fasta, and then load into a sequence. 

## The Ns in this sequence, I believe, are masked repeats or low-complexity regions.
## Confirm that there are not lowercase characters in here...

reference_fasta_fp = "ensembl.txt"
!head $reference_fasta_fp












In [3]:
import skbio
ref = skbio.io.read(reference_fasta_fp, into=skbio.DNA)

Step 2 (no skbio here)


In [4]:
import pandas as pd
import numpy as np

# this file isn't public b/c I'm not sure that I want it to be. Greg will share the link...
greg_data_fp = "genome_Gregory_Caporaso_Full_20150702133442.txt"
df = pd.read_table(greg_data_fp,
                   comment='#',
                   names=["rsid", "chromosome", "position", "genotype"])
chromosome_in_range = df['chromosome'] == chromosome
position_in_range = np.logical_and(start_pos <= df['position'], end_pos >= df['position'])

gregs_data = df[np.logical_and(chromosome_in_range, position_in_range)]
gregs_data[:10]


/Users/caporaso/.virtualenvs/skbio023/lib/python2.7/site-packages/pandas/io/parsers.py:1164: DtypeWarning: Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)
Out[4]:
rsid chromosome position genotype
909696 rs2833521 21 33179293 AA
909697 rs2833522 21 33179371 AG
909698 rs2833524 21 33182304 --
909699 rs1012482 21 33182482 AG
909700 rs7275656 21 33186188 TT
909701 rs2833527 21 33186415 AG
909702 rs7283466 21 33193131 GG
909703 rs2833530 21 33200096 AA
909704 rs8133196 21 33206932 CC
909705 rs2833536 21 33212509 TT

Step 3


In [5]:
ref


Out[5]:
<DNASequence: NNNNNNNNNN... (length: 530845)>

In [6]:
# Find out what base is in the reference for one of my SNPs. In this case, 
# I'm heterozygous for A/G at this position, and the reference has a G.
ref[33179371 - start_pos]


Out[6]:
<DNASequence: G (length: 1)>

In [59]:
# Find out how many times I'm the same as the reference.

count_same = 0
count = 0
for e in gregs_data.iterrows():
    count += 1
    if str(ref[e[1].position - start_pos]) in e[1].genotype:
        count_same += 1

print count_same / count, count


0.515555555556 225

Step 4

Building fasta from vcf files

  • trying this with GATK (requires java 1.7.0 or greater) based on the Biostars discussion here.

In [75]:
# First we need to create the index and dict files, as described here:
# https://www.broadinstitute.org/gatk/guide/article?id=2798

reference_dict_fp = '%s.dict' % reference_fasta_fp
!java -jar /Applications/picard-tools-1.135/picard.jar CreateSequenceDictionary R= $reference_fasta_fp O= $reference_dict_fp

reference_idx_fp = '%s.fai' % reference_fasta_fp
!samtools faidx $reference_fasta_fp


[Thu Jul 02 16:36:39 MST 2015] picard.sam.CreateSequenceDictionary REFERENCE=ensembl.txt OUTPUT=ensembl.txt.dict    TRUNCATE_NAMES_AT_WHITESPACE=true NUM_SEQUENCES=2147483647 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Thu Jul 02 16:36:39 MST 2015] Executing as caporaso@egr18.egr.nau.edu on Mac OS X 10.9.5 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_45-b14; Picard version: 1.135(83ec44e03ec8d07ef180ef86b7a62d59a80dedd3_1435607037) JdkDeflater
[Thu Jul 02 16:36:39 MST 2015] picard.sam.CreateSequenceDictionary done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=257425408

In [67]:
# Then, get the vcf file corresponding to this region - I don't think we can change this URL,
# it points to a file that I initiated creation of on the web server.

!wget http://browser.1000genomes.org/tmp/slicer/filtered_21.33177777-33708621.ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
!gunzip filtered_21.33177777-33708621.ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
!head filtered_21.33177777-33708621.ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf


--2015-07-02 16:13:09--  http://browser.1000genomes.org/tmp/slicer/filtered_21.33177777-33708621.ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
Resolving browser.1000genomes.org... 193.62.193.83
Connecting to browser.1000genomes.org|193.62.193.83|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 404118 (395K) [application/gzip]
Saving to: 'filtered_21.33177777-33708621.ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz'

100%[======================================>] 404,118      328KB/s   in 1.2s   

2015-07-02 16:13:11 (328 KB/s) - 'filtered_21.33177777-33708621.ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz' saved [404118/404118]


In [76]:
!java -Xmx2g -jar /Applications/GenomeAnalysisTK.jar -R ensembl.txt -T FastaAlternateReferenceMaker -o MY_REFERENCE_WITH_SNPS_FROM_VCF.fa --variant filtered_21.33177777-33708621.ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf


INFO  16:36:51,460 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  16:36:51,462 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-0-g7e26428, Compiled 2015/05/15 03:25:41 
INFO  16:36:51,462 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  16:36:51,462 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  16:36:51,466 HelpFormatter - Program Args: -R ensembl.txt -T FastaAlternateReferenceMaker -o MY_REFERENCE_WITH_SNPS_FROM_VCF.fa --variant filtered_21.33177777-33708621.ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf 
INFO  16:36:51,470 HelpFormatter - Executing as caporaso@egr18.egr.nau.edu on Mac OS X 10.9.5 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_45-b14. 
INFO  16:36:51,470 HelpFormatter - Date/Time: 2015/07/02 16:36:51 
INFO  16:36:51,470 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  16:36:51,470 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  16:36:51,754 GenomeAnalysisEngine - Strictness is SILENT 
INFO  16:36:51,850 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  16:36:52,212 RMDTrackBuilder - Writing Tribble index to disk for file /Users/caporaso/Dropbox/code/sketchbook/2015.07.02-scipy2015-experiments/filtered_21.33177777-33708621.ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.idx 
INFO  16:36:52,484 GenomeAnalysisEngine - Preparing for traversal 
INFO  16:36:52,485 GenomeAnalysisEngine - Done preparing for traversal 
INFO  16:36:52,486 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  16:36:52,486 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  16:36:52,486 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
INFO  16:36:52,909 ProgressMeter -            done    530845.0     0.0 s       0.0 s      100.0%     0.0 s       0.0 s 
INFO  16:36:52,909 ProgressMeter - Total runtime 0.42 secs, 0.01 min, 0.00 hours 
INFO  16:36:53,582 GATKRunReport - Uploaded run statistics report to AWS S3 

In [9]:
## For some reason, the result is the same as my reference sequence... Is this a problem with how I'm 
## using GATK, or with my VCF file (i.e., there really is no variation from the reference). Also, 
## this is only generating one sequence. I want one per subject... 

## Looking at the vcf file, I see a lot of "0|0" in the first column of subject variant data, 
## so it may really be that there is no variation in this region and I should pick another. Puzzling though
## that I have so much variation if that's the case - am I doing something wrong there??

vcp_snped_ref = skbio.io.read('MY_REFERENCE_WITH_SNPS_FROM_VCF.fa', into=skbio.DNA)
len(vcp_snped_ref), len(ref)
vcp_snped_ref.fraction_diff(ref)


Out[9]:
0.0

Step 5 (not really started yet)


In [8]:
# This is the 1000 genomes sample (i.e., host) metadata (John, do you know where the lookup is for the "pop" field?)

!wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel
!head -n 50 integrated_call_samples_v3.20130502.ALL.panel


--2015-07-02 17:10:41--  ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel
           => 'integrated_call_samples_v3.20130502.ALL.panel.2'
Resolving ftp.1000genomes.ebi.ac.uk... 193.62.192.8
Connecting to ftp.1000genomes.ebi.ac.uk|193.62.192.8|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /vol1/ftp/release/20130502 ... done.
==> SIZE integrated_call_samples_v3.20130502.ALL.panel ... 55156
==> PASV ... done.    ==> RETR integrated_call_samples_v3.20130502.ALL.panel ... done.
Length: 55156 (54K) (unauthoritative)

100%[======================================>] 55,156       154KB/s   in 0.4s   

2015-07-02 17:10:45 (154 KB/s) - 'integrated_call_samples_v3.20130502.ALL.panel.2' saved [55156]

sample	pop	super_pop	gender		
HG00096	GBR	EUR	male
HG00097	GBR	EUR	female
HG00099	GBR	EUR	female
HG00100	GBR	EUR	female
HG00101	GBR	EUR	male
HG00102	GBR	EUR	female
HG00103	GBR	EUR	male
HG00105	GBR	EUR	male
HG00106	GBR	EUR	female
HG00107	GBR	EUR	male
HG00108	GBR	EUR	male
HG00109	GBR	EUR	male
HG00110	GBR	EUR	female
HG00111	GBR	EUR	female
HG00112	GBR	EUR	male
HG00113	GBR	EUR	male
HG00114	GBR	EUR	male
HG00115	GBR	EUR	male
HG00116	GBR	EUR	male
HG00117	GBR	EUR	male
HG00118	GBR	EUR	female
HG00119	GBR	EUR	male
HG00120	GBR	EUR	female
HG00121	GBR	EUR	female
HG00122	GBR	EUR	female
HG00123	GBR	EUR	female
HG00125	GBR	EUR	female
HG00126	GBR	EUR	male
HG00127	GBR	EUR	female
HG00128	GBR	EUR	female
HG00129	GBR	EUR	male
HG00130	GBR	EUR	female
HG00131	GBR	EUR	male
HG00132	GBR	EUR	female
HG00133	GBR	EUR	female
HG00136	GBR	EUR	male
HG00137	GBR	EUR	female
HG00138	GBR	EUR	male
HG00139	GBR	EUR	male
HG00140	GBR	EUR	male
HG00141	GBR	EUR	male
HG00142	GBR	EUR	male
HG00143	GBR	EUR	male
HG00145	GBR	EUR	male
HG00146	GBR	EUR	female
HG00148	GBR	EUR	male
HG00149	GBR	EUR	male
HG00150	GBR	EUR	female
HG00151	GBR	EUR	male

In [ ]: