This demo would do the following:

Get some human genome reference data for a specific region of the human genome. This will show off I/O, sniffing, and loading remote gzipped files.
Load Greg's 23andme SNP profile data for the same region.
Determine how Greg's SNP profile compares to the reference genome. This can show off some skbio.sequence API stuff maybe?
Load the 1000 genomes variant data for this region into sequence objects. Build a sklearn classifier to classify population (pop in their metadata file) based on characters present at specific positions.
Apply the classifier to classify Greg's population using his SNP data, which will be a subset of the 1000 genomes data.

[x] reference human assembly build 37 (also known as Annotation Release 104) exported from 1000 Genomes; working below with chr21:33,177,777-33,708,621
[x] Greg's 23andme data relatable to the reference genome
[x] variant data from 1000 genomes as vcf
[x] host metadata from 1000 genomes
[ ] vcf to fasta (in progress, just not working right)
[ ] classifiers

Step 1

Targetting this region of the human genome (a basically random choice): chr21:33,177,777-33,708,621

This URL looks promising for auto-generating, relevant variables are defined below. http://browser.1000genomes.org/Homo_sapiens/Export/Output/Location?db=core;flank3_display=0;flank5_display=0;output=fasta;r=21:33177777-33708621;strand=feature;coding=yes;cdna=yes;peptide=yes;utr3=yes;exon=yes;intron=yes;genomic=hard_masked;utr5=yes;_format=TextGz



In [1]:

    
from __future__ import division

# variables defining the targeted region
chromosome = 21
start_pos = 33177777
end_pos = 33708621



In [2]:

    
# The file comes down as ``ensembl.txt`` - we can sniff that this is fasta, and then load into a sequence. 

## The Ns in this sequence, I believe, are masked repeats or low-complexity regions.
## Confirm that there are not lowercase characters in here...

reference_fasta_fp = "ensembl.txt"
!head $reference_fasta_fp



In [3]:

    
import skbio
ref = skbio.io.read(reference_fasta_fp, into=skbio.DNA)

Step 2 (no skbio here)



In [4]:

    
import pandas as pd
import numpy as np

# this file isn't public b/c I'm not sure that I want it to be. Greg will share the link...
greg_data_fp = "genome_Gregory_Caporaso_Full_20150702133442.txt"
df = pd.read_table(greg_data_fp,
                   comment='#',
                   names=["rsid", "chromosome", "position", "genotype"])
chromosome_in_range = df['chromosome'] == chromosome
position_in_range = np.logical_and(start_pos <= df['position'], end_pos >= df['position'])

gregs_data = df[np.logical_and(chromosome_in_range, position_in_range)]
gregs_data[:10]









    



/Users/caporaso/.virtualenvs/skbio023/lib/python2.7/site-packages/pandas/io/parsers.py:1164: DtypeWarning: Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)






    Out[4]:






  
    
      
      rsid
      chromosome
      position
      genotype
    
  
  
    
      909696
      rs2833521
      21
      33179293
      AA
    
    
      909697
      rs2833522
      21
      33179371
      AG
    
    
      909698
      rs2833524
      21
      33182304
      --
    
    
      909699
      rs1012482
      21
      33182482
      AG
    
    
      909700
      rs7275656
      21
      33186188
      TT
    
    
      909701
      rs2833527
      21
      33186415
      AG
    
    
      909702
      rs7283466
      21
      33193131
      GG
    
    
      909703
      rs2833530
      21
      33200096
      AA
    
    
      909704
      rs8133196
      21
      33206932
      CC
    
    
      909705
      rs2833536
      21
      33212509
      TT

Step 3



In [5]:

    
ref









    Out[5]:





<DNASequence: NNNNNNNNNN... (length: 530845)>



In [6]:

    
# Find out what base is in the reference for one of my SNPs. In this case, 
# I'm heterozygous for A/G at this position, and the reference has a G.
ref[33179371 - start_pos]









    Out[6]:





<DNASequence: G (length: 1)>



In [59]:

    
# Find out how many times I'm the same as the reference.

count_same = 0
count = 0
for e in gregs_data.iterrows():
    count += 1
    if str(ref[e[1].position - start_pos]) in e[1].genotype:
        count_same += 1

print count_same / count, count









    



0.515555555556 225

Step 4

Building fasta from vcf files

trying this with GATK (requires java 1.7.0 or greater) based on the Biostars discussion here.



In [75]:

    
# First we need to create the index and dict files, as described here:
# https://www.broadinstitute.org/gatk/guide/article?id=2798

reference_dict_fp = '%s.dict' % reference_fasta_fp
!java -jar /Applications/picard-tools-1.135/picard.jar CreateSequenceDictionary R= $reference_fasta_fp O= $reference_dict_fp

reference_idx_fp = '%s.fai' % reference_fasta_fp
!samtools faidx $reference_fasta_fp









    



[Thu Jul 02 16:36:39 MST 2015] picard.sam.CreateSequenceDictionary REFERENCE=ensembl.txt OUTPUT=ensembl.txt.dict    TRUNCATE_NAMES_AT_WHITESPACE=true NUM_SEQUENCES=2147483647 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Thu Jul 02 16:36:39 MST 2015] Executing as caporaso@egr18.egr.nau.edu on Mac OS X 10.9.5 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_45-b14; Picard version: 1.135(83ec44e03ec8d07ef180ef86b7a62d59a80dedd3_1435607037) JdkDeflater
[Thu Jul 02 16:36:39 MST 2015] picard.sam.CreateSequenceDictionary done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=257425408



In [67]:

    
# Then, get the vcf file corresponding to this region - I don't think we can change this URL,
# it points to a file that I initiated creation of on the web server.

!wget http://browser.1000genomes.org/tmp/slicer/filtered_21.33177777-33708621.ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
!gunzip filtered_21.33177777-33708621.ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
!head filtered_21.33177777-33708621.ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf









    



--2015-07-02 16:13:09--  http://browser.1000genomes.org/tmp/slicer/filtered_21.33177777-33708621.ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
Resolving browser.1000genomes.org... 193.62.193.83
Connecting to browser.1000genomes.org|193.62.193.83|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 404118 (395K) [application/gzip]
Saving to: 'filtered_21.33177777-33708621.ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz'

100%[======================================>] 404,118      328KB/s   in 1.2s   

2015-07-02 16:13:11 (328 KB/s) - 'filtered_21.33177777-33708621.ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz' saved [404118/404118]



In [76]:

    
!java -Xmx2g -jar /Applications/GenomeAnalysisTK.jar -R ensembl.txt -T FastaAlternateReferenceMaker -o MY_REFERENCE_WITH_SNPS_FROM_VCF.fa --variant filtered_21.33177777-33708621.ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf









    



INFO  16:36:51,460 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  16:36:51,462 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-0-g7e26428, Compiled 2015/05/15 03:25:41 
INFO  16:36:51,462 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  16:36:51,462 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  16:36:51,466 HelpFormatter - Program Args: -R ensembl.txt -T FastaAlternateReferenceMaker -o MY_REFERENCE_WITH_SNPS_FROM_VCF.fa --variant filtered_21.33177777-33708621.ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf 
INFO  16:36:51,470 HelpFormatter - Executing as caporaso@egr18.egr.nau.edu on Mac OS X 10.9.5 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_45-b14. 
INFO  16:36:51,470 HelpFormatter - Date/Time: 2015/07/02 16:36:51 
INFO  16:36:51,470 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  16:36:51,470 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  16:36:51,754 GenomeAnalysisEngine - Strictness is SILENT 
INFO  16:36:51,850 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  16:36:52,212 RMDTrackBuilder - Writing Tribble index to disk for file /Users/caporaso/Dropbox/code/sketchbook/2015.07.02-scipy2015-experiments/filtered_21.33177777-33708621.ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.idx 
INFO  16:36:52,484 GenomeAnalysisEngine - Preparing for traversal 
INFO  16:36:52,485 GenomeAnalysisEngine - Done preparing for traversal 
INFO  16:36:52,486 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  16:36:52,486 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  16:36:52,486 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
INFO  16:36:52,909 ProgressMeter -            done    530845.0     0.0 s       0.0 s      100.0%     0.0 s       0.0 s 
INFO  16:36:52,909 ProgressMeter - Total runtime 0.42 secs, 0.01 min, 0.00 hours 
INFO  16:36:53,582 GATKRunReport - Uploaded run statistics report to AWS S3



In [9]:

    
## For some reason, the result is the same as my reference sequence... Is this a problem with how I'm 
## using GATK, or with my VCF file (i.e., there really is no variation from the reference). Also, 
## this is only generating one sequence. I want one per subject... 

## Looking at the vcf file, I see a lot of "0|0" in the first column of subject variant data, 
## so it may really be that there is no variation in this region and I should pick another. Puzzling though
## that I have so much variation if that's the case - am I doing something wrong there??

vcp_snped_ref = skbio.io.read('MY_REFERENCE_WITH_SNPS_FROM_VCF.fa', into=skbio.DNA)
len(vcp_snped_ref), len(ref)
vcp_snped_ref.fraction_diff(ref)









    Out[9]:





0.0

Step 5 (not really started yet)



In [8]:

    
# This is the 1000 genomes sample (i.e., host) metadata (John, do you know where the lookup is for the "pop" field?)

!wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel
!head -n 50 integrated_call_samples_v3.20130502.ALL.panel









    



--2015-07-02 17:10:41--  ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel
           => 'integrated_call_samples_v3.20130502.ALL.panel.2'
Resolving ftp.1000genomes.ebi.ac.uk... 193.62.192.8
Connecting to ftp.1000genomes.ebi.ac.uk|193.62.192.8|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /vol1/ftp/release/20130502 ... done.
==> SIZE integrated_call_samples_v3.20130502.ALL.panel ... 55156
==> PASV ... done.    ==> RETR integrated_call_samples_v3.20130502.ALL.panel ... done.
Length: 55156 (54K) (unauthoritative)

100%[======================================>] 55,156       154KB/s   in 0.4s   

2015-07-02 17:10:45 (154 KB/s) - 'integrated_call_samples_v3.20130502.ALL.panel.2' saved [55156]

sample	pop	super_pop	gender		
HG00096	GBR	EUR	male
HG00097	GBR	EUR	female
HG00099	GBR	EUR	female
HG00100	GBR	EUR	female
HG00101	GBR	EUR	male
HG00102	GBR	EUR	female
HG00103	GBR	EUR	male
HG00105	GBR	EUR	male
HG00106	GBR	EUR	female
HG00107	GBR	EUR	male
HG00108	GBR	EUR	male
HG00109	GBR	EUR	male
HG00110	GBR	EUR	female
HG00111	GBR	EUR	female
HG00112	GBR	EUR	male
HG00113	GBR	EUR	male
HG00114	GBR	EUR	male
HG00115	GBR	EUR	male
HG00116	GBR	EUR	male
HG00117	GBR	EUR	male
HG00118	GBR	EUR	female
HG00119	GBR	EUR	male
HG00120	GBR	EUR	female
HG00121	GBR	EUR	female
HG00122	GBR	EUR	female
HG00123	GBR	EUR	female
HG00125	GBR	EUR	female
HG00126	GBR	EUR	male
HG00127	GBR	EUR	female
HG00128	GBR	EUR	female
HG00129	GBR	EUR	male
HG00130	GBR	EUR	female
HG00131	GBR	EUR	male
HG00132	GBR	EUR	female
HG00133	GBR	EUR	female
HG00136	GBR	EUR	male
HG00137	GBR	EUR	female
HG00138	GBR	EUR	male
HG00139	GBR	EUR	male
HG00140	GBR	EUR	male
HG00141	GBR	EUR	male
HG00142	GBR	EUR	male
HG00143	GBR	EUR	male
HG00145	GBR	EUR	male
HG00146	GBR	EUR	female
HG00148	GBR	EUR	male
HG00149	GBR	EUR	male
HG00150	GBR	EUR	female
HG00151	GBR	EUR	male



In [ ]:

	rsid	chromosome	position	genotype
909696	rs2833521	21	33179293	AA
909697	rs2833522	21	33179371	AG
909698	rs2833524	21	33182304	--
909699	rs1012482	21	33182482	AG
909700	rs7275656	21	33186188	TT
909701	rs2833527	21	33186415	AG
909702	rs7283466	21	33193131	GG
909703	rs2833530	21	33200096	AA
909704	rs8133196	21	33206932	CC
909705	rs2833536	21	33212509	TT