In [11]:
def sum_numbers_i(n):
if n == 0:
return 0
else:
return n + sum_numbers_i(n - 1)
def sum_numbers_ii(n):
return sum(range(n + 1))
def sum_numbers_iii(n):
return n * (n + 1) / 2
In [12]:
%timeit sum_numbers_i(500)
In [13]:
%timeit sum_numbers_ii(500)
In [14]:
%timeit sum_numbers_iii(500)
In [7]:
def read_ones_ver1():
"""
Read the file one line at a time . Your allocated RAM better be big enough.
"""
with open('/lustre/scratch113/teams/barrett/users/dr9/ones.txt') as f:
for line in f:
times2 = int(line) * 2
Fri Feb 13 12:17:18: Started on <bc-30-3-08>, Execution Home </nfs/users/nfs_d/dr9>, Execution CWD </nfs/users/nfs_d/dr9>;
Fri Feb 13 12:17:42: Done successfully. The CPU time used is 23.5 seconds.
Max Memory: 11 MB
Average Memory: 11.00 MB
But if I use the read()
function, the whole file is read:
In [5]:
def read_ones_ver2():
"""
Read the whole file at once. Your allocated RAM better be big enough.
"""
with open('/lustre/scratch113/teams/barrett/users/dr9/ones.txt') as f:
lines = f.read()
for line in lines:
times2 = int(line) * 2
Fri Feb 13 12:16:04: Started on <bc-29-2-15>, Execution Home </nfs/users/nfs_d/dr9>, Execution CWD </nfs/users/nfs_d/dr9>;
Fri Feb 13 12:16:08: Exited with exit code 1. The CPU time used is 2.5 seconds.
Max Memory: 165 MB
Average Memory: 165.00 MB
Variant Call Format (VCF) files are text based files (though often compressed) that contain variant calls and associated information.
The general structure is:
##Meta-information lines
.
.
.
#A header line
variant call 1
variant call 2
.
.
.
variant call n
These lines contain information that includes:
The following is an example of meta-information lines:
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
The header line has the following tab-delimited fields :
CHROM
POS
ID
REF
ALT
QUAL
FILTER
INFO
As well as the standard fields, if genotype data is present in the file, these are followed by a FORMAT column header, then an arbitrary number of sample IDs. For example:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
The rest of the the file consists of variant calls with data filling in each of the header line fields. As with the header line, each entry is seperated by tabs. Furthermore, some of these entries are contain lists of information seperated by various characters (, and ;). An example of some variants calls:
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
The examples cited are from the following pages which also contains more in depth information
You are trying to find the variant call at position 1110696 along with it's ancestral allele in a VCF file.
Solution 1: Write a parser to split each line by '\t', count the number of columns until you reach your desired field of interest. If this column is semicolon delimited, split on this character. Search the results until you find datapoint.
Solution 2: Use PyVCF
PyVCF is a convenient package that parses Variant Call Format files into objects for easy data access. For example, we can iterate over all of the lines in this file with the following code:
In [10]:
import vcf
vcf_reader = vcf.Reader(open('/Users/dr9/Developer/PyVCF/vcf/test/example-4.0.vcf', 'r'))
for record in vcf_reader:
print record
The solution to our problem is therefore:
In [27]:
import vcf
vcf_reader = vcf.Reader(open('/Users/dr9/Developer/PyVCF/vcf/test/example-4.0.vcf', 'r'))
for record in vcf_reader:
if record.POS == 1110696:
print record.INFO['AA']
Each Record
gives us all of the information with that variant call. This object has all sorts of methods and attributes to make our lives easier.
Use the dir
function lets checkout all of the available attributes for a record.
Hint:
The for
loop above is not neccessary. We also just use:
rec = vcf_reader.next()
We can get a list of dictionaries containing the parsed sample column and record using the sample
attribute:
In [36]:
vcf_reader = vcf.Reader(open('/Users/dr9/Developer/PyVCF/vcf/test/example-4.0.vcf', 'r'))
record = vcf_reader.next()
for sample in record.samples:
print sample['GT']
The genotypes are represented by Call
objects, which have three attributes
As seen below:
In [37]:
vcf_reader = vcf.Reader(open('/Users/dr9/Developer/PyVCF/vcf/test/example-4.0.vcf', 'r'))
record = vcf_reader.next()
call = record.genotype('NA00001')
print call.site