Solution to this week's problem


In [11]:
def sum_numbers_i(n):
    if n == 0:
        return 0
    else:
        return n + sum_numbers_i(n - 1)

def sum_numbers_ii(n):
    return sum(range(n + 1))

def sum_numbers_iii(n):
    return n * (n + 1) / 2

In [12]:
%timeit sum_numbers_i(500)


10000 loops, best of 3: 107 µs per loop

In [13]:
%timeit sum_numbers_ii(500)


100000 loops, best of 3: 7.63 µs per loop

In [14]:
%timeit sum_numbers_iii(500)


1000000 loops, best of 3: 265 ns per loop

Efficient file handling

Lustre is designed for big read/writes. If you can read the complete file at once, then do so. Repeatedly reading a single file one line at a time can be slow.

Example

I have a file of 1's located in my scratch space. Reading it one line at a time:


In [7]:
def read_ones_ver1():
    """
    Read the file one line at a time . Your allocated RAM better be big enough.
    """
    with open('/lustre/scratch113/teams/barrett/users/dr9/ones.txt') as f:
        for line in f:
            times2 = int(line) * 2

Farm Results

Fri Feb 13 12:17:18: Started on <bc-30-3-08>, Execution Home </nfs/users/nfs_d/dr9>, Execution CWD </nfs/users/nfs_d/dr9>;
Fri Feb 13 12:17:42: Done successfully. The CPU time used is 23.5 seconds.
Max Memory: 11 MB
Average Memory: 11.00 MB

But if I use the read() function, the whole file is read:


In [5]:
def read_ones_ver2():
    """
    Read the whole file at once. Your allocated RAM better be big enough.
    """
    with open('/lustre/scratch113/teams/barrett/users/dr9/ones.txt') as f:
        lines = f.read()

    for line in lines:
        times2 = int(line) * 2

Results

Fri Feb 13 12:16:04: Started on <bc-29-2-15>, Execution Home </nfs/users/nfs_d/dr9>, Execution CWD </nfs/users/nfs_d/dr9>;
Fri Feb 13 12:16:08: Exited with exit code 1. The CPU time used is 2.5 seconds.
Max Memory: 165 MB
Average Memory: 165.00 MB

VCF

Variant Call Format (VCF) files are text based files (though often compressed) that contain variant calls and associated information.

The general structure is:

##Meta-information lines
.
.
.
#A header line
variant call 1
variant call 2
.
.
.
variant call n

Meta-information

These lines contain information that includes:

  • the version of the VCF file
  • what program created the file
  • the format of some of the attributes contained within each variant call

The following is an example of meta-information lines:

##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">

Header line

The header line has the following tab-delimited fields :

CHROM
POS
ID
REF
ALT
QUAL
FILTER
INFO

As well as the standard fields, if genotype data is present in the file, these are followed by a FORMAT column header, then an arbitrary number of sample IDs. For example:

#CHROM   POS   ID   REF ALT    QUAL   FILTER   INFO   FORMAT   NA00001   NA00002   NA00003

Variant calls

The rest of the the file consists of variant calls with data filling in each of the header line fields. As with the header line, each entry is seperated by tabs. Furthermore, some of these entries are contain lists of information seperated by various characters (, and ;). An example of some variants calls:

20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20     17330   .         T      A       3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3
20     1110696 rs6040355 A      G,T     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4
20     1230237 .         T      .       47   PASS   NS=3;DP=13;AA=T                   GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20     1234567 microsat1 GTCT   G,GTACT 50   PASS   NS=3;DP=9;AA=G                    GT:GQ:DP    0/1:35:4       0/2:17:2       1/1:40:3

The examples cited are from the following pages which also contains more in depth information


The problem

You are trying to find the variant call at position 1110696 along with it's ancestral allele in a VCF file.

Solutions

Solution 1: Write a parser to split each line by '\t', count the number of columns until you reach your desired field of interest. If this column is semicolon delimited, split on this character. Search the results until you find datapoint.

Solution 2: Use PyVCF


PyVCF

PyVCF is a convenient package that parses Variant Call Format files into objects for easy data access. For example, we can iterate over all of the lines in this file with the following code:


In [10]:
import vcf

vcf_reader = vcf.Reader(open('/Users/dr9/Developer/PyVCF/vcf/test/example-4.0.vcf', 'r'))

for record in vcf_reader:
	print record


Record(CHROM=20, POS=14370, REF=G, ALT=[A])
Record(CHROM=20, POS=17330, REF=T, ALT=[A])
Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
Record(CHROM=20, POS=1230237, REF=T, ALT=[None])
Record(CHROM=20, POS=1234567, REF=GTCT, ALT=[G, GTACT])

The solution to our problem is therefore:


In [27]:
import vcf

vcf_reader = vcf.Reader(open('/Users/dr9/Developer/PyVCF/vcf/test/example-4.0.vcf', 'r'))

for record in vcf_reader:
    if record.POS == 1110696:
        print record.INFO['AA']


T

Each Record gives us all of the information with that variant call. This object has all sorts of methods and attributes to make our lives easier.

Exercise 1.

Use the dir function lets checkout all of the available attributes for a record.

Hint: The for loop above is not neccessary. We also just use:

rec = vcf_reader.next()

Exercise 2

Calcuate the proprtion of variant calls in /lustre/scratch113/teams/barrett/users/dr9/vcfs/example-4.0.vcf that pass the filters.

We can get a list of dictionaries containing the parsed sample column and record using the sampleattribute:


In [36]:
vcf_reader = vcf.Reader(open('/Users/dr9/Developer/PyVCF/vcf/test/example-4.0.vcf', 'r'))
record = vcf_reader.next()
for sample in record.samples:
    print sample['GT']


0|0
1|0
1/1

The genotypes are represented by Call objects, which have three attributes

  • the corresponding Record site
  • the sample name in sample
  • a dictionary of call data in data

As seen below:


In [37]:
vcf_reader = vcf.Reader(open('/Users/dr9/Developer/PyVCF/vcf/test/example-4.0.vcf', 'r'))
record = vcf_reader.next()
call = record.genotype('NA00001')
print call.site


Record(CHROM=20, POS=14370, REF=G, ALT=[A])