Forensics



In [2]:

    
import skbio

We have samples of bacterial DNA from three people's hands and three people's keyboards. Can we figure out which keyboard belongs to which individual by seeing which keyboards and which hands have the most bacterial DNA sequences in common?

First, we'll download a gzipped sequence file. Let's find out what type of file it is...



In [3]:

    
fp = "forensic-seqs.fna"



In [4]:

    
skbio.io.sniff(fp)









    Out[4]:





('fasta', {})

OK, it's FASTA. But a lot of metadata is embedded in the sequence IDs (for example, 'M2.Thumb.R_1'). Let's be a little more sane and propagate that to per-sequence metadata.



In [5]:

    
skbio.io.sniff(fp)









    Out[5]:





('fasta', {})
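The propagation step described above could look something like the following. This is only a minimal sketch in plain Python: parse_seq_id is a hypothetical helper, and the field names (subject, site, read) are assumptions about what the dot-separated parts of an ID such as 'M2.Thumb.R_1' mean. Only the ID itself comes from the data.

```python
def parse_seq_id(seq_id):
    """Split an ID such as 'M2.Thumb.R_1' into labeled fields.

    The field names are assumptions made for this sketch; missing
    fields are returned as None rather than raising an error.
    """
    parts = seq_id.split('.')
    return {
        'subject': parts[0],
        'site': parts[1] if len(parts) > 1 else None,
        'read': parts[2] if len(parts) > 2 else None,
    }


meta = parse_seq_id('M2.Thumb.R_1')
print(meta['subject'], meta['site'], meta['read'])
```

A dictionary like this could then be stored alongside each record as it is read, so that later cells can group by subject or site without re-splitting the ID string.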

Let's look at the first record to show off the sequence repr and its ID...



In [6]:

    
for rec in skbio.io.read(fp, format='fasta'):
    print(rec, rec.id)    
    break









    



(<BiologicalSequence: CTGGACCGTG... (length: 229)>, 'M2.Thumb.R_1')

Group the sequences by subject, using the first dot-separated field of each sequence ID as the key.



In [7]:

    
from collections import defaultdict
data = defaultdict(list)
for rec in skbio.io.read(fp, format='fasta'):
    subject = rec.id.split('.')[0]
    data[subject].append(rec)

Figure out how many sequences we have for each subject, and then how many unique sequences we have for each subject.



In [8]:

    
for k, v in data.items():
    print('%s %d %d' % (k, len(v), len(set(v))))









    



K3 9597 4019
K2 6477 1366
K1 6636 1732
M3 3945 1264
M2 3127 1413
M9 4432 1079



In [9]:

    
for k, v in data.items():
    data[k] = set(v)
    print('%s %d' % (k, len(data[k])))

For each subject, the keyboard with the largest intersection of unique sequences (the most sequences in common) is the best candidate match.



In [10]:

    
# maybe load these data into a distance matrix to show that repr?
for i in ['M2', 'M3', 'M9']:
    for j in ['K1', 'K2', 'K3']:
        print('%s %s %d' % (i, j, len(data[i] & data[j])))
    print('')
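To turn the pairwise counts into an explicit assignment, one option is to pick, for each hand, the keyboard with the largest intersection. The sketch below does that on made-up toy data; best_matches is a hypothetical helper, and the toy sequences are not taken from forensic-seqs.fna.

```python
def best_matches(hands, keyboards):
    """Map each hand to the keyboard sharing the most unique sequences."""
    matches = {}
    for hand, hand_seqs in hands.items():
        # Count the sequences this hand has in common with each keyboard.
        shared = {kb: len(hand_seqs & kb_seqs)
                  for kb, kb_seqs in keyboards.items()}
        # Keep the keyboard with the largest intersection.
        matches[hand] = max(shared, key=shared.get)
    return matches


# Toy data, invented for illustration only.
toy_hands = {'M2': {'AAC', 'ACG', 'TTG'}, 'M3': {'GGA', 'GGC', 'CAT'}}
toy_keyboards = {'K1': {'GGA', 'GGC', 'TTT'}, 'K2': {'AAC', 'ACG', 'CCC'}}
toy_result = best_matches(toy_hands, toy_keyboards)
print(toy_result)
```

With the notebook's data, the same function could be called with the 'M' entries of data as hands and the 'K' entries as keyboards. Note that raw intersection counts can favor subjects with more unique sequences, so dividing by the size of the union (a Jaccard-style score) may be worth trying as well.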


