Forensics


In [2]:
import skbio

We have samples of bacterial DNA from three people's hands and three people's keyboards. Can we figure out which keyboard belongs to which individual by seeing which keyboards and which hands have the most bacterial DNA sequences in common?

First, we'll download a gzipped sequence file. Let's find out what type of file it is.


In [3]:
fp = "forensic-seqs.fna"

In [4]:
skbio.io.sniff(fp)


Out[4]:
('fasta', {})

OK, it's FASTA. There is a lot of metadata embedded in the sample IDs, though, so it would be cleaner to propagate that information to sequence metadata.


In [5]:
skbio.io.sniff(fp)


Out[5]:
('fasta', {})
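
One way to pull that metadata out of an ID such as `M2.Thumb.R_1` is to split it on the dots. The following is a minimal sketch written for Python 3. It assumes the first field is the subject and the second is the sampled site; the field names and the `parse_id` helper are hypothetical and are not part of scikit-bio.

```python
def parse_id(seq_id):
    """Split a dotted sample ID into labeled fields.

    The field names are assumptions made for illustration; only the
    first field (the subject) is used elsewhere in this notebook.
    """
    fields = seq_id.split(".")
    return {
        "subject": fields[0],
        "site": fields[1] if len(fields) > 1 else None,
        # Anything after the second dot is kept verbatim.
        "rest": ".".join(fields[2:]) or None,
    }


meta = parse_id("M2.Thumb.R_1")
print(meta["subject"], meta["site"], meta["rest"])
```

A dictionary like this could then be attached to each record, so later cells can filter on the subject or site without re-parsing the ID.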

Read the first record to see the sequence repr alongside its ID.


In [6]:
for rec in skbio.io.read(fp, format='fasta'):
    print(rec, rec.id)    
    break


(<BiologicalSequence: CTGGACCGTG... (length: 229)>, 'M2.Thumb.R_1')

Group the sequences by subject, using the first dot-separated field of each record ID as the key.


In [7]:
from collections import defaultdict
data = defaultdict(list)
for rec in skbio.io.read(fp, format='fasta'):
    subject = rec.id.split('.')[0]
    data[subject].append(rec)

Count how many sequences, and then how many unique sequences, we have for each subject.


In [8]:
for k, v in data.items():
    print k, len(v), len(set(v))


K3 9597 4019
K2 6477 1366
K1 6636 1732
M3 3945 1264
M2 3127 1413
M9 4432 1079

In [9]:
for k, v in data.items():
    data[k] = set(v)
    print k, len(data[k])


K3 4019
K2 1366
K1 1732
M3 1264
M2 1413
M9 1079
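
The grouping and de-duplication steps can be tried without scikit-bio on a handful of records. This is a sketch written for Python 3; the `records` list is made-up toy data (plain strings standing in for sequence objects), not taken from `forensic-seqs.fna`.

```python
from collections import defaultdict

# Hypothetical (id, sequence) pairs standing in for the FASTA records.
records = [
    ("M2.Thumb.R_1", "ACGT"),
    ("M2.Thumb.R_2", "ACGT"),  # duplicate sequence for the same subject
    ("M2.Thumb.R_3", "GGCC"),
    ("K3.x_4", "ACGT"),
    ("K3.x_5", "TTAA"),
]

# Group sequences by the first dot-separated field of the ID.
grouped = defaultdict(list)
for seq_id, seq in records:
    subject = seq_id.split(".")[0]
    grouped[subject].append(seq)

# Total and unique counts per subject.
summary = {s: (len(seqs), len(set(seqs))) for s, seqs in grouped.items()}

# Collapse each subject to its set of unique sequences.
unique = {s: set(seqs) for s, seqs in grouped.items()}

# Sequences found in both the hand sample and the keyboard sample.
shared = unique["M2"] & unique["K3"]

for subject in sorted(summary):
    total, distinct = summary[subject]
    print(subject, total, distinct)
```

Collapsing to sets works here because strings are hashable and compare by value; the notebook relies on the sequence objects behaving the same way.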

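For each subject, the keyboard with the largest intersection (the most unique sequences in common) is taken as the best match. In this dataset that pairs M2 with K3, M3 with K1, and M9 with K2.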


In [10]:
# maybe load these data into a distance matrix to show that repr?
for i in ['M2', 'M3', 'M9']:
    for j in ['K1', 'K2', 'K3']:
        print i, j, len(data[i] & data[j])
    print ''


M2 K1 88
M2 K2 77
M2 K3 391

M3 K1 297
M3 K2 214
M3 K3 191

M9 K1 221
M9 K2 268
M9 K3 200
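
Raw overlap counts favor samples with more unique sequences (K3 has 4019, far more than the other keyboards), so it may be worth checking that the matches survive normalization. The sketch below, written for Python 3, computes a Jaccard index (shared sequences divided by the size of the union) from the counts printed above. Using Jaccard here is an assumption about a reasonable normalization, not something the original analysis does.

```python
# Unique-sequence counts and pairwise overlaps copied from the outputs above.
unique_counts = {
    "K1": 1732, "K2": 1366, "K3": 4019,
    "M2": 1413, "M3": 1264, "M9": 1079,
}
shared = {
    ("M2", "K1"): 88, ("M2", "K2"): 77, ("M2", "K3"): 391,
    ("M3", "K1"): 297, ("M3", "K2"): 214, ("M3", "K3"): 191,
    ("M9", "K1"): 221, ("M9", "K2"): 268, ("M9", "K3"): 200,
}


def jaccard(subject, keyboard):
    """Jaccard index: |A & B| / |A | B|, using |A | B| = |A| + |B| - |A & B|."""
    inter = shared[(subject, keyboard)]
    union = unique_counts[subject] + unique_counts[keyboard] - inter
    return inter / union


# Pick the keyboard with the highest Jaccard index for each subject.
best = {}
for subject in ["M2", "M3", "M9"]:
    best[subject] = max(["K1", "K2", "K3"], key=lambda k: jaccard(subject, k))
    print(subject, best[subject], round(jaccard(subject, best[subject]), 3))
```

With these counts the normalized scores select the same pairs as the raw intersections: M2 with K3, M3 with K1, and M9 with K2.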

