In [2]:
import skbio
We have samples of bacterial DNA from three people's hands and three people's keyboards. Can we figure out which keyboard belongs to which individual by seeing which keyboards and which hands have the most bacterial DNA sequences in common?
First, we'll download a gzipped sequence file. Let's find out what type of file it is...
In [3]:
fp = "forensic-seqs.fna"
In [4]:
skbio.io.sniff(fp)
Out[4]:
Ok, it's fasta. But there is a lot of metadata embedded in the sample ids. Let's be a little more sane, and propagate that to sequence metadata.
In [5]:
skbio.io.sniff(fp)
Out[5]:
Do some fancy sequence exploration to show off the repr...
In [6]:
for rec in skbio.io.read(fp, format='fasta'):
    # Show only the first record, then stop
    print(rec, rec.id)
    break
Group the sequences on a per-subject basis, keyed on the part of each sequence id before the first '.'
In [7]:
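from collections import defaultdict

# Map each subject (e.g. 'M2', 'K1') to the list of its sequence records
data = defaultdict(list)
for rec in skbio.io.read(fp, format='fasta'):
    # The subject is the part of the id before the first '.'
    subject = rec.id.split('.')[0]
    data[subject].append(rec)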
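The grouping step can be tried without the sequence file. The following is a minimal sketch on plain strings; the ids in `example_ids` are hypothetical and only assume the "subject before the first dot" pattern used above, not the real id layout of the file.

```python
from collections import defaultdict

# Hypothetical ids, assumed to look like "<subject>.<anything else>"
example_ids = ["M2.a.1", "M2.a.2", "K1.b.1", "M3.c.1"]

groups = defaultdict(list)
for seq_id in example_ids:
    # Everything before the first '.' is treated as the subject
    subject = seq_id.split('.')[0]
    groups[subject].append(seq_id)

for subject in sorted(groups):
    print(subject, len(groups[subject]))
```

Using `defaultdict(list)` avoids checking whether a subject has been seen before appending to its list.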
Figure out how many sequences we have for each subject, and then how many unique sequences we have for each subject
In [8]:
for k, v in data.items():
    # subject, total sequences, unique sequences
    print(k, len(v), len(set(v)))
In [9]:
for k, v in data.items():
    # Keep only the unique sequences for each subject
    data[k] = set(v)
    print(k, len(data[k]))
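As a small illustration of the total versus unique counts, here is a sketch that uses plain strings in place of sequence objects. The reads in `toy_reads` are made up for the example; with real records, deduplicating through `set` additionally assumes the sequence objects are hashable and compare equal when their contents match.

```python
# Made-up reads per subject; plain strings stand in for sequence records
toy_reads = {
    "M2": ["ACGT", "ACGT", "TTGA"],
    "K1": ["ACGT", "GGCC", "GGCC", "GGCC"],
}

# (total, unique) counts per subject
counts = {subject: (len(reads), len(set(reads)))
          for subject, reads in toy_reads.items()}

for subject in sorted(counts):
    total, unique = counts[subject]
    print(subject, total, unique)
```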
The largest intersection of sequences (the sequences shared by a hand and a keyboard) indicates the best match of subject to keyboard.
In [10]:
# maybe load these data into a distance matrix to show that repr?
for i in ['M2', 'M3', 'M9']:
    for j in ['K1', 'K2', 'K3']:
        # Number of unique sequences shared by hand i and keyboard j
        print(i, j, len(data[i] & data[j]))
    # Blank line between subjects
    print()
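One way to turn the shared-sequence counts into an explicit assignment is to pick, for each hand, the keyboard with the largest intersection. This is a sketch on made-up sets, not the notebook's data; `best_match`, `toy_hands`, and `toy_keyboards` are hypothetical names introduced for illustration.

```python
# Made-up unique sequences per hand and per keyboard
toy_hands = {
    "M2": {"ACGT", "TTGA", "CCAA"},
    "M3": {"GGCC", "ATAT"},
    "M9": {"TGCA", "AACC", "GTGT"},
}
toy_keyboards = {
    "K1": {"GGCC", "ATAT", "CGCG"},
    "K2": {"ACGT", "TTGA", "TATA"},
    "K3": {"TGCA", "GAGA"},
}

def best_match(hand_seqs, keyboards):
    """Return the keyboard id sharing the most sequences with hand_seqs.

    Ties go to the keyboard id that sorts first.
    """
    return max(sorted(keyboards),
               key=lambda k: len(hand_seqs & keyboards[k]))

matches = {hand: best_match(seqs, toy_keyboards)
           for hand, seqs in toy_hands.items()}

for hand in sorted(matches):
    print(hand, matches[hand])
```

Iterating over `sorted(keyboards)` makes the tie-breaking deterministic, since `max` returns the first maximal element it encounters.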