Me and Bronski's conversation on how to do this
Me: Hey! I want to map nucleotide sequence position after an alignment. I know you have done this before. So I would rather not reinvent the wheel. You did a dictionary in python, but how? Can I see your script? If this feature is embedded in a larger program it might be easier to just explain your strategy.
Bronski: So the strategy is to loop through an aligned sequence and create a dictionary where the keys are the original indices and the values are the indices in the alignment.
Here’s a simple example:
In [10]:
aligned_seq = 'AGC---TTCATCA'
remap_dict = {}
nuc_list = ['A', 'a', 'G', 'g', 'C', 'c', 'T', 't', 'N', 'n']
counter = 0
for xInd, x in enumerate(aligned_seq):
if x in nuc_list:
remap_dict[counter] = xInd
counter += 1
print(remap_dict)
Breakdown of what I have to do:
Use Bio.AlignIO.read()
In [20]:
from Bio import AlignIO
alignment = AlignIO.read("../data/fasta/output_ludwig_eve-striped-2.fa", "fasta")
print(alignment)
In [12]:
for record in alignment:
print(record.id)
Buuuuuuuut, we don't really need the alignment as an alignment per se. But it is important for viewing and testing later. We need to have each seperate sequence, So I am going to use SeqIO.parse.
In [13]:
from Bio import SeqIO
# read in alignment as a list of sequences
records = list(SeqIO.parse("../data/fasta/output_ludwig_eve-striped-2.fa", "fasta"))
In [14]:
# Testing with the first sequence
seqTest = records[0]
#print(seqTest.seq)
print(type(seqTest))
In [15]:
# Turn just the sequence into a string instead of fasta sequence
aligned_seq = str(seqTest.seq)
print(type(aligned_seq)) # check
Notes on loop
enumerate()
: prints out numbers counting upxInd
is the keys that were enumerated. remap_dict[counter] = xInd
makes the dictionary
x is the nucleotide
In [33]:
remap_dict = {}
nuc_list = ['A', 'a', 'G', 'g', 'C', 'c', 'T', 't', 'N', 'n']
counter = 0
for xInd, x in enumerate(aligned_seq):
if x in nuc_list:
remap_dict[counter] = xInd
counter += 1
#checking dictionary created
print(len(remap_dict)) # should be length of alignment
print(remap_dict[40]) #should print the value of the number key
print(type(remap_dict[40])) #Check data type
In [52]:
## Attempt at vector
bcdSites = [0] * len(aligned_seq)
#from loctaingTFB.ipy
TFBS = [10, 102, 137, -741, -680, -595, 309, -497, -485, 429, 453, 459, 465, -376, -347, -339, -308, 593, 600, -289, 613, 623, -240, 679, -128, -77, 825, 826, 886]
#Need to make positive.
TFBS_pos = [abs(k) for k in TFBS]
# Or do I need to subtract from the end position...
# something like this
for i in TFBS:
if i < 0:
len(aligned_seq - i)
else:
i
print(TFBS)
In [ ]:
In [ ]: