Me and Bronski's conversation on how to do this
Me: Hey! I want to map nucleotide sequence position after an alignment. I know you have done this before. So I would rather not reinvent the wheel. You did a dictionary in python, but how? Can I see your script? If this feature is embedded in a larger program it might be easier to just explain your strategy.
Bronski: So the strategy is to loop through an aligned sequence and create a dictionary where the keys are the original indices and the values are the indices in the alignment.
Here’s a simple example:
In [71]:
aligned_seq = 'AGC---TTCATCA'
remap_dict = {}
nuc_list = ['A', 'a', 'G', 'g', 'C', 'c', 'T', 't', 'N', 'n']
counter = 0
for xInd, x in enumerate(aligned_seq):
if x in nuc_list:
remap_dict[counter] = xInd
counter += 1
print(remap_dict)
Breakdown of what I have to do:
Use Bio.AlignIO.read()
In [72]:
from Bio import AlignIO
alignment = AlignIO.read("../data/fasta/output_ludwig_eve-striped-2.fa", "fasta")
print(alignment)
In [73]:
for record in alignment:
print(record.id)
Buuuuuuuut, we don't really need the alignment as an alignment per se. But it is important for viewing and testing later. We need to have each seperate sequence, So I am going to use SeqIO.parse.
In [74]:
from Bio import SeqIO
# read in alignment as a list of sequences
records = list(SeqIO.parse("../data/fasta/output_ludwig_eve-striped-2.fa", "fasta"))
In [75]:
# Testing with the first sequence
seqTest = records[0]
#print(seqTest.seq)
print(type(seqTest))
In [76]:
# Turn just the sequence into a string instead of fasta sequence
aligned_seq = str(seqTest.seq)
print(type(aligned_seq)) # check
Notes on loop
enumerate()
: prints out numbers counting upxInd
is the keys that were enumerated. remap_dict[counter] = xInd
makes the dictionary
x is the nucleotide
In [83]:
remap_dict = {}
nuc_list = ['A', 'a', 'G', 'g', 'C', 'c', 'T', 't', 'N', 'n']
counter = 0
for xInd, x in enumerate(aligned_seq):
if x in nuc_list:
remap_dict[counter] = xInd
counter += 1
#checking dictionary created
print(len(remap_dict)) # should be length of alignment
print(remap_dict[40]) #should print the value of the number key
print(type(remap_dict[40])) #Check data type
We need two sequences. One that is not the alignment.
In [78]:
## Attempt at vector
bcdSites = [0] * len(aligned_seq)
#from loctaingTFB.ipy
TFBS = [10, 102, 137, -741, -680, -595, 309, -497, -485, 429, 453, 459, 465, -376, -347, -339, -308, 593, 600, -289, 613, 623, -240, 679, -128, -77, 825, 826, 886]
#Need to make positive. This is not what I need.
#TFBS_pos = [abs(k) for k in TFBS]
print((TFBS))
m = 7
# This is the range of the motif
for pos in TFBS:
print(aligned_seq[pos:pos+m])
Now I need to make a new vector that says if the bicoid site is present or absent on the position. Can the negative positions be used in a query of the dictionary? Likely not.
In [118]:
print(type(TFBS))
print(type(remap_dict))
print(TFBS)
# Okay, the problem is the negative numbers
another_key = [82, 85, 98]
print(len(remap_dict))
# So I need to convert the negative number first.
print([remap_dict[x] for x in another_key])
In [128]:
# Working on converting TFBS negative numbers.
TFBS_2 = []
for x in TFBS:
if x < 0:
TFBS_2.append(905 + x)
else:
TFBS_2.append(x)
print(TFBS_2)