Author: Teotonio Soares de Carvalho
Please contact me at teotonio@msu.edu if you have any suggestions or questions.
In [5]:
%notebook -f py functions_definition.ipynb
execfile("./functions_definition.py")
In [5]:
# If you get the message: "IOError: Please, start an IPython cluster to proceed." that means that you whoud go
# to the home page where the notebooks are listed. Look for the "cluster" tab in that page, click on it.
# Click in the field "# of engines". Insert the number of cluster (use the same number of processors
# that you have, probably 4) and click "start". Wait a few seconds and rerun the cell above.
In [2]:
help(main_visualization)
In [4]:
# Example with more than one reference
main_visualization(reference_seq_ids = ["AB353450", "DQ278583", "AB576777"],
classes = ["Nitrososphaera", "Nitrososphaera sister", "Nitrosotalea"],
aligned_fasta_file = "./data/aligned_not_amb",
out_fasta = "./data/ordered_records",
similarity_threshold = .80, # Be careful with the similarity threshold you use.
composition_name = "./Example.pdf",
deduplicate = True)
In [5]:
# Example with one reference
main_visualization(reference_seq_ids = ["AB353450"],
classes = ["Nitrososphaera"],
aligned_fasta_file = "./data/aligned_not_amb",
out_fasta = "./data/ordered_records",
similarity_threshold = .80,
composition_name = "./Example2.pdf",
deduplicate = True)
The purpose of this step is to evaluate the primer quality along all positions in the alignment. This will be further used to select combinations of positions where primers will be enumerated. The k-mers are then filtered based on criteria defined in the function "screen oligos".
In [6]:
help(screen_oligos)
In [9]:
file_tested_algn = "./data/algn_contigs_Nitrosopumilus_cluster.fasta"
file_tested_not_algn = "./data/unalgn_contigs_Nitrosopumilus_cluster.fasta"
In [10]:
screening = screen_oligos(fasta_file = file_tested_algn,
kmer_sizes = [20, 21, 22, 23, 24, 25, 26, 27, 28],
step = 1,
hairpin_tm_max = 35,
homo_tm_max = 35,
tm_max = 75,
tm_min = 40,
min_occurrence = 50,
no_3_T = False,
no_poly_3_GC = True,
no_poly_run = True,
max_degen = 60,
max_gap_prop = .01,
primer_conc = 200,
dv_conc = 1.5,
mv_conc = 50,
min_diff = 0)
In [11]:
%reload_ext rpy2.ipython
In [12]:
data = screening["data"]
data = data.ix[data.Median_Tm > 0, :].dropna()
In [13]:
data.to_csv("./data/coverage_data.txt")
In [14]:
%%R -i data -w 900 -h 600
library(ggplot2)
library(reshape2)
sub_data = data[, c("Pos", "Coverage", "Richness", "Tm_10", "Tm_90")]
sub_data$cover = ifelse(sub_data$Coverage > .6, ">60%", "<60%")
molt_data = melt(sub_data,
id.vars = c("Pos", "cover", "Tm_10", "Tm_90"))
theme_set(theme_bw(16))
ggplot(molt_data) +
aes(x = Pos, y = value) +
facet_wrap(~variable, scale = "free", ncol = 1) +
scale_x_continuous(breaks = seq(0, 800, 10)) +
scale_colour_manual(values = c("black", "red")) +
geom_point(aes(colour = factor(cover))) +
geom_line(alpha = .4) +
theme(axis.text.x=element_text(angle=90, size = 10)) +
labs(colour = "Coverage", y = "", x = "Position in the Alignment")
Coverage (proportional) and richness (number of unique k-mers) of the n top primers (up to max_degen) for each position in the alignment. Primers are enumerated for each position as all k-mers with k varying from 20 to 28 taken from all sequences in the aligment. These k-mers are then filtered based on criteria specified in the "screen_oligos" function above (min Tm, max Tm, max hairpin Tm, max homo and heterodimers Tm, among others).
In [15]:
%%R -w 1300 -h 500
data$cover = ifelse(data$Coverage > .6, ">60%", "<60%")
ggplot(data) +
aes(x = Pos, y = Median_Tm) +
geom_ribbon(aes(ymax = Tm_90,
ymin = Tm_10),
alpha = .2) +
geom_line(alpha = .5) +
geom_point(aes(colour = cover), size = 1.5) +
scale_colour_manual(values = c("black", "red")) +
scale_x_continuous(breaks = seq(0, 800, 10)) +
scale_y_continuous(breaks = seq(0, 100, 1)) +
theme(axis.text.x=element_text(angle=90, size = 10)) +
labs(colour = "Coverage", y = "Melting Temperature (°C)", x = "Positions")
Predicted melting temperature for the top n primers (up to max_degen). The poings indicates the median melting temperature at each position. The red colour indicates positions where the coverage of the best primers is above 60%. The shaded area representes region between the 10th to the 90th percentile for the melting temperature of the best primers at each position. The information given by this graph helps to choose regions from where compatible primers could be picked.
Based on the last graph, I decided that the region from 90-105 have potentially good primers. Because I want amplicons with approximately 200 bp for compatibility with qPCR and illumina sequencing, I chose the region from 340-370 to pick the reverse primers. The region 270-290 have good coverage, but the melting temperature is too low compared to the region 90-105, that is why I didn't use it.
In [16]:
max_degen = 60
In [17]:
help(enumerate_primers)
In [18]:
fwd_unique, rev_unique, redundancy, pairs, seqs_detected = \
enumerate_primers(target_file_name = file_tested_algn,
hairpin_tm_max = 35,
primer_conc = 200,
homo_tm_max = 30,
kmer_sizes = [20, 21, 22, 23, 24, 25, 26, 27, 28],
tm_min = 55,
tm_max = 60,
fwd_starts = range(90, 105),
rev_starts = range(340, 370),
min_occurrence = 50,
no_3_T = True,
no_poly_3_GC = True,
no_poly_run=True,
max_degen = max_degen)
This function below select the primers as pairs.
I decided to do the removal of heterodimers before the selection. So the selection is about three times slower than before.
In [19]:
best = select_pairs(fwd_dict = fwd_unique,
rev_dict = rev_unique,
redundancy_dict = redundancy,
pairs = pairs,
min_increase=20,
max_degen = max_degen)
When all the best pairs of primers are selected and the max_degen is not reached, more primers are added if they detect perfectly sequences that were not previously detected. They are only selected if they have good quality.
I still did not implement the heterodimer removal whitin the increase_degeneracy function. But I added a function "test_heterodimer" to check it later.
In [20]:
fwds = increase_degeneracy(included_oligos = best["fwd"],
redundancy = redundancy["fwd"],
oligo_series = fwd_unique,
min_increase = 50)
In [21]:
revs = increase_degeneracy(included_oligos = best["rev"],
redundancy = redundancy["rev"],
oligo_series = rev_unique,
min_increase = 50)
In [22]:
len(fwds)
Out[22]:
In [23]:
len(revs)
Out[23]:
Just to make sure that there is no potential heterodimers.
In [24]:
revs_comp = [str(Seq(rev).reverse_complement()) for rev in best["rev"]]
If you find heterodimers here, you will need to remove them manually.
In [25]:
test_heterodimers_post(fwds,
revs_comp,
primer_conc = 200,
mv_conc = 50,
dv_conc = 1.5,
max_ht_tm = 35)
Below I show all the selected oligos and their predicted melting temperature.
In [26]:
tm = map(partial(primer3.calcTm, dna_conc = 200/float(max_degen), dv_conc = 1.5, mv_conc = 50), best["fwd"])
print "Primer \t\t\t\t Tm"
for fwd, tm in zip(best["fwd"], tm):
print "%s\t:%d" % (fwd, tm)
In [27]:
tm = map(partial(primer3.calcTm, dna_conc = 200/float(max_degen),
dv_conc = 1.5, mv_conc = 50), revs_comp)
print "Primer \t\t\t\t Tm"
for rev, tm in zip(revs_comp, tm):
print "%s\t:%d" % (rev, tm)
In [28]:
database = read_fasta(file_tested_not_algn)
database = [(rec.id, str(rec.seq)) for rec in database]
In [29]:
len(database)
Out[29]:
A first evaluation of the potential coverage of the primers is done using regular expressions, which does not take into account the thermodynamics properties of the oligos but only the bases identity. Allowing for one mismatch, the coverage is predicted to be about 96%. The problem of this approach is that it doesn't take into account the impact that different mismatches can have.
In [30]:
search_oligos(fwds = fwds, revs = revs, substitution = 1, database = database)
Allowing two mismatches
In [31]:
search_oligos(fwds = fwds, revs = revs, substitution = 2, database = database)
This section was made with the purpose of reducing the cost of testing all primers as pairs. I will use MFEprimer in the end of the notebook to test the specificity and the coverage of the primers. The problem is that MFEprimer only accepts primers as pairs. Consequently, 60 forward subprimers combined with 60 reverse subprimers would lead to 3,600 pairs that should be tested. To avoid that, I added two long oligos to each end of the target sequence to be used as a dummy forward and reverse primer. In this way, instead of 3,600 independent searches, only 120 are necessary. The results are them combined to assess the performance of the primers as pairs using dataframe operations in pandas.
In [33]:
# This has to be the dealigned sequences file
file_to_test = "./data/unalgn_contigs_Nitrosopumilus_cluster.fasta"
In [42]:
# This will never overwrite an existing mfe directory. You need to delete the folder mfe manually
# if you want to overwrite it.
if not os.path.isdir("./data/mfe"):
os.makedirs("./data/mfe")
records = read_fasta(file_to_test)
for record in records:
record.seq = Seq(("AGTATCCTGGATGGGAGGATGGCAACCACTAAAGCCAGTC" +\
str(record.seq) +\
"CAACAGATTCATACATAAATTATAAGCACCTCATCACTGT"))
write_fasta("./data/mfe/database.genomic", records)
else:
print("If you want to rewrite the folder mfe in ./data/ remove it first and then rerun this cell.")
print("Aborted!")
The step in the below is slow ~5min, but need to be done only once. That is why I commented it.
In [83]:
!./MFEprimer/IndexDb.sh ./data/mfe/database.genomic
In [44]:
fwd_dummy = "AGTATCCTGGATGGGAGGATGGCAACCACTAAAGCCAGTC"
rev_dummy = str(Seq("CAACAGATTCATACATAAATTATAAGCACCTCATCACTGT").reverse_complement())
In [47]:
fwd_primer = fwds
task = run_mfe_parallel(database = "./data/mfe/database.genomic",
output = "./data/out_fwd.txt",
ppc = 30,
min_tm = 47,
fwd_list = fwd_primer,
rev_list = [rev_dummy],
oligo_conc = 200/float(len(fwd_primer)))
task.successful()
Out[47]:
In [48]:
rev_primer = revs_comp
task = run_mfe_parallel(database = "./data/mfe/database.genomic",
output = "./data/out_rev.txt",
ppc = 30,
min_tm = 47,
fwd_list = [fwd_dummy],
rev_list = rev_primer,
oligo_conc = 200/float(len(rev_primer)))
task.successful()
Out[48]:
In [50]:
%reload_ext rpy2.ipython
In [51]:
data = process_data(fwd_data_name = "./data/out_fwd.txt",
rev_data_name = "./data/out_rev.txt",
delta_tm = 10)
In [52]:
data_best_match = data.ix[(data.FpTm > 45) &
(data.RpTm > 45) &
(data.DeltaTm < 8), :]
In [53]:
records_dummy = read_fasta("./data/mfe/database.genomic")
cover = len(data_best_match.HitID.unique())
print "Predicted coverage is %d out of %d." % (cover, len(records_dummy))
In [55]:
%pylab inline
detection_fwd = data_best_match.fwd_primer.value_counts()
detection_fwd = detection_fwd / float(len(records_dummy))
detection_fwd.cumsum().plot(kind = "bar", figsize = (12, 4))
Out[55]:
In [56]:
detection_rev = data_best_match.rev_primer.value_counts()
detection_rev = detection_rev / float(len(records_dummy))
detection_rev.cumsum().plot(kind = "bar", figsize = (12, 4))
Out[56]:
Because of the cost of primer synthesis, I decided to try to eliminate as many primers as possible without decreasing the coverage dramatically.
In [58]:
data_raw = process_data(fwd_data_name = "./data/out_fwd.txt",
rev_data_name = "./data/out_rev.txt",
delta_tm = 10,
return_raw = True) # This is important
In [59]:
data_filtered = remove_redundant_all(data_raw = data_raw,
min_increase = 10)
In [60]:
fwd_degen = list(data_filtered.fwd_primer.unique())
In [61]:
rev_degen_comp = list(data_filtered.rev_primer.unique())
In [62]:
for fwd in fwd_degen:
print fwd
In [63]:
for rev in rev_degen_comp:
print rev
In [64]:
fwd_dummy = "AGTATCCTGGATGGGAGGATGGCAACCACTAAAGCCAGTC"
rev_dummy = str(Seq("CAACAGATTCATACATAAATTATAAGCACCTCATCACTGT").reverse_complement())
In [65]:
fwd_primer = fwd_degen
task = run_mfe_parallel(database = "./data/mfe/database.genomic",
output = "./data/out_red_fwd.txt",
ppc = 30,
min_tm = 50,
fwd_list = fwd_primer,
rev_list = [rev_dummy],
oligo_conc = 200/float(len(fwd_primer)))
task.successful()
Out[65]:
In [66]:
rev_primer = rev_degen_comp
task = run_mfe_parallel(database = "./data/mfe/database.genomic",
output = "./data/out_red_rev.txt",
ppc = 30,
min_tm = 50,
fwd_list = [fwd_dummy],
rev_list = rev_primer,
oligo_conc = 200/float(len(rev_primer)))
task.successful()
Out[66]:
In [67]:
data = process_data(fwd_data_name = "./data/out_red_fwd.txt",
rev_data_name = "./data/out_red_rev.txt",
delta_tm = 10)
In [68]:
data_best_match = data.ix[(data.FpTm > 50) &
(data.RpTm > 50) &
(data.DeltaTm < 8), :]
In [69]:
%pylab inline
detection_fwd = data_best_match.fwd_primer.value_counts()
detection_fwd = detection_fwd / float(len(records_dummy))
detection_fwd.cumsum().plot(kind = "bar", figsize = (12, 4))
Out[69]:
In [70]:
detection_fwd
Out[70]:
In [71]:
detection_rev = data_best_match.rev_primer.value_counts()
detection_rev = detection_rev / float(len(records_dummy))
detection_rev.cumsum().plot(kind = "bar", figsize = (12, 4))
Out[71]:
In [72]:
detection_rev
Out[72]:
In [73]:
calcTm = lambda x: primer3.calcTm(x, mv_conc = 50, dv_conc = 1.5, dna_conc = 200/float(len(fwd_primer)))
p = pandas.DataFrame(map(calcTm, fwd_primer)).sort(inplace = False)
In [74]:
%%R -i p -w 600 -h 300
library(ggplot2)
p$sub = rownames(p)
theme_set(theme_bw())
ggplot(p) +
aes(y = X0 + 1,
x = reorder(sub, X0)) +
geom_point() +
scale_x_discrete(breaks = NULL) +
scale_y_continuous(breaks = seq(30, 100, 1)) +
labs(x = "Sub-primers", y = "Melting Temperature °C", title = "Forward")
In [75]:
calcTm = lambda x: primer3.calcTm(x, mv_conc = 50, dv_conc = 1.5, dna_conc = 200/float(len(rev_primer)))
p = pandas.DataFrame(map(calcTm, rev_primer)).sort(inplace = False)
In [76]:
%%R -i p -w 600 -h 300
library(ggplot2)
p$sub = rownames(p)
theme_set(theme_bw())
ggplot(p) +
aes(y = X0 + 1,
x = reorder(sub, X0)) +
geom_point() +
scale_x_discrete(breaks = NULL) +
scale_y_continuous(breaks = seq(30, 100, 1)) +
labs(x = "Sub-primers", y = "Melting Temperature °C", title = "Reverse")
In [77]:
cover = len(data_best_match.HitID.unique())
print "Predicted coverage is %d out of %d." % (cover, len(records_dummy))
The melting temperature here might be slightly different from the originally predicted for following reasons:
In [78]:
%%R -i data_best_match -w 900 -h 500
library(ggplot2)
library(reshape2)
grouped <- data_best_match
theme_set(theme_bw())
ggplot(grouped) +
geom_abline(intercept = median(grouped$FpTm),
slope = 0,
colour = "red") +
geom_abline(intercept = quantile(grouped$FpTm, .95),
slope = 0,
linetype = 2,
colour = "red") +
geom_abline(intercept = quantile(grouped$FpTm, .05),
slope = 0,
linetype = 2,
colour = "red") +
aes(x = reorder(fwd_primer, FpTm), y = FpTm) +
geom_jitter(alpha = .1) +
scale_y_continuous(breaks = seq(45, 70, .5)) +
labs(x = "Forward Primers", y = "Forward Melting Temperature °C",
alpha = "Unique Hits \n per Assay") +
theme(axis.text.x = element_text(angle=70, hjust = 1))
Melting temperature (y axis) for all forward primers (x axis). Each point represents the best hit for each sequence in the target database. Transparency is used to facilitate the visualization of the density. The dashed lines indicates the 10th and 90th percentiles and the solid line represents the median.
In [79]:
%%R -w 900 -h 500
theme_set(theme_bw())
ggplot(grouped) +
geom_abline(intercept = median(grouped$RpTm),
slope = 0,
colour = "red") +
geom_abline(intercept = quantile(grouped$RpTm, .95),
slope = 0,
linetype = 2,
colour = "red") +
geom_abline(intercept = quantile(grouped$RpTm, .05),
slope = 0,
linetype = 2,
colour = "red") +
aes(x = reorder(rev_primer, RpTm), y = RpTm) +
geom_jitter(alpha = .1) +
scale_y_continuous(breaks = seq(45, 70, .5)) +
labs(x = "Reverse Primers", y = "Reverse Melting Temperature °C",
alpha = "Unique Hits \n per Assay") +
theme(axis.text.x = element_text(angle=70, hjust = 1))
Melting temperature (y axis) for all reverse primers (x axis). Each point represents the best hit for each sequence in the target database. Transparency is used to facilitate the visualization of the density. The dashed lines indicates the 10th and 90th percentiles and the solid line represents the median.
In [80]:
%%R -w 900 -h 500
theme_set(theme_bw())
ggplot(grouped) +
aes(x = reorder(rev_primer, DeltaTm), y = DeltaTm) +
geom_abline(intercept = median(grouped$DeltaTm),
slope = 0,
colour = "red") +
geom_abline(intercept = quantile(grouped$DeltaTm, .95),
slope = 0,
linetype = 2,
colour = "red") +
geom_abline(intercept = quantile(grouped$DeltaTm, .05),
slope = 0,
linetype = 2,
colour = "red") +
geom_jitter(alpha = .1) +
scale_y_continuous(breaks = seq(0, 20, 1)) +
labs(x = "Reverse Primers",
y = "Absolute Difference in Melting Temperature °C",
alpha = "Unique Hits \n per Assay") +
theme(axis.text.x = element_text(angle=70, hjust = 1))
Absolute melting temperature difference between forward and reverse primers from the best hit for each sequence in the target database. The dashed lines indicates the 10th and 90th percentiles and the solid line represents the median.
In [81]:
%%R -w 900 -h 500
theme_set(theme_bw())
ggplot(grouped) +
aes(x = reorder(rev_primer, AmpLen), y = AmpLen) +
geom_abline(intercept = median(grouped$AmpLen),
slope = 0,
colour = "red") +
geom_abline(intercept = quantile(grouped$AmpLen, .95),
slope = 0,
linetype = 2,
colour = "red") +
geom_abline(intercept = quantile(grouped$AmpLen, .05),
slope = 0,
linetype = 2,
colour = "red") +
geom_jitter(alpha = .1) +
scale_y_continuous(breaks = seq(0, 500, 5)) +
labs(x = "Reverse Primers", y = "Amplicon length") +
theme(axis.text.x = element_text(angle=70, hjust = 1))
Amplicon size for the best hit for each sequence in the target database. The dashed lines indicates the 10th and 90th percentiles and the solid line represents the median.
In [82]:
with open("./data/fwd_primers.txt", "w") as handle:
for pos, fwd in enumerate(fwd_degen):
handle.write("fwd_%d\t%s\n" % (pos, fwd))
with open("./data/rev_primers.txt", "w") as handle:
for pos, rev in enumerate(rev_degen_comp):
handle.write("rev_%d\t%s\n" % (pos, rev))