Design of Primers for Archaea amoA Nitrosopumilus Cluster

Author: Teotonio Soares de Carvalho

Please contact me at teotonio@msu.edu if you have any suggestions or questions.

Load Functions



In [5]:

    
%notebook -f py functions_definition.ipynb
execfile("./functions_definition.py")









    



importing partial from functools on engine(s)
importing numpy on engine(s)
importing primer3 on engine(s)
importing SeqIO,Seq from Bio on engine(s)
importing Template from string on engine(s)
importing regex on engine(s)
importing os on engine(s)
importing Seq from Bio.Seq on engine(s)
importing pandas on engine(s)
importing combinations,izip,product,permutations from itertools on engine(s)



In [5]:

    
# If you get the message: "IOError: Please, start an IPython cluster to proceed." that means that you whoud go 
# to the home page where the notebooks are listed. Look for the "cluster" tab in that page, click on it. 
# Click in the field "# of engines". Insert the number of cluster (use the same number of processors 
# that you have, probably 4) and click "start". Wait a few seconds and rerun the cell above.

Make Visualization tools



In [2]:

    
help(main_visualization)









    



Help on function main_visualization in module __main__:

main_visualization(reference_seq_ids, aligned_fasta_file, out_fasta, similarity_threshold, composition_name, classes=None, deduplicate=True)
    This function reorders the provided aligned sequences in a fasta file according to their 
    similarity to a reference sequence in the file. It also plot the base composition for the 
    resulting ordered fasta file.
    Arguments:
        -reference_seq_ids: must be a list with one or more references. Note the reference is the 
                            first "word" that follows tha > sign in a fasta file. For example, a
                            record where the sequence description is as follow:
                            >Bradyrhizobium_3 japonicum
                            The reference for this sequence would be "Bradyrhizobium_3".
                            Also note that this argument must be a list, meaning that the reference(s)
                            must be enclosed by square brackets [] as in the example.
        -aligned_fasta_file: a string giving the name (including the path) of the aligned fasta file
                             to be ordered. This function takes only one fasta file at a time.
        -out_fasta: the prefix of the name of the output file. The reference sequence ID will be added
                    to this prefix to make the name of the file. Make sure you include the path.
        -similarity_threshold: them minimum similarity to the reference. Sequences will be discarded 
                               if they are less similar then the threshold. The distance used is the
                               Hamming distance proportional to sequence size. Pairwise deletion is 
                               used to deal with missing data.
        -composition_name: the name of the pdf file with the composition plot, including the path.
        -classes: an optional argument in case you want to give a meaningful title for each composition
                  plot instead of the reference_id. Note that it must be a list and the order of the 
                  elements is correspondent to the order of the elements in reference_seq_ids.
        -deduplicate: a boolean (True or False) argument indicating whether or not the fasta file
                      should be deduplicated.



In [4]:

    
# Example with more than one reference
main_visualization(reference_seq_ids = ["AB353450", "DQ278583", "AB576777"],
                   classes = ["Nitrososphaera", "Nitrososphaera sister", "Nitrosotalea"],
                   aligned_fasta_file = "./data/aligned_not_amb",
                   out_fasta = "./data/ordered_records",
                   similarity_threshold = .80, # Be careful with the similarity threshold you use.
                   composition_name = "./Example.pdf",
                   deduplicate = True)









    



Removing duplicated sequences.
cd-hit-est -i ./data/ordered/unaligned.fasta -o ./data/ordered/unaligned_not_amb.fasta -c 1 -n 10 -T 0 -M 0
Calculating composition.
Calculating composition.
Calculating composition.
Done!
The pdf file with compositions was saved as: ./Example.pdf.
The ordered fasta file(s) were/was saved as follows:
./data/ordered_recordsAB353450
./data/ordered_recordsDQ278583
./data/ordered_recordsAB576777



In [5]:

    
# Example with one reference
main_visualization(reference_seq_ids = ["AB353450"],
                   classes = ["Nitrososphaera"],
                   aligned_fasta_file = "./data/aligned_not_amb",
                   out_fasta = "./data/ordered_records",
                   similarity_threshold = .80,
                   composition_name = "./Example2.pdf",
                   deduplicate = True)









    



Removing duplicated sequences.
cd-hit-est -i ./data/ordered/unaligned.fasta -o ./data/ordered/unaligned_not_amb.fasta -c 1 -n 10 -T 0 -M 0
Calculating composition.
Done!
The pdf file with compositions was saved as: ./Example2.pdf.
The ordered fasta file(s) were/was saved as follows:
./data/ordered_recordsAB353450

Make screening of primers along all positions in the aligment

The purpose of this step is to evaluate the primer quality along all positions in the alignment. This will be further used to select combinations of positions where primers will be enumerated. The k-mers are then filtered based on criteria defined in the function "screen oligos".



In [6]:

    
help(screen_oligos)









    



Help on function screen_oligos in module __main__:

screen_oligos(fasta_file, kmer_sizes, hairpin_tm_max=35, homo_tm_max=35, tm_max=65, tm_min=50, min_occurrence=5, no_3_T=True, no_poly_3_GC=True, max_degen=60, no_poly_run=True, step=3, primer_conc=200, max_gap_prop=0.1, min_diff=0, mv_conc=50, dv_conc=1.5, look_ahead=50)
    Apply the function enumerate_kmers_screen in parallel for enumerating all possible oligos that match some criteria
    specified in its arguments along a given alignment. This is used for screening positions
    in the alignment that might be useful for designing primers.
    Arguments:
        -fasta_file: a string given the name of the fasta file with aligned sequences to be used for enumeration of oligos;
        -kmer_sizes: a list of integers with the desired size of oligos;
        -hairpin_tm_max: maximal hairpin melting temperature allowed (in Celsius degrees);
        -homo_tm_max: maximal homodimer melting temperature allowed (in Celsius degrees);
        -tm_max: maximal melting temperature allowed for a oligo (in Celsius degrees);
        -tm_min: minimal melting temperature allowed for a oligo (in Celsius degrees);
        -min_occurrence: integer. Minimal allowed occurrence of a oligo along all sequences;
        -no_3_T: boolean. Should oligos with a T in the 3' end be discarded?
        -no_poly_3_GC: boolean. Should oligos with three G's of C's in the 3' end be discarded?
        -max_degen: the maximal number of subprimers desired. This will also be used to rescale the oligo concentration.
        -no_poly_run: boolean. Should oligos with four or more runs of the same bases be discarded?
        -step: distance between positions in the aligment from which primers should be enumerated. A step of 1 implies that
               all positions will be used.
        -primer_conc: total primer concentration in nM. This concentration will be rescaled automatically by the max_degen.
        -max_gap_prop: float. Maximal proportion of gaps allowed for any position where oligos will be enumerated.
        -min_diff: minimal difference of sequences detected for a oligo to be considered redundant. This parameter should be
                   kept at its default value unless you have strong reasons to change it.
        -mv_conc: monovalent ions concentration in mM.
        -dv_conc: divalent ions conentration in mM.



In [9]:

    
file_tested_algn = "./data/algn_contigs_Nitrosopumilus_cluster.fasta"
file_tested_not_algn = "./data/unalgn_contigs_Nitrosopumilus_cluster.fasta"



In [10]:

    
screening = screen_oligos(fasta_file = file_tested_algn,
                        kmer_sizes = [20, 21, 22, 23, 24, 25, 26, 27, 28], 
                        step = 1, 
                        hairpin_tm_max = 35, 
                        homo_tm_max = 35,
                        tm_max = 75, 
                        tm_min = 40, 
                        min_occurrence = 50, 
                        no_3_T = False, 
                        no_poly_3_GC = True, 
                        no_poly_run = True,
                        max_degen = 60,
                        max_gap_prop = .01, 
                        primer_conc = 200,
                        dv_conc = 1.5,
                        mv_conc = 50,
                        min_diff = 0)









    



  29/ 29 tasks finished after  108 s
done

Evaluate primers quality along the alignment



In [11]:

    
%reload_ext rpy2.ipython



In [12]:

    
data = screening["data"]
data = data.ix[data.Median_Tm > 0, :].dropna()



In [13]:

    
data.to_csv("./data/coverage_data.txt")



In [14]:

    
%%R -i data -w 900 -h 600
library(ggplot2)
library(reshape2)
sub_data = data[, c("Pos", "Coverage", "Richness", "Tm_10", "Tm_90")]
sub_data$cover = ifelse(sub_data$Coverage > .6, ">60%", "<60%")
molt_data = melt(sub_data, 
                id.vars = c("Pos", "cover", "Tm_10", "Tm_90")) 
theme_set(theme_bw(16))
ggplot(molt_data) + 
    aes(x = Pos, y = value) +
    facet_wrap(~variable, scale = "free", ncol = 1) + 
    scale_x_continuous(breaks = seq(0, 800, 10)) +
    scale_colour_manual(values = c("black", "red")) + 
    geom_point(aes(colour = factor(cover))) +
    geom_line(alpha = .4) +
    theme(axis.text.x=element_text(angle=90, size = 10)) +
    labs(colour = "Coverage", y = "", x = "Position in the Alignment")

Coverage (proportional) and richness (number of unique k-mers) of the n top primers (up to max_degen) for each position in the alignment. Primers are enumerated for each position as all k-mers with k varying from 20 to 28 taken from all sequences in the aligment. These k-mers are then filtered based on criteria specified in the "screen_oligos" function above (min Tm, max Tm, max hairpin Tm, max homo and heterodimers Tm, among others).



In [15]:

    
%%R -w 1300 -h 500
data$cover = ifelse(data$Coverage > .6, ">60%", "<60%")
ggplot(data) +
    aes(x = Pos, y = Median_Tm) +
    geom_ribbon(aes(ymax = Tm_90, 
                    ymin = Tm_10),
               alpha = .2) +
    geom_line(alpha = .5) +
    geom_point(aes(colour = cover), size = 1.5) +
    scale_colour_manual(values = c("black", "red")) +
    scale_x_continuous(breaks = seq(0, 800, 10)) +
    scale_y_continuous(breaks = seq(0, 100, 1)) +
    theme(axis.text.x=element_text(angle=90, size = 10)) +
    labs(colour = "Coverage", y = "Melting Temperature (°C)", x = "Positions")

Predicted melting temperature for the top n primers (up to max_degen). The poings indicates the median melting temperature at each position. The red colour indicates positions where the coverage of the best primers is above 60%. The shaded area representes region between the 10th to the 90th percentile for the melting temperature of the best primers at each position. The information given by this graph helps to choose regions from where compatible primers could be picked.

Enumerate foward and reverses primers from a specified region of the aligment

Based on the last graph, I decided that the region from 90-105 have potentially good primers. Because I want amplicons with approximately 200 bp for compatibility with qPCR and illumina sequencing, I chose the region from 340-370 to pick the reverse primers. The region 270-290 have good coverage, but the melting temperature is too low compared to the region 90-105, that is why I didn't use it.



In [16]:

    
max_degen = 60



In [17]:

    
help(enumerate_primers)









    



Help on function enumerate_primers in module __main__:

enumerate_primers(target_file_name, fwd_starts, rev_starts, hairpin_tm_max=30, primer_conc=200, homo_tm_max=30, kmer_sizes=[18, 19, 20, 21, 23, 24, 25, 26, 27, 28], tm_min=55, tm_max=60, min_occurrence=10, no_3_T=True, no_poly_3_GC=True, no_poly_run=True, max_degen=60, mv_conc=50, dv_conc=1.5, look_ahead=50)
    Enumerates forward and reverse primers from two regions of an alignment and filters them according to user-defined criteria.
    Arguments:
        -target_file_name: a string given the name of the fasta file with aligned sequences to be used for enumeration of oligos;
        -fwd_starts: a list of integers giving the starting positions for enumerating forward primers;
        -rev_starts: a list of integers giving the starting positions for enumerating reverse primers;
        -hairpin_max_tm: maximal hairpin melting temperature allowed (in Celsius degrees);
        -primer_conc: total primer concentration in nM. This concentration will be rescaled automatically by the max_degen.
        -homo_tm_max: maximal homodimer melting temperature allowed (in Celsius degrees);
        -kmer_sizes: a list of integers with the desired size of oligos;
        -tm_min: minimal melting temperature allowed for a oligo (in Celsius degrees);
        -tm_max: maximal melting temperature allowed for a oligo (in Celsius degrees);
        -min_occurrence: integer. Minimal allowed occurrence of a oligo along all sequences;
        -no_3_T: boolean. Should oligos with a T in the 3' end be discarded?
        -no_poly_3_GC: boolean. Should oligos with three G's of C's in the 3' end be discarded?
        -max_degen: the maximal number of subprimers desired. This will also be used to rescale the oligo concentration.
        -no_poly_run: boolean. Should oligos with four or more runs of the same bases be discarded?
        -step: distance between positions in the aligment from which primers should be enumerated. A step of 1 implies that
               all positions will be used.
        -mv_conc: monovalent ions concentration in mM.
        -dv_conc: divalent ions concentration in mM.



In [18]:

    
fwd_unique, rev_unique, redundancy, pairs, seqs_detected = \
enumerate_primers(target_file_name = file_tested_algn,
                hairpin_tm_max = 35,
                primer_conc = 200,
                homo_tm_max = 30,
                kmer_sizes = [20, 21, 22, 23, 24, 25, 26, 27, 28],
                tm_min = 55,
                tm_max = 60,
                fwd_starts = range(90, 105),
                rev_starts = range(340, 370),
                min_occurrence = 50,
                no_3_T = True,
                no_poly_3_GC = True,
                no_poly_run=True,
                max_degen = max_degen)









    



done
Coverage of all fwd: 6647 
Coverage of all rev: 6750 
Joint coverage of fwd and rev: 6377
Max possible coverage: 6377
Number of Foward Oligos: 259
Number of Reverse Oligos: 582

This function below select the primers as pairs.

I decided to do the removal of heterodimers before the selection. So the selection is about three times slower than before.



In [19]:

    
best = select_pairs(fwd_dict = fwd_unique, 
                    rev_dict = rev_unique, 
                    redundancy_dict = redundancy, 
                    pairs = pairs, 
                    min_increase=20, 
                    max_degen = max_degen)









    



Total Coverage: 5635
Forward Degeneration: 30
Reverse Degeneration: 30
Remaing pairs: 9

When all the best pairs of primers are selected and the max_degen is not reached, more primers are added if they detect perfectly sequences that were not previously detected. They are only selected if they have good quality.

I still did not implement the heterodimer removal whitin the increase_degeneracy function. But I added a function "test_heterodimer" to check it later.



In [20]:

    
fwds = increase_degeneracy(included_oligos = best["fwd"],
                           redundancy = redundancy["fwd"],
                           oligo_series = fwd_unique,
                           min_increase = 50)



In [21]:

    
revs = increase_degeneracy(included_oligos = best["rev"],
                           redundancy = redundancy["rev"],
                           oligo_series = rev_unique,
                           min_increase = 50)



In [22]:

    
len(fwds)









    Out[22]:





30



In [23]:

    
len(revs)









    Out[23]:





32

Just to make sure that there is no potential heterodimers.



In [24]:

    
revs_comp = [str(Seq(rev).reverse_complement()) for rev in best["rev"]]

If you find heterodimers here, you will need to remove them manually.



In [25]:

    
test_heterodimers_post(fwds,
                       revs_comp,
                       primer_conc = 200, 
                       mv_conc = 50, 
                       dv_conc = 1.5,
                       max_ht_tm = 35)









    



No Heterodimers found!

Below I show all the selected oligos and their predicted melting temperature.



In [26]:

    
tm = map(partial(primer3.calcTm, dna_conc = 200/float(max_degen), dv_conc = 1.5, mv_conc = 50), best["fwd"])
print "Primer \t\t\t\t Tm"
for fwd, tm in zip(best["fwd"], tm):
    print "%s\t:%d" % (fwd, tm)









    



Primer 				 Tm
TTGGGCCTGGACATCGTTTG	:56
ACTGACTGGGCTTGGACTTC	:55
TTCTATACTGACTGGGCTTGGAC	:55
TACACTGACTGGGCATGGAC	:55
ACTGATTGGGCCTGGACATC	:55
TTCTACACAGATTGGGCTTGGAC	:56
GACTGGGCCTGGACTTCGTTTG	:59
TTCTACACTGATTGGGCTTGGAC	:56
CTTCTATACTGATTGGGCCTGGAC	:56
TTCTATACTGATTGGGCCTGGAC	:55
TTCTACACTGATTGGGCCTGG	:55
CCGATTGGGCCTGGACTTCG	:59
TACACTGACTGGGCTTGGAC	:55
CAGATTGGGCCTGGACATCA	:55
CAGACTGGGCCTGGACATCA	:57
GACTGGGCTTGGACTTCGTTC	:57
ACAGACTGGGCTTGGACATC	:55
ATCTTCTACACTGATTGGGCATGG	:57
ATCTTCTATACTGACTGGGCTTGG	:55
TTCTACACTGACTGGGCCTGGA	:58
CCGATTGGGCTTGGACATCC	:56
TTCTACACTGATTGGGCATGGAC	:56
TCTTCTATACTGATTGGGCCTGG	:55
TCTACACTGACTGGGCATGGA	:56
ACTGATTGGGCTTGGACATCA	:55
CTGACTGGGCTTGGACTTCG	:56
TTTCTATACTGACTGGGCTTGGA	:55
CAGATTGGGCCTGGACATCC	:56
CTTCTACACTGATTGGGCTTGGACA	:59
TTCTACACAGACTGGGCTTGGAC	:58



In [27]:

    
tm = map(partial(primer3.calcTm, dna_conc = 200/float(max_degen), 
                 dv_conc = 1.5, mv_conc = 50), revs_comp)
print "Primer \t\t\t\t Tm"
for rev, tm in zip(revs_comp, tm):
    print "%s\t:%d" % (rev, tm)









    



Primer 				 Tm
TGCTTATTCTTCTTTGTTGCCCA	:55
AACAGTATCAAGGAGTGCTTGTTC	:55
GCTTGTTCTTCTTTGTTGCCCA	:56
ATCAGTGAGTGCTTGTTCTTCTTTG	:56
AGTATCAAGGAGTGCTTATTCTTCTTTG	:56
TGTTTGTTCTTCTTTGTAGCCCAATA	:56
TTGTTCTTCTTTGTCGCCCAATA	:55
TCTTTGTCGCCCAATATGCCA	:56
TTTCTCTTTGTTGCCCAATATGCTA	:55
TCCGCCAAAGAATATCAGCGAGTG	:59
CCGCCAAACAGTATCAACGAGTG	:58
TGCTTGTTCTTCTTTGTAGCCCA	:56
CGCCAAACAGTATCAGCGAGTG	:58
GCCAAACAGTATCAACGAGTGCTTG	:59
CCGCCAAACAGTATCAGAGAGTG	:57
TCTTTGTAGCCCAATACGCCA	:55
TGCTTGTTCTTCTTTGTAGCCCAATA	:57
CCACCAAAGAATATCAGCGAGTG	:55
TTCTTCTTTGTAGCCCAATAAGCCA	:57
ATCAGAGAGTGCTTGTTCCTCTTTG	:57
GCTTGTTCTTCTTTGTGGCCCAATA	:58
CCGCCAAACAGTATCAAGGAGTG	:58
CTTCTTTGTTGCCCAGTACGC	:56
TTGTTCCTCTTTGTTGCCCAATAG	:56
GCTTATTCTTCTTTGTCGCCCA	:55
TGTTCTTCTTTGTAGCCCAGTATGC	:57
TGTTTGTTCTTCTTTGTTGCCCA	:55
TGTTTGTTCTTCTTTGTGGCCCAATA	:58
GAGTGTTTGTTCTTCTTTGTTGCC	:55
TTCTTCTTTGTAGCCCAGTATGCCA	:58
TCTCTTTGTAGCCCAATATGCCA	:55
TTCTTCGTAGCCCAATATGCCA	:55



In [28]:

    
database = read_fasta(file_tested_not_algn)
database = [(rec.id, str(rec.seq)) for rec in database]



In [29]:

    
len(database)









    Out[29]:





7129

A first evaluation of the potential coverage of the primers is done using regular expressions, which does not take into account the thermodynamics properties of the oligos but only the bases identity. Allowing for one mismatch, the coverage is predicted to be about 96%. The problem of this approach is that it doesn't take into account the impact that different mismatches can have.



In [30]:

    
search_oligos(fwds = fwds, revs = revs, substitution = 1, database = database)









    



  15/ 15 tasks finished after    5 s
done
Coverage is 6903 out of 7129 sequences

Allowing two mismatches



In [31]:

    
search_oligos(fwds = fwds, revs = revs, substitution = 2, database = database)









    



  15/ 15 tasks finished after    8 s
done
Coverage is 7103 out of 7129 sequences

Test the primers on the targets using MFEprimer (taking thermodynamics into account)

Make Tweaked Sequences with dummy ends

This section was made with the purpose of reducing the cost of testing all primers as pairs. I will use MFEprimer in the end of the notebook to test the specificity and the coverage of the primers. The problem is that MFEprimer only accepts primers as pairs. Consequently, 60 forward subprimers combined with 60 reverse subprimers would lead to 3,600 pairs that should be tested. To avoid that, I added two long oligos to each end of the target sequence to be used as a dummy forward and reverse primer. In this way, instead of 3,600 independent searches, only 120 are necessary. The results are them combined to assess the performance of the primers as pairs using dataframe operations in pandas.



In [33]:

    
# This has to be the dealigned sequences file
file_to_test = "./data/unalgn_contigs_Nitrosopumilus_cluster.fasta"



In [42]:

    
# This will never overwrite an existing mfe directory. You need to delete the folder mfe manually
# if you want to overwrite it.
if not os.path.isdir("./data/mfe"):
    os.makedirs("./data/mfe")
    records = read_fasta(file_to_test)
    for record in records:
        record.seq = Seq(("AGTATCCTGGATGGGAGGATGGCAACCACTAAAGCCAGTC" +\
                          str(record.seq) +\
                          "CAACAGATTCATACATAAATTATAAGCACCTCATCACTGT"))
    write_fasta("./data/mfe/database.genomic", records)
else:
    print("If you want to rewrite the folder mfe in ./data/ remove it first and then rerun this cell.")
    print("Aborted!")









    



If you want to rewrite the folder mfe in ./data/ remove it first and then rerun this cell.
Aborted!

Make database

The step in the below is slow ~5min, but need to be done only once. That is why I commented it.



In [83]:

    
!./MFEprimer/IndexDb.sh ./data/mfe/database.genomic

Test the primers separately



In [44]:

    
fwd_dummy = "AGTATCCTGGATGGGAGGATGGCAACCACTAAAGCCAGTC"
rev_dummy = str(Seq("CAACAGATTCATACATAAATTATAAGCACCTCATCACTGT").reverse_complement())



In [47]:

    
fwd_primer = fwds
task = run_mfe_parallel(database = "./data/mfe/database.genomic",
                 output = "./data/out_fwd.txt",
                 ppc = 30,
                 min_tm = 47,
                 fwd_list = fwd_primer,
                 rev_list = [rev_dummy],
                 oligo_conc = 200/float(len(fwd_primer)))
task.successful()









    



  30/ 30 tasks finished after    5 s
done






    Out[47]:





True



In [48]:

    
rev_primer = revs_comp
task = run_mfe_parallel(database = "./data/mfe/database.genomic",
                 output = "./data/out_rev.txt",
                 ppc = 30,
                 min_tm = 47,
                 fwd_list = [fwd_dummy],
                 rev_list = rev_primer,
                 oligo_conc = 200/float(len(rev_primer)))
task.successful()









    



  32/ 32 tasks finished after    6 s
done






    Out[48]:





True

Analyze the results



In [50]:

    
%reload_ext rpy2.ipython



In [51]:

    
data = process_data(fwd_data_name = "./data/out_fwd.txt",
                    rev_data_name = "./data/out_rev.txt",
                    delta_tm = 10)



In [52]:

    
data_best_match = data.ix[(data.FpTm > 45) &
                          (data.RpTm > 45) &
                          (data.DeltaTm < 8), :]



In [53]:

    
records_dummy = read_fasta("./data/mfe/database.genomic")
cover = len(data_best_match.HitID.unique())
print "Predicted coverage is %d out of %d." % (cover, len(records_dummy))









    



Predicted coverage is 6550 out of 7129.



In [55]:

    
%pylab inline
detection_fwd = data_best_match.fwd_primer.value_counts()
detection_fwd = detection_fwd / float(len(records_dummy))
detection_fwd.cumsum().plot(kind = "bar", figsize = (12, 4))









    



Populating the interactive namespace from numpy and matplotlib






    Out[55]:





<matplotlib.axes.AxesSubplot at 0x7f3775d57ad0>



In [56]:

    
detection_rev = data_best_match.rev_primer.value_counts()
detection_rev = detection_rev / float(len(records_dummy))
detection_rev.cumsum().plot(kind = "bar", figsize = (12, 4))









    Out[56]:





<matplotlib.axes.AxesSubplot at 0x7f3775939cd0>

Because of the cost of primer synthesis, I decided to try to eliminate as many primers as possible without decreasing the coverage dramatically.

Discard redundant primers



In [58]:

    
data_raw = process_data(fwd_data_name = "./data/out_fwd.txt",
                        rev_data_name = "./data/out_rev.txt",
                        delta_tm = 10,
                        return_raw = True) # This is important



In [59]:

    
data_filtered = remove_redundant_all(data_raw = data_raw, 
                                     min_increase = 10)



In [60]:

    
fwd_degen = list(data_filtered.fwd_primer.unique())



In [61]:

    
rev_degen_comp = list(data_filtered.rev_primer.unique())



In [62]:

    
for fwd in fwd_degen:
    print fwd









    



TTCTACACAGATTGGGCTTGGAC
TTCTACACTGATTGGGCTTGGAC
TTCTATACTGACTGGGCTTGGAC
CAGATTGGGCCTGGACATCC
ATCTTCTACACTGATTGGGCATGG
CAGACTGGGCCTGGACATCA
ACAGACTGGGCTTGGACATC
TACACTGACTGGGCTTGGAC
ACTGATTGGGCCTGGACATC
CTTCTATACTGATTGGGCCTGGAC
CAGATTGGGCCTGGACATCA
GACTGGGCCTGGACTTCGTTTG
TTCTACACTGACTGGGCCTGGA
TCTACACTGACTGGGCATGGA



In [63]:

    
for rev in rev_degen_comp:
    print rev









    



TCTTTGTCGCCCAATATGCCA
ATCAGAGAGTGCTTGTTCCTCTTTG
CCACCAAAGAATATCAGCGAGTG
GCCAAACAGTATCAACGAGTGCTTG
GCTTGTTCTTCTTTGTTGCCCA
GCTTGTTCTTCTTTGTGGCCCAATA
CCGCCAAACAGTATCAACGAGTG
TGTTTGTTCTTCTTTGTTGCCCA
AGTATCAAGGAGTGCTTATTCTTCTTTG
TGCTTATTCTTCTTTGTTGCCCA
ATCAGTGAGTGCTTGTTCTTCTTTG
TCTCTTTGTAGCCCAATATGCCA
AACAGTATCAAGGAGTGCTTGTTC
TGCTTGTTCTTCTTTGTAGCCCA
TGTTTGTTCTTCTTTGTAGCCCAATA
TTTCTCTTTGTTGCCCAATATGCTA

Rerun the analysis with reduced primer set



In [64]:

    
fwd_dummy = "AGTATCCTGGATGGGAGGATGGCAACCACTAAAGCCAGTC"
rev_dummy = str(Seq("CAACAGATTCATACATAAATTATAAGCACCTCATCACTGT").reverse_complement())



In [65]:

    
fwd_primer = fwd_degen
task = run_mfe_parallel(database = "./data/mfe/database.genomic",
                 output = "./data/out_red_fwd.txt",
                 ppc = 30,
                 min_tm = 50,
                 fwd_list = fwd_primer,
                 rev_list = [rev_dummy],
                 oligo_conc = 200/float(len(fwd_primer)))
task.successful()









    



  14/ 14 tasks finished after    2 s
done






    Out[65]:





True



In [66]:

    
rev_primer = rev_degen_comp
task = run_mfe_parallel(database = "./data/mfe/database.genomic",
                 output = "./data/out_red_rev.txt",
                 ppc = 30,
                 min_tm = 50,
                 fwd_list = [fwd_dummy],
                 rev_list = rev_primer,
                 oligo_conc = 200/float(len(rev_primer)))
task.successful()









    



  16/ 16 tasks finished after    3 s
done






    Out[66]:





True



In [67]:

    
data = process_data(fwd_data_name = "./data/out_red_fwd.txt",
                    rev_data_name = "./data/out_red_rev.txt",
                    delta_tm = 10)



In [68]:

    
data_best_match = data.ix[(data.FpTm > 50) &
                          (data.RpTm > 50) &
                          (data.DeltaTm < 8), :]



In [69]:

    
%pylab inline
detection_fwd = data_best_match.fwd_primer.value_counts()
detection_fwd = detection_fwd / float(len(records_dummy))
detection_fwd.cumsum().plot(kind = "bar", figsize = (12, 4))









    



Populating the interactive namespace from numpy and matplotlib






    Out[69]:





<matplotlib.axes.AxesSubplot at 0x7f3775bff450>



In [70]:

    
detection_fwd









    Out[70]:





ACTGATTGGGCCTGGACATC        0.190349
TTCTATACTGACTGGGCTTGGAC     0.177865
TACACTGACTGGGCTTGGAC        0.138308
CTTCTATACTGATTGGGCCTGGAC    0.073783
TCTACACTGACTGGGCATGGA       0.058774
TTCTACACAGATTGGGCTTGGAC     0.044045
TTCTACACTGATTGGGCTTGGAC     0.043905
GACTGGGCCTGGACTTCGTTTG      0.036050
CAGACTGGGCCTGGACATCA        0.023566
CAGATTGGGCCTGGACATCA        0.019358
ACAGACTGGGCTTGGACATC        0.019217
ATCTTCTACACTGATTGGGCATGG    0.016272
TTCTACACTGACTGGGCCTGGA      0.013466
CAGATTGGGCCTGGACATCC        0.008837
dtype: float64



In [71]:

    
detection_rev = data_best_match.rev_primer.value_counts()
detection_rev = detection_rev / float(len(records_dummy))
detection_rev.cumsum().plot(kind = "bar", figsize = (12, 4))









    Out[71]:





<matplotlib.axes.AxesSubplot at 0x7f37756f69d0>



In [72]:

    
detection_rev









    Out[72]:





TGCTTATTCTTCTTTGTTGCCCA         0.162575
GCTTGTTCTTCTTTGTTGCCCA          0.130874
AACAGTATCAAGGAGTGCTTGTTC        0.073222
TGCTTGTTCTTCTTTGTAGCCCA         0.067611
ATCAGTGAGTGCTTGTTCTTCTTTG       0.064104
GCCAAACAGTATCAACGAGTGCTTG       0.055127
TCTTTGTCGCCCAATATGCCA           0.049656
AGTATCAAGGAGTGCTTATTCTTCTTTG    0.042643
TGTTTGTTCTTCTTTGTAGCCCAATA      0.042082
GCTTGTTCTTCTTTGTGGCCCAATA       0.037593
CCACCAAAGAATATCAGCGAGTG         0.036892
TCTCTTTGTAGCCCAATATGCCA         0.030720
TGTTTGTTCTTCTTTGTTGCCCA         0.025670
CCGCCAAACAGTATCAACGAGTG         0.017674
TTTCTCTTTGTTGCCCAATATGCTA       0.015290
ATCAGAGAGTGCTTGTTCCTCTTTG       0.012063
dtype: float64



In [73]:

    
calcTm = lambda x: primer3.calcTm(x, mv_conc = 50, dv_conc = 1.5, dna_conc = 200/float(len(fwd_primer)))
p = pandas.DataFrame(map(calcTm, fwd_primer)).sort(inplace = False)



In [74]:

    
%%R -i p -w 600 -h 300 
library(ggplot2)
p$sub = rownames(p)
theme_set(theme_bw())
ggplot(p) +
    aes(y = X0 + 1,
       x = reorder(sub, X0)) + 
    geom_point() + 
    scale_x_discrete(breaks = NULL) +
    scale_y_continuous(breaks = seq(30, 100, 1)) +
    labs(x = "Sub-primers", y = "Melting Temperature °C", title = "Forward")



In [75]:

    
calcTm = lambda x: primer3.calcTm(x, mv_conc = 50, dv_conc = 1.5, dna_conc = 200/float(len(rev_primer)))
p = pandas.DataFrame(map(calcTm, rev_primer)).sort(inplace = False)



In [76]:

    
%%R -i p -w 600 -h 300 
library(ggplot2)
p$sub = rownames(p)
theme_set(theme_bw())
ggplot(p) +
    aes(y = X0 + 1,
       x = reorder(sub, X0)) + 
    geom_point() + 
    scale_x_discrete(breaks = NULL) +
    scale_y_continuous(breaks = seq(30, 100, 1)) +
    labs(x = "Sub-primers", y = "Melting Temperature °C", title = "Reverse")



In [77]:

    
cover = len(data_best_match.HitID.unique())
print "Predicted coverage is %d out of %d." % (cover, len(records_dummy))









    



Predicted coverage is 6158 out of 7129.

The melting temperature here might be slightly different from the originally predicted for following reasons:

The primer concentration is rescaled by the actual number of primers and not by the max_degen parameter;
Mismatches will naturally decrease the melting temperature;
The primers with inosine will be variable Tm depending of the template sequence.



In [78]:

    
%%R -i data_best_match -w 900 -h 500 
library(ggplot2)
library(reshape2)
grouped <- data_best_match

theme_set(theme_bw())
ggplot(grouped) +
    geom_abline(intercept = median(grouped$FpTm), 
                slope = 0,
               colour = "red") +
    geom_abline(intercept = quantile(grouped$FpTm, .95), 
                slope = 0,
               linetype = 2,
               colour = "red") +
    geom_abline(intercept = quantile(grouped$FpTm, .05), 
                slope = 0,
               linetype = 2,
               colour = "red") +
    aes(x = reorder(fwd_primer, FpTm), y = FpTm) +
    geom_jitter(alpha = .1) +
    scale_y_continuous(breaks = seq(45, 70, .5)) +
    labs(x = "Forward Primers", y = "Forward Melting Temperature °C", 
         alpha = "Unique Hits \n per Assay") +
    theme(axis.text.x = element_text(angle=70, hjust = 1))

Melting temperature (y axis) for all forward primers (x axis). Each point represents the best hit for each sequence in the target database. Transparency is used to facilitate the visualization of the density. The dashed lines indicates the 10th and 90th percentiles and the solid line represents the median.



In [79]:

    
%%R -w 900 -h 500 
theme_set(theme_bw())
ggplot(grouped) +
    geom_abline(intercept = median(grouped$RpTm), 
                slope = 0,
               colour = "red") +
    geom_abline(intercept = quantile(grouped$RpTm, .95), 
                slope = 0,
               linetype = 2,
               colour = "red") +
    geom_abline(intercept = quantile(grouped$RpTm, .05), 
                slope = 0,
               linetype = 2,
               colour = "red") +
    aes(x = reorder(rev_primer, RpTm), y = RpTm) +
    geom_jitter(alpha = .1) +
    scale_y_continuous(breaks = seq(45, 70, .5)) +
    labs(x = "Reverse Primers", y = "Reverse Melting Temperature °C", 
         alpha = "Unique Hits \n per Assay") +
    theme(axis.text.x = element_text(angle=70, hjust = 1))

Melting temperature (y axis) for all reverse primers (x axis). Each point represents the best hit for each sequence in the target database. Transparency is used to facilitate the visualization of the density. The dashed lines indicates the 10th and 90th percentiles and the solid line represents the median.



In [80]:

    
%%R -w 900 -h 500 
theme_set(theme_bw())
ggplot(grouped) +
    aes(x = reorder(rev_primer, DeltaTm), y = DeltaTm) +
    geom_abline(intercept = median(grouped$DeltaTm), 
                slope = 0,
               colour = "red") +
    geom_abline(intercept = quantile(grouped$DeltaTm, .95), 
                slope = 0,
               linetype = 2,
               colour = "red") +
    geom_abline(intercept = quantile(grouped$DeltaTm, .05), 
                slope = 0,
               linetype = 2,
               colour = "red") +
    geom_jitter(alpha = .1) +
    scale_y_continuous(breaks = seq(0, 20, 1)) +
    labs(x = "Reverse Primers", 
         y = "Absolute Difference in Melting Temperature °C",
         alpha = "Unique Hits \n per Assay") +
    theme(axis.text.x = element_text(angle=70, hjust = 1))

Absolute melting temperature difference between forward and reverse primers from the best hit for each sequence in the target database. The dashed lines indicates the 10th and 90th percentiles and the solid line represents the median.



In [81]:

    
%%R -w 900 -h 500 
theme_set(theme_bw())
ggplot(grouped) +
    aes(x = reorder(rev_primer, AmpLen), y = AmpLen) +
    geom_abline(intercept = median(grouped$AmpLen), 
                slope = 0,
               colour = "red") +
    geom_abline(intercept = quantile(grouped$AmpLen, .95), 
                slope = 0,
               linetype = 2,
               colour = "red") +
    geom_abline(intercept = quantile(grouped$AmpLen, .05), 
                slope = 0,
               linetype = 2,
               colour = "red") +
    geom_jitter(alpha = .1) +
    scale_y_continuous(breaks = seq(0, 500, 5)) +
    labs(x = "Reverse Primers", y = "Amplicon length") +
    theme(axis.text.x = element_text(angle=70, hjust = 1))

Amplicon size for the best hit for each sequence in the target database. The dashed lines indicates the 10th and 90th percentiles and the solid line represents the median.

Save primers to file



In [82]:

    
with open("./data/fwd_primers.txt", "w") as handle:
    for pos, fwd in enumerate(fwd_degen):
        handle.write("fwd_%d\t%s\n" % (pos, fwd))
with open("./data/rev_primers.txt", "w") as handle:
    for pos, rev in enumerate(rev_degen_comp):
        handle.write("rev_%d\t%s\n" % (pos, rev))