When inspecting the results of the reference-based assembly you may find that many loci are not covered by many or any reads. In that case you may want to make a selection of loci with good coverage and continue to only work with these loci for downstream analyses. secapr
has a function that helps you to extract loci with good coverage called locus_selection
.
In [1]:
%%bash
source activate secapr_env
secapr locus_selection -h
This function will compile the average read-coverage for each locus and sample and will select the n
loci with the best coverage accross all samples.
You can run this function simply by providing the folder containing the remapping results (--input
) and stating how many of the best loci you want to extract (--n
). You can additionally use the --read_cov
flag to provide a custom read coverage threshold (default=3). Only loci with an average read-coverage above this threshold in all samples are being extracted. If not n loci are found that fulfil this threshold for all samples, the script will print a warning to the screen but still extract all loci fulfilling the requirement.
secapr locus_selection --input ../../data/processed/remapped_reads --output ../../data/processed/selected_loci --n 50
In [3]:
import sys
sys.path.append("../../src")
import plot_contig_data_function as secapr_plot
contig_input_file = '../../data/processed/target_contigs/match_table.txt'
alignment_folder = '../../data/processed/alignments/contig_alignments'
read_cov_file_selected = '../../data/processed/selected_loci/overview_selected_loci.txt'
secapr_plot.plot_contigs_alignments_read_cov(contig_input_file,alignment_folder,read_cov_file_selected,norm_value=10)
Out[3]:
We can also plot the results of the selection in a more detailed view, by showing the contig and read coverage of only the selected loci:
In [5]:
%matplotlib inline
selected_loci = secapr_plot.plot_contigs_alignments_read_cov(contig_input_file,alignment_folder,read_cov_file_selected,reduce=True,norm_value=10)
#selected_loci.savefig(os.path.join('/Users/tobias/GitHub/seqcap_processor/data/processed/selected_loci_50/','contig_exon_coverage_matrix_reduced.png'), dpi = 500)
And here are the corresponding legends:
In [3]:
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
legend = secapr_plot.plot_heatmap_legend(0,10,width=.75,height=2, font_size=14)
contig = secapr_plot.general_scale_bar(2,tick_labels=['No','Yes'],x0=.1,x1=.25,plot_height=.5,plot_width=.3,font_size = 14,color1='white',color2=(0.031372549019607843, 0.25098039215686274, 0.50588235294117645),height=1,width=.75,plot_label='Contig present')
align = secapr_plot.general_scale_bar(2,tick_labels=['No','Yes'],x0=.1,x1=.25,plot_height=.5,plot_width=.3,font_size = 14,color1='white',color2=(0.0, 0.26666666666666666, 0.10588235294117647),height=1,width=.75,plot_label='Alignment present')
plt.show(contig)
plt.show(align)
plt.show(legend)
The plotting scripts also automatically output a text file that contains the corresponding locus name for each exon index in the plot. The text file is stored in the same folder as the input data, below we show the first lines of the file (note that the locus names have been assigned numerical IDs by secapr, see file reference_fasta_header_info.txt
in the output folder of the secapr find_target_contigs
function):
In [6]:
import pandas as pd
pd.read_csv('../../data/processed/selected_loci/locus_index_overview.txt',sep='\t',header=None).head(10)
Out[6]:
Now after selecting the best loci we want to build Multiple Sequence Alignments (MSAs) from these loci. For this we can simply run the secapr align_sequences
function:
secapr align_sequences --sequences ../../data/processed/selected_loci/joined_fastas_selected_loci.fasta --output ../../data/processed/alignments/selected_loci_contig_alignments --aligner mafft --output-format fasta --no-trim --ambiguous
In [7]:
%%bash
cd ../../
python src/heatmap_plot.py
In [9]:
%%html
<div>
<a href="https://plot.ly/~tobiashofmann/48/?share_key=wC4zjzzzXVpyZ4iRjUja28" target="_blank" title="plot from API (24)" style="display: block; text-align: center;"><img src="https://plot.ly/~tobiashofmann/48.png?share_key=wC4zjzzzXVpyZ4iRjUja28" alt="plot from API (24)" style="max-width: 100%;width: 600px;" width="600" onerror="this.onerror=null;this.src='https://plot.ly/404.png';" /></a>
<script data-plotly="tobiashofmann:48" sharekey-plotly="wC4zjzzzXVpyZ4iRjUja28" src="https://plot.ly/embed.js" async></script>
</div>
The read-coverage in the set of selected loci however is rather good for all/most samples:
In [10]:
%%html
<div>
<a href="https://plot.ly/~tobiashofmann/50/?share_key=VZLFvEmzO1oJ3VGD3SUc8g" target="_blank" title="plot from API (25)" style="display: block; text-align: center;"><img src="https://plot.ly/~tobiashofmann/50.png?share_key=VZLFvEmzO1oJ3VGD3SUc8g" alt="plot from API (25)" style="max-width: 100%;width: 600px;" width="600" onerror="this.onerror=null;this.src='https://plot.ly/404.png';" /></a>
<script data-plotly="tobiashofmann:50" sharekey-plotly="VZLFvEmzO1oJ3VGD3SUc8g" src="https://plot.ly/embed.js" async></script>
</div>