This IPython notebook file demonstrates some of the functionality available in poretools, and an example run report for a recent R7 chemistry nanopore run.

Some of the examples call out to R for plotting, so to make this work in this notebook we need to load the rpy2.ipython module.


In [53]:
%load_ext rpy2.ipython

poretools can run either on an individual FAST5 file, a directory containing FAST5 files, or a tar archive of FAST5 files. Here we set up a $directory variable for use for the rest of the tutorial. You could change this and run the same commands on your data.


In [18]:
directory='/mnt/borage/nick/nanopore/data/Flowcell6/downloads'

In [65]:
!find $directory -maxdepth 1 -name "*.fast5" | wc -l


60196

There are 60,196 FAST5 files in the directory.

poretools has a number of different command line options. Running poretools with no parameters gives us a brief list (and complies with Torsten's first rule)


In [21]:
!poretools


usage: poretools [-h] [-v]
                 
                 {combine,fastq,fasta,stats,hist,events,readstats,tabular,nucdist,qualdist,winner,squiggle,times,yield_plot}
                 ...
poretools: error: too few arguments

We can get more information if we run poretools with the -h (help) option.


In [22]:
!poretools -h


usage: poretools [-h] [-v]
                 
                 {combine,fastq,fasta,stats,hist,events,readstats,tabular,nucdist,qualdist,winner,squiggle,times,yield_plot}
                 ...

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         Installed poretools version

[sub-commands]:
  {combine,fastq,fasta,stats,hist,events,readstats,tabular,nucdist,qualdist,winner,squiggle,times,yield_plot}
    combine             Combine a set of FAST5 files in a TAR achive
    fastq               Extract FASTQ sequences from a set of FAST5 files
    fasta               Extract FASTA sequences from a set of FAST5 files
    stats               Get read size stats for a set of FAST5 files
    hist                Plot read size histogram for a set of FAST5 files
    events              Extract each nanopore event for each read.
    readstats           Extract signal information for each read over time.
    tabular             Extract the lengths and name/seq/quals from a set of
                        FAST5 files in TAB delimited format
    nucdist             Get the nucl. composition of a set of FAST5 files
    qualdist            Get the qual score composition of a set of FAST5 files
    winner              Get the longest read from a set of FAST5 files
    squiggle            Plot the observed signals for FAST5 reads.
    times               Return the start times from a set of FAST5 files in
                        tabular format
    yield_plot          Plot the yield over time for a set of FAST5 files

Let's start with a simple one, the stats command, this will give us some basic statistics about our reads.

The -q option stops poretools outputting any warning messages.


In [68]:
!poretools stats -q $directory


total reads	104969
total base pairs	550200969
mean	5241.56
median	4616
min	5
max	154417

How do we have 104,969 reads from 60,196 FAST5 files? That's because forward, reverse and two-directional reads are all counted separately.


In [69]:
!poretools stats -q --type fwd $directory


total reads	53914
total base pairs	290880019
mean	5395.26
median	4441
min	5
max	154417

We have 53,914 forward reads in our total dataset.


In [71]:
!poretools stats -q --type rev $directory


total reads	29622
total base pairs	124718901
mean	4210.35
median	3765
min	5
max	44835

In [70]:
!poretools stats -q --type 2D $directory


total reads	21433
total base pairs	134602049
mean	6280.13
median	6020
min	211
max	38598

We have 21,433 two-direction reads, which is about 40% of the reads which have been base-called and about 72% of the reads that have a detectable complement strand.


In [73]:
!poretools readstats -q $directory > readstats.txt

In [ ]:
!wc -l readstats.txt

readstats gives you a line per every FAST5 file in your dataset. The columns are:

  • start_time (represented as a UNIX timestamp, e.g. seconds since the UNIX epoch)
  • pore number
  • read number (not working ATM, the format changed)
  • length of forward read
  • length of reverse read

In [75]:
!head -10 readstats.txt


1404984747	491	None	301	0
1405005921	325	None	299	0
None	410	None	0	0
1404940637	415	None	5271	5113
1405002720	222	None	3064	0
1404959466	209	None	5901	5634
1404949019	12	None	4524	3576
1404930734	470	None	345	0
1404936756	491	None	310	0
1404947432	109	None	697	0

One useful plot you can easily do with the output of read stats is to plot the number of events in forward reads against reverse reads. Ideally every read would have a similar number, which would indicate the hairpin is correctly attached and the strand translocation rate is controlled by the enzyme.


In [79]:
%R stats=read.table("readstats.txt", sep="\t")
%R stats=subset(stats, V4 < 20000 & V5 < 20000)


Out[79]:
&ltclass 'pandas.core.frame.DataFrame'>
Int64Index: 59048 entries, 0 to 59047
Data columns (total 5 columns):
V1    59048  non-null values
V2    59048  non-null values
V3    59048  non-null values
V4    59048  non-null values
V5    59048  non-null values
dtypes: int32(2), object(3)

In [77]:
%R smoothScatter(stats$V4,stats$V5)



In [ ]:
!poretools winner -q --type 2D $directory > winner.fasta

This will just use the header data to generate a squiggle plot:


In [126]:
!head -1 winner.fasta | sed 's/>.* //' | xargs poretools squiggle --saveas png --num-facets 12

In [127]:
squiggle=!head -1 winner.fasta | sed 's/>.* //'
squiggle=squiggle[0] + '.png'
Image(squiggle)


Out[127]:

In [128]:
!poretools yield_plot -q --theme-bw --saveas yield_plot.png --plot-type reads $directory

In [ ]:
!poretools yield_plot -q --theme-bw --saveas yield_plot.pdf --plot-type reads $directory

In [86]:
Image("yield_plot.png")


Out[86]:

In [87]:
!poretools hist -q --theme-bw --min-length 1000 --max-length 40000 --saveas hist.png $directory

In [129]:
!poretools hist -q --theme-bw --min-length 1000 --max-length 40000 --saveas hist.pdf $directory

In [88]:
Image("hist.png")


Out[88]:

In [89]:
!poretools nucdist -q $directory


A	65660764	296997439	0.221081919834
C	70989346	296997439	0.239023428077
T	70730220	296997439	0.23815094244
G	75164353	296997439	0.253080811919
N	14452756	296997439	0.0486628977296

In [90]:
!poretools qualdist -q $directory


!	0	13821031	296997405	0.0465358645137
"	1	25112550	296997405	0.0845547791907
#	2	30275124	296997405	0.101937335109
$	3	28014262	296997405	0.0943249386304
%	4	27335226	296997405	0.0920386021555
&	5	29026969	296997405	0.097734756302
'	6	28000641	296997405	0.0942790762768
(	7	25262940	296997405	0.0850611472514
)	8	21469851	296997405	0.0722896922281
*	9	17021140	296997405	0.0573107364356
+	10	12194629	296997405	0.0410597156564
,	11	9132498	296997405	0.0307494201843
-	12	6955173	296997405	0.0234182955235
.	13	5258815	296997405	0.0177066025207
/	14	3952021	296997405	0.0133065842781
0	15	2980747	296997405	0.0100362728759
1	16	2267230	296997405	0.00763383774346
2	17	1751907	296997405	0.00589872830707
3	18	1375773	296997405	0.00463227279713
4	19	1098801	296997405	0.00369969899232
5	20	888009	296997405	0.00298995541729
6	21	726878	296997405	0.00244742205744
7	22	599176	296997405	0.0020174452366
8	23	497704	296997405	0.00167578568574
9	24	415916	296997405	0.00140040280823
:	25	746873	296997405	0.002514745878
;	26	215364	296997405	0.00072513764893
<	27	176731	296997405	0.000595059071307
=	28	138388	296997405	0.000465956933193
>	29	103529	296997405	0.000348585537305
?	30	73216	296997405	0.000246520672462
@	31	49430	296997405	0.000166432430613
A	32	30164	296997405	0.000101563176958
B	33	16107	296997405	5.42327970845e-05
C	34	8018	296997405	2.69968688784e-05
D	35	3409	296997405	1.14782147676e-05
E	36	996	296997405	3.35356465488e-06
F	37	129	296997405	4.34347229398e-07
G	38	11	296997405	3.70373606463e-08
I	40	2	296997405	6.73406557205e-09
L	43	5	296997405	1.68351639301e-08
P	47	2	296997405	6.73406557205e-09
V	53	3	296997405	1.01010983581e-08
Z	57	5	296997405	1.68351639301e-08
h	71	2	296997405	6.73406557205e-09
j	73	4	296997405	1.34681311441e-08
m	76	3	296997405	1.01010983581e-08
s	82	3	296997405	1.01010983581e-08