Count all reads

htseq-count must be installed for the same Python version as the one in your environment. Officially, HTSeq supports only Python 2. However, there is a more or less fully functional Python 3 branch here: https://github.com/simon-anders/htseq/tree/python3

To install it in your environment, follow the instructions in the README.

Since htseq-count itself is not multithreaded, the best we can do is to launch a separate process for each sample and send it to the background with &. As long as the number of samples is less than the number of cores on snowflake, each process gets its own core and we should be fine.
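
The core-count condition above can be checked before launching anything. This is a minimal sketch, not part of the original workflow: it assumes nproc (GNU coreutils) is available, falls back to getconf otherwise, and hard-codes num_samples to the same value as the cell below so that it runs on its own.

```shell
# Pre-flight check (illustrative): compare the sample count with the core count.
# num_samples mirrors the value set in the configuration cell.
num_samples=6

# nproc is part of GNU coreutils; getconf is used as a fallback if it is missing.
cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)

if [ "$num_samples" -gt "$cores" ]; then
    echo "Warning: $num_samples samples but only $cores cores; jobs will share cores"
else
    echo "OK: $num_samples samples fit on $cores cores"
fi
```

If the warning appears, the jobs still run, but they compete for CPU time and take longer to finish.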

Change these values to match your samples and reference files.



In [ ]:

    
num_samples=6
index="../ref/MG1655"
gff="../ref/NC_000913.gff"
sampleid="sample"

counter=$(which htseq-count)



In [ ]:

    
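# Launch one background htseq-count process per sample.
for i in $(seq 1 "$num_samples")
do
    sample="${sampleid}${i}"
    result_dir="../results/${sample}"
    bamfile="$result_dir/${sample}_sorted.bam"
    echo "Processing: $bamfile"
    # -q: quiet, -s no: unstranded, -t gene: count "gene" features,
    # -i Name: use the Name attribute as the feature ID, -r pos: input is
    # position-sorted, -a 0: no minimum alignment quality, -f bam: BAM input
    "$counter" -q -s no -t gene -i Name -r pos -a 0 -f bam \
        "$bamfile" "$gff" \
        > "$result_dir/$(basename "$bamfile" .bam).counts" &
done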



In [ ]:

    
ls -lah ../results/sample1
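
Because the loop above returns as soon as the jobs are launched, a listing taken right away may show empty or partial .counts files. One way to block until every job has finished is to record each process ID and wait on it. The following is a minimal sketch under stated assumptions: fake_count is a hypothetical stand-in for htseq-count so the example runs anywhere, and the output goes to a temporary directory rather than ../results.

```shell
# Hypothetical stand-in for htseq-count: sleeps briefly, then prints one count line.
fake_count() { sleep 1; printf 'gene_a\t%s\n' "$1"; }

outdir=$(mktemp -d)
pids=""

# Launch one background job per sample and remember its process ID.
for i in 1 2 3
do
    fake_count "$i" > "$outdir/sample${i}.counts" &
    pids="$pids $!"
done

# Wait for each job in turn and count the ones that exited with an error.
failed=0
for pid in $pids
do
    wait "$pid" || failed=$((failed + 1))
done

echo "Finished with $failed failed job(s)"
ls -lah "$outdir"
```

In the real loop, the same pattern would apply by appending $! to the list right after the htseq-count line; a bare wait with no arguments also works if the exit status of individual jobs does not matter.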