Trimming and mapping FASTQ


1. Raw FASTQ files preprocessing

(Latest) library structure: [Alu primer - 12 bp][Alu sequence - 6 bp][Flank][Adapter1 - 10 bp][UMI, barcode - 9 bp][Adapter2 - 12 bp]
Variable parameters:
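  • MIST - maximum Hamming distance allowed when matching the primer and adapters
  • SHIFT - maximum position shift allowed when searching for the primer or adapters
  • PRIMER - primer sequence
  • AD1 - adapter 1 sequence
  • AD2 - adapter 2 sequence
  • BLEN - length of the UMI (barcode)
  • RESTRICT_SITE - restriction site, 'AGCT' or 'CTAG'
  • R2_START - sequence after the adapters
  • CHAOS - whether the input files are mixed (True or False)
  • N_CORE - number of active cores
Steps: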
  • inputdir - folder with raw *.fastq.gz files
  • separate reads into good and bad (FASTQ)
    • by primer and adapters
    • by restrict_site
    • by r2_start
    • by primer and adapters in the flank
  • trim the primer, UMI, and adapters, but keep their info in the read header
  • save the reason each bad read was rejected (FASTQ)
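
The primer search and trimming step can be illustrated with a minimal sketch. This is not the trimmALU implementation: the helper names (hamming, find_with_shift, trim_r1), the "primer:" header tag, and the rejection message are all hypothetical, and the sketch assumes the primer is expected at the 5' end of read 1. It only shows how MIST (mismatch tolerance) and SHIFT (position tolerance) could interact.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_with_shift(seq, pattern, expected_pos, shift, mist):
    """Find `pattern` near `expected_pos`, allowing up to `shift` positions
    of offset in either direction and up to `mist` mismatches.
    Offsets closest to `expected_pos` are tried first.
    Returns the start position, or None if no window matches."""
    for delta in sorted(range(-shift, shift + 1), key=abs):
        pos = expected_pos + delta
        if pos < 0 or pos + len(pattern) > len(seq):
            continue  # window would fall outside the read
        if hamming(seq[pos:pos + len(pattern)], pattern) <= mist:
            return pos
    return None


def trim_r1(header, seq, qual, primer, shift, mist):
    """Trim the primer from the 5' end of a read (hypothetical helper).
    Returns ('good', (header, seq, qual)) with the observed primer stored
    in the header, or ('bad', reason) if the primer was not found."""
    pos = find_with_shift(seq, primer, 0, shift, mist)
    if pos is None:
        return "bad", "primer not found"
    end = pos + len(primer)
    # The 'primer:' tag is an assumed header format, for illustration only.
    new_header = f"{header} primer:{seq[pos:end]}"
    return "good", (new_header, seq[end:], qual[end:])
```

With MIST = 1 and SHIFT = 4, a read whose primer starts one base late and carries a single mismatch would still be accepted, while a read with no primer-like window in the first few positions would be written to the bad file together with its reason.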

In [ ]:
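# All imports here; don't touch them unless you are a developer
# (importlib replaces the deprecated imp module, which was removed in Python 3.12)
import importlib
import trimmALU
importlib.reload(trimmALU)  # pick up local edits to trimmALU without restarting the kernel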

# Variable parameters
MIST = 1
SHIFT = 4
PRIMER = 'GAGCCACCGCGC'
AD1 = 'GCGTGCTGCGG'
AD2 = 'AGGGCGGT'
BLEN = 9
RESTRICT_SITE = 'AGCT'
R2_START = 'CT'
CHAOS = False
N_CORE = 2

# Input folder with raw FASTQ files
# (use an absolute path if trimmALU does not expand '~')
INPUTDIR = '~/data'
# Output folder for processed FASTQ files
OUTPUTDIR = '~/data_processed'

tmp = trimmALU.main(inputdir=INPUTDIR,
                    outputdir=OUTPUTDIR,
                    primer=PRIMER,
                    ad1=AD1,
                    ad2=AD2,
                    blen=BLEN,
                    shift=SHIFT,
                    mist=MIST,
                    restrict_site=RESTRICT_SITE,
                    r2_start=R2_START,
                    chaos=CHAOS,
                    n_core=N_CORE)
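
Before running the cell above, it may help to confirm that the input folder resolves and actually contains *.fastq.gz files. The helper below is a small sketch using only the standard library; list_fastq is a hypothetical name and is not part of trimmALU, and whether trimmALU itself expands '~' is not stated here.

```python
from pathlib import Path


def list_fastq(inputdir):
    """Return the sorted *.fastq.gz paths in `inputdir`.
    A leading '~' is expanded to the user's home directory."""
    folder = Path(inputdir).expanduser()
    if not folder.is_dir():
        raise FileNotFoundError(f"input folder not found: {folder}")
    return sorted(folder.glob("*.fastq.gz"))
```

For example, calling list_fastq(INPUTDIR) and printing the names of the returned paths gives a quick view of which files would be picked up, and an empty list suggests the path or file extension is wrong.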