Exercise 6

Given a file in GTF (Gene Transfer Format) that annotates a set of genes on the same genomic reference, and the file of that genomic reference in FASTA format, produce as output:

  • for each annotated exon:
    • the list of transcripts that include the exon
    • the list of transcripts for which the exon is completely covered by coding sequence, specifying the codon subdivision with respect to each listed transcript

Input parameters:

  • name of the GTF file
  • name of the genomic reference file in FASTA format

Requirements:

  • a function reverse_complement_in_case() must be defined that takes a nucleotide sequence and a strand value as arguments and returns the reverse complement of the sequence if the strand is -, otherwise the sequence as it is
  • a function codon_separating() must be defined that takes a nucleotide sequence and a frame in {0, 1, 2} as arguments, splits the sequence into codons (taking the frame value into account) and returns a string that joins the codons using the space character as separator

NOTE: the attributes (name-value pairs) in the ninth field of a GTF record must not be assumed to appear in a fixed order within the field. To extract an attribute, therefore, do not rely on the split() method; use a regular expression instead.
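As an illustration of the note above, here is a minimal sketch of order-independent attribute extraction. The helper name get_attribute() is hypothetical (it is not part of the exercise), and the sketch assumes that attribute values are enclosed in double quotes and terminated by a semicolon, as in the records shown in this document.

```python
import re

def get_attribute(attribute_field, name):
    """Return the value of attribute `name` from the ninth GTF field, or None if absent.

    Hypothetical helper: assumes values are double-quoted and followed by ';'.
    """
    pattern = r'\b' + re.escape(name) + r'\s+"([^"]+)";'
    match = re.search(pattern, attribute_field)
    return match.group(1) if match else None

# The same two attributes, written in two different orders.
field_a = 'transcript_id "U52112.4-014"; gene_id "ARHGAP4";'
field_b = 'gene_id "ARHGAP4"; transcript_id "U52112.4-014";'

print(get_attribute(field_a, 'gene_id'))
print(get_attribute(field_b, 'gene_id'))
```

Both calls return the same value because the regular expression searches for the attribute name anywhere in the field instead of relying on its position.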


Output variables:

  • (for the first point) exon_inclusion_dict: exon inclusion dictionary:
    • key: exon as a tuple (start, end)
    • value: set of the transcripts that include the exon (each transcript must be represented by the tuple (transcript_id, gene_id))
  • (for the second point) exon_coverage_list: list of the exons entirely covered by coding sequence, with their codon subdivision; each element of the list is a tuple of the following five elements: transcript_id, gene_id, exon start, exon end, codon subdivision (taking the frame value into account). NOTE: the same exon can appear in several elements of the list, since it can be included in several transcripts.

Notes on the GTF format

GTF features and records

A GTF feature is an interval of positions on the genomic reference that has a specific functional meaning, for example an exon (feature of type exon) or a fragment of the coding sequence of a transcript (feature of type CDS).

A GTF record represents a feature of a given type included in a given transcript (of a given gene). The feature type is specified in the third field of the record.

For example, the record:

ENm006 VEGA_Known   exon    64566   64757   .   -   .   transcript_id "U52112.4-014"; gene_id "ARHGAP4";

represents an exon (feature of type exon) that starts at position 64566, ends at position 64757, and is included in (that is, is part of) transcript U52112.4-014 of gene ARHGAP4.

By contrast, the record:

ENm006 VEGA_Known CDS   70312   70440   .   -   0   transcript_id "U52112.4-005"; gene_id "ARHGAP4";

represents a fragment of the coding sequence (feature of type CDS) of transcript U52112.4-005 of gene ARHGAP4, mapped on the genomic reference from position 70312 to position 70440.

Exon included in different transcripts

N records of type exon that correspond to the same feature on the genomic reference (same interval of positions) represent the same exon included in N different transcripts.

For example, the following records:

ENm006 VEGA_Known   exon    64566   64757   .   -   .   transcript_id "U52112.4-014"; gene_id "ARHGAP4";
ENm006 VEGA_Known   exon    64566   64757   .   -   .   transcript_id "U52112.4-002"; gene_id "ARHGAP4";
ENm006 VEGA_Known   exon    64566   64757   .   -   .   transcript_id "U52112.4-003"; gene_id "ARHGAP4";
ENm006 VEGA_Known   exon    64566   64757   .   -   .   transcript_id "U52112.4-001"; gene_id "ARHGAP4";
ENm006 VEGA_Known   exon    64566   64757   .   -   .   transcript_id "U52112.4-024"; gene_id "ARHGAP4";
ENm006 VEGA_Known   exon    64566   64757   .   -   .   transcript_id "U52112.4-011"; gene_id "ARHGAP4";

represent the exon [64566, 64757] included in six different transcripts of gene ARHGAP4.

Exon covered by coding sequence for a given transcript

An exon included in a transcript is completely covered by coding sequence if, in addition to the exon record that represents the exon in that transcript, there is also a CDS record with the same interval of positions that represents a fragment of the coding sequence of the same transcript.

For example, the two records:

ENm006 VEGA_Known exon  70312   70440   .   -   .   transcript_id "U52112.4-005"; gene_id "ARHGAP4";
ENm006 VEGA_Known CDS   70312   70440   .   -   0   transcript_id "U52112.4-005"; gene_id "ARHGAP4";

indicate that the exon [70312, 70440] (included in transcript U52112.4-005 of gene ARHGAP4) is completely covered by coding sequence.

Codon subdivision of a CDS feature of a given transcript

The codon subdivision of a CDS feature of a given transcript must take into account the value of the frame field (eighth field of the record), which specifies the position of the first base of the feature within the codon it belongs to.

For example, the following CDS record:

ENm006 VEGA_Known CDS   70312   70440   .   -   0   transcript_id "U52112.4-005"; gene_id "ARHGAP4";

represents a fragment of the coding sequence of transcript U52112.4-005 of gene ARHGAP4. The frame value 0 indicates that the first base of the feature sequence is the first base of a codon (that is, the first three bases of the feature form a complete codon). Given that the feature sequence extracted from the genomic reference (taking the - strand into account) is:

cggcaggccaagttcatggagcacaaactcaagtgcacaaaggcgcgcaacgagtacctgcttagcctggctagtgtcaacgctgctgtcagtaactactacctgcatgacgtcttggacctcatggac

its codon subdivision is therefore:

cgg cag gcc aag ttc atg gag cac aaa ctc aag tgc aca aag gcg cgc aac gag tac ctg ctt agc ctg gct agt gtc aac gct gct gtc agt aac tac tac ctg cat gac gtc ttg gac ctc atg gac

The following CDS record:

ENm006 VEGA_Known CDS   72521   72683   .   -   1   transcript_id "U52112.4-019"; gene_id "ARHGAP4";

represents a fragment of the coding sequence of transcript U52112.4-019 of gene ARHGAP4. The frame value 1 indicates that the first base of the feature sequence is the second base of a codon (that is, the first two bases of the feature are the last two bases of a codon whose first base is the last base of a different CDS feature). Given that the feature sequence extracted from the genomic reference (taking the - strand into account) is:

gaaggagccgtccctcctgtcgcccttgcactgctgggcggtgctgctgcagcacacgcggcagcagagccgggagagcgcggccctgagtgaggtgctggccgggcccctggcccagcgcctgagtcacattgcagaggacgtggggcgcctggtcaagaag

its codon subdivision is therefore:

ga agg agc cgt ccc tcc tgt cgc cct tgc act gct ggg cgg tgc tgc tgc agc aca cgc ggc agc aga gcc ggg aga gcg cgg ccc tga gtg agg tgc tgg ccg ggc ccc tgg ccc agc gcc tga gtc aca ttg cag agg acg tgg ggc gcc tgg tca aga ag

The following CDS record:

ENm006 VEGA_Known CDS   72761   72965   .   -   2   transcript_id "U52112.4-003"; gene_id "ARHGAP4";

represents a fragment of the coding sequence of transcript U52112.4-003 of gene ARHGAP4. The frame value 2 indicates that the first base of the feature sequence is the third base of a codon (that is, the first base of the feature is the last base of a codon whose first two bases are the last two bases of a different CDS feature). Given that the feature sequence extracted from the genomic reference (taking the - strand into account) is:

agatgcgctggcagctgagcgagcagctgcgctgcctggagctgcagggcgagctgcggcgggagttgctgcaggagctggcagagttcatgcggcgccgcgctgaggtggagctggaatactcccggggcctggaaaagctggccgagcgcttctccagccgtggaggccgcctggggagcagccgggagcaccaaagcttccg

its codon subdivision is therefore:

a gat gcg ctg gca gct gag cga gca gct gcg ctg cct gga gct gca ggg cga gct gcg gcg gga gtt gct gca gga gct ggc aga gtt cat gcg gcg ccg cgc tga ggt gga gct gga ata ctc ccg ggg cct gga aaa gct ggc cga gcg ctt ctc cag ccg tgg agg ccg cct ggg gag cag ccg gga gca cca aag ctt ccg

Solution

Import the re module to use regular expressions.


In [1]:
import re

Definition of the function reverse_complement_in_case()


In [2]:
def reverse_complement_in_case(nucleotide_sequence, strand):
    complement_dict = {'a' : 't', 't' : 'a', 'c' : 'g', 'g' : 'c'}
    if strand == '-':
        return ''.join([complement_dict[c] for c in nucleotide_sequence.lower()[::-1]])
    else:
        return nucleotide_sequence.lower()

NOTE: make sure that the function always returns a lowercase version of the sequence.
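For comparison, the same behaviour can be sketched with a translation table instead of a dictionary lookup per base. This is only an alternative illustration: the name reverse_complement_sketch() is hypothetical, and mapping the ambiguity code n to itself is an assumption that the exercise does not require.

```python
# Translation table from each lowercase base to its complement.
# Assumption: 'n' (unknown base) is mapped to itself.
COMPLEMENT_TABLE = str.maketrans('acgtn', 'tgcan')

def reverse_complement_sketch(nucleotide_sequence, strand):
    """Return the lowercase sequence, reverse-complemented only on the '-' strand."""
    sequence = nucleotide_sequence.lower()
    if strand == '-':
        # Complement every base, then reverse the string.
        return sequence.translate(COMPLEMENT_TABLE)[::-1]
    return sequence

print(reverse_complement_sketch('ATGC', '-'))  # gcat
print(reverse_complement_sketch('ATGC', '+'))  # atgc
```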

Definition of the function codon_separating()


In [3]:
def codon_separating(nucleotide_sequence, frame):
    return ' '.join([nucleotide_sequence[:3-frame]]+re.findall(r'\w{1,3}', nucleotide_sequence[3-frame:]))

nucleotide_sequence[:3-frame] returns:

  • the first three bases of the sequence if frame = 0
  • the first two bases of the sequence if frame = 1
  • the first base of the sequence if frame = 2

The re.findall() call on nucleotide_sequence[3-frame:] returns the list of triplets (codons) of the sequence starting:

  • from the fourth base if frame = 0
  • from the third base if frame = 1
  • from the second base if frame = 2

NOTE: the last element of the list may be a string of one or two characters (and not a complete codon).
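To make the three cases concrete, the sketch below reimplements the same splitting rule with plain slicing (no regular expression) and applies it to a made-up ten-base sequence. The name codon_separating_sketch() and the toy sequence are illustrative assumptions, not part of the exercise.

```python
def codon_separating_sketch(nucleotide_sequence, frame):
    """Split a sequence into space-separated codons, following the frame convention used here."""
    head_length = 3 - frame  # 3, 2 or 1 leading bases
    chunks = [nucleotide_sequence[:head_length]]
    # Remaining bases are cut into triplets; the last chunk may be shorter.
    for i in range(head_length, len(nucleotide_sequence), 3):
        chunks.append(nucleotide_sequence[i:i + 3])
    return ' '.join(chunks)

toy_sequence = 'atggccaagt'  # made-up sequence, 10 bases
for frame in (0, 1, 2):
    print(frame, codon_separating_sketch(toy_sequence, frame))
# 0 atg gcc aag t
# 1 at ggc caa gt
# 2 a tgg cca agt
```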

Input parameters


In [4]:
gtf_file_name = './input.gtf'
reference_file_name = './ENm006.fa'

Reading the genomic reference and storing it in the variable genomic_reference

Read the rows of the genomic reference file into the list reference_file_rows


In [5]:
with open(reference_file_name, 'r') as reference_input_file:
    reference_file_rows = reference_input_file.readlines()

In [6]:
#reference_file_rows

Concatenate the rows that contain the nucleotide sequence (after removing the trailing newline) into the variable genomic_reference; the first row, which holds the FASTA header, is skipped


In [7]:
genomic_reference = ''.join([row.rstrip() for row in reference_file_rows[1:]])
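The header-skipping and concatenation step can be tried on a small in-memory example. The sketch below assumes a FASTA file with a single record (one header line followed by sequence lines), which is what the slicing from the second row onward implies; the toy content is made up.

```python
import io

# Made-up single-record FASTA content: one header line, two sequence lines.
fasta_text = '>ENm006 toy example\nACGTAC\nGGTTAA\n'

with io.StringIO(fasta_text) as fasta_file:
    rows = fasta_file.readlines()

header = rows[0].rstrip()                              # the '>' line
sequence = ''.join(row.rstrip() for row in rows[1:])   # all remaining lines, joined

print(header)    # >ENm006 toy example
print(sequence)  # ACGTACGGTTAA
```

A file with several records would need the rows to be grouped by header first; that case is outside what this exercise uses.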

In [8]:
#genomic_reference

Reading the records of the GTF file and storing them in the list gtf_file_rows


In [9]:
with open(gtf_file_name, 'r') as gtf_input_file:
    gtf_file_rows = gtf_input_file.readlines()

In [10]:
#gtf_file_rows

Selecting the exon and CDS records

Separate the exon records and the CDS records into two distinct lists, exon_gtf_rows and cds_gtf_rows


In [11]:
exon_gtf_rows = [row for row in gtf_file_rows if row.rstrip().split('\t')[2] == 'exon']
cds_gtf_rows = [row for row in gtf_file_rows if row.rstrip().split('\t')[2] == 'CDS']
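As a possible variation, each record can be split once and grouped by feature type in a single pass, so that later steps work on lists of fields instead of raw rows. The names sample_rows and rows_by_feature are hypothetical, and the sketch assumes tab-separated fields, as the code above does.

```python
# Two records taken from the examples in this document, written with tab separators.
sample_rows = [
    'ENm006\tVEGA_Known\texon\t70312\t70440\t.\t-\t.\ttranscript_id "U52112.4-005"; gene_id "ARHGAP4";\n',
    'ENm006\tVEGA_Known\tCDS\t70312\t70440\t.\t-\t0\ttranscript_id "U52112.4-005"; gene_id "ARHGAP4";\n',
]

rows_by_feature = {}
for row in sample_rows:
    fields = row.rstrip().split('\t')  # split each record only once
    rows_by_feature.setdefault(fields[2], []).append(fields)

exon_fields = rows_by_feature.get('exon', [])
cds_fields = rows_by_feature.get('CDS', [])

print(len(exon_fields), len(cds_fields))  # 1 1
print(cds_fields[0][7])                   # 0 (frame field, still a string)
```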

In [12]:
#exon_gtf_rows

In [13]:
#cds_gtf_rows

Building the exon inclusion dictionary exon_inclusion_dict

From the list exon_gtf_rows, build:

the dictionary exon_inclusion_dict:

  • key: tuple (start, end) that represents an exon (feature of type exon)
  • value: set of the transcripts that include the exon (each transcript is represented by the tuple (transcript_id, gene_id), so that the transcript is tied to the gene it belongs to)

and the dictionary strand_dict:

  • key: gene_id (HUGO name)
  • value: strand of the gene

Initialize the empty dictionaries.


In [14]:
exon_inclusion_dict = dict()
strand_dict = dict()

Iterate over the list exon_gtf_rows to fill the two dictionaries.


In [15]:
for row in exon_gtf_rows:
    transcript_id = re.findall(r'transcript_id\s+"([^"]+)";', row.rstrip().split('\t')[8])[0]
    gene_id = re.findall(r'gene_id\s+"([^"]+)";', row.rstrip().split('\t')[8])[0]

    strand = row.rstrip().split('\t')[6]
    exon_start = int(row.rstrip().split('\t')[3])
    exon_end = int(row.rstrip().split('\t')[4])
    
    strand_dict[gene_id] = strand
    
    exon_set = exon_inclusion_dict.get((exon_start, exon_end), set())
    exon_set.add((transcript_id, gene_id))
    exon_inclusion_dict.update([((exon_start, exon_end), exon_set)])
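The get/add/update pattern used in the loop above can also be written with dict.setdefault(), which returns the existing set for a key or stores and returns a new empty one. A minimal sketch on three records taken from the examples in this document (the variable names records and inclusion are hypothetical):

```python
# (start, end, transcript_id, gene_id) for three exon records from the examples above.
records = [
    (64566, 64757, 'U52112.4-014', 'ARHGAP4'),
    (64566, 64757, 'U52112.4-002', 'ARHGAP4'),
    (70312, 70440, 'U52112.4-005', 'ARHGAP4'),
]

inclusion = {}
for start, end, transcript_id, gene_id in records:
    # setdefault creates the set on first use of the key, then add() updates it in place.
    inclusion.setdefault((start, end), set()).add((transcript_id, gene_id))

print(len(inclusion[(64566, 64757)]))  # 2
print(inclusion[(70312, 70440)])       # {('U52112.4-005', 'ARHGAP4')}
```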

In [16]:
strand_dict


Out[16]:
{'ARHGAP4': '-', 'ATP6AP1': '+', 'AVPR2': '+'}

In [17]:
exon_inclusion_dict


Out[17]:
{(71783, 71788): {('U52112.4-005', 'ARHGAP4')},
 (70312, 70440): {('U52112.4-001', 'ARHGAP4'),
  ('U52112.4-002', 'ARHGAP4'),
  ('U52112.4-003', 'ARHGAP4'),
  ('U52112.4-004', 'ARHGAP4'),
  ('U52112.4-005', 'ARHGAP4'),
  ('U52112.4-011', 'ARHGAP4'),
  ('U52112.4-014', 'ARHGAP4'),
  ('U52112.4-015', 'ARHGAP4'),
  ('U52112.4-020', 'ARHGAP4'),
  ('U52112.4-022', 'ARHGAP4')},
 (69989, 70210): {('U52112.4-001', 'ARHGAP4'),
  ('U52112.4-002', 'ARHGAP4'),
  ('U52112.4-003', 'ARHGAP4'),
  ('U52112.4-005', 'ARHGAP4'),
  ('U52112.4-011', 'ARHGAP4'),
  ('U52112.4-012', 'ARHGAP4'),
  ('U52112.4-014', 'ARHGAP4'),
  ('U52112.4-022', 'ARHGAP4')},
 (64935, 65036): {('U52112.4-001', 'ARHGAP4'),
  ('U52112.4-002', 'ARHGAP4'),
  ('U52112.4-003', 'ARHGAP4'),
  ('U52112.4-005', 'ARHGAP4'),
  ('U52112.4-011', 'ARHGAP4'),
  ('U52112.4-012', 'ARHGAP4'),
  ('U52112.4-013', 'ARHGAP4'),
  ('U52112.4-014', 'ARHGAP4'),
  ('U52112.4-022', 'ARHGAP4'),
  ('U52112.4-024', 'ARHGAP4')},
 (64566, 64673): {('U52112.4-005', 'ARHGAP4')},
 (64385, 64459): {('U52112.4-005', 'ARHGAP4')},
 (79484, 79511): {('U52112.4-018', 'ARHGAP4')},
 (72761, 72965): {('U52112.4-001', 'ARHGAP4'),
  ('U52112.4-002', 'ARHGAP4'),
  ('U52112.4-003', 'ARHGAP4'),
  ('U52112.4-004', 'ARHGAP4'),
  ('U52112.4-011', 'ARHGAP4'),
  ('U52112.4-012', 'ARHGAP4'),
  ('U52112.4-013', 'ARHGAP4'),
  ('U52112.4-014', 'ARHGAP4'),
  ('U52112.4-015', 'ARHGAP4'),
  ('U52112.4-017', 'ARHGAP4'),
  ('U52112.4-018', 'ARHGAP4'),
  ('U52112.4-019', 'ARHGAP4'),
  ('U52112.4-022', 'ARHGAP4'),
  ('U52112.4-024', 'ARHGAP4')},
 (72521, 72683): {('U52112.4-001', 'ARHGAP4'),
  ('U52112.4-002', 'ARHGAP4'),
  ('U52112.4-003', 'ARHGAP4'),
  ('U52112.4-004', 'ARHGAP4'),
  ('U52112.4-011', 'ARHGAP4'),
  ('U52112.4-012', 'ARHGAP4'),
  ('U52112.4-013', 'ARHGAP4'),
  ('U52112.4-014', 'ARHGAP4'),
  ('U52112.4-015', 'ARHGAP4'),
  ('U52112.4-017', 'ARHGAP4'),
  ('U52112.4-018', 'ARHGAP4'),
  ('U52112.4-019', 'ARHGAP4'),
  ('U52112.4-022', 'ARHGAP4'),
  ('U52112.4-024', 'ARHGAP4')},
 (72253, 72379): {('U52112.4-018', 'ARHGAP4')},
 (71896, 71965): {('U52112.4-018', 'ARHGAP4')},
 (77293, 77462): {('U52112.4-014', 'ARHGAP4')},
 (72253, 72315): {('U52112.4-001', 'ARHGAP4'),
  ('U52112.4-002', 'ARHGAP4'),
  ('U52112.4-003', 'ARHGAP4'),
  ('U52112.4-004', 'ARHGAP4'),
  ('U52112.4-014', 'ARHGAP4'),
  ('U52112.4-015', 'ARHGAP4'),
  ('U52112.4-017', 'ARHGAP4'),
  ('U52112.4-019', 'ARHGAP4'),
  ('U52112.4-021', 'ARHGAP4'),
  ('U52112.4-022', 'ARHGAP4'),
  ('U52112.4-024', 'ARHGAP4')},
 (71783, 71993): {('U52112.4-014', 'ARHGAP4')},
 (64566, 64757): {('U52112.4-001', 'ARHGAP4'),
  ('U52112.4-002', 'ARHGAP4'),
  ('U52112.4-003', 'ARHGAP4'),
  ('U52112.4-011', 'ARHGAP4'),
  ('U52112.4-014', 'ARHGAP4'),
  ('U52112.4-024', 'ARHGAP4')},
 (64375, 64459): {('U52112.4-001', 'ARHGAP4'),
  ('U52112.4-002', 'ARHGAP4'),
  ('U52112.4-003', 'ARHGAP4'),
  ('U52112.4-011', 'ARHGAP4'),
  ('U52112.4-014', 'ARHGAP4'),
  ('U52112.4-024', 'ARHGAP4')},
 (64181, 64208): {('U52112.4-001', 'ARHGAP4'),
  ('U52112.4-002', 'ARHGAP4'),
  ('U52112.4-003', 'ARHGAP4'),
  ('U52112.4-011', 'ARHGAP4'),
  ('U52112.4-012', 'ARHGAP4'),
  ('U52112.4-013', 'ARHGAP4'),
  ('U52112.4-014', 'ARHGAP4'),
  ('U52112.4-024', 'ARHGAP4')},
 (63857, 63959): {('U52112.4-001', 'ARHGAP4'),
  ('U52112.4-002', 'ARHGAP4'),
  ('U52112.4-003', 'ARHGAP4'),
  ('U52112.4-011', 'ARHGAP4'),
  ('U52112.4-012', 'ARHGAP4'),
  ('U52112.4-013', 'ARHGAP4'),
  ('U52112.4-014', 'ARHGAP4'),
  ('U52112.4-024', 'ARHGAP4')},
 (62286, 62346): {('U52112.4-001', 'ARHGAP4'),
  ('U52112.4-002', 'ARHGAP4'),
  ('U52112.4-003', 'ARHGAP4'),
  ('U52112.4-010', 'ARHGAP4'),
  ('U52112.4-011', 'ARHGAP4'),
  ('U52112.4-012', 'ARHGAP4'),
  ('U52112.4-013', 'ARHGAP4'),
  ('U52112.4-014', 'ARHGAP4'),
  ('U52112.4-016', 'ARHGAP4'),
  ('U52112.4-024', 'ARHGAP4')},
 (62079, 62156): {('U52112.4-001', 'ARHGAP4'),
  ('U52112.4-003', 'ARHGAP4'),
  ('U52112.4-011', 'ARHGAP4'),
  ('U52112.4-012', 'ARHGAP4'),
  ('U52112.4-013', 'ARHGAP4'),
  ('U52112.4-014', 'ARHGAP4'),
  ('U52112.4-016', 'ARHGAP4'),
  ('U52112.4-024', 'ARHGAP4')},
 (61857, 61991): {('U52112.4-001', 'ARHGAP4'),
  ('U52112.4-002', 'ARHGAP4'),
  ('U52112.4-003', 'ARHGAP4'),
  ('U52112.4-009', 'ARHGAP4'),
  ('U52112.4-010', 'ARHGAP4'),
  ('U52112.4-011', 'ARHGAP4'),
  ('U52112.4-012', 'ARHGAP4'),
  ('U52112.4-013', 'ARHGAP4'),
  ('U52112.4-014', 'ARHGAP4'),
  ('U52112.4-016', 'ARHGAP4'),
  ('U52112.4-024', 'ARHGAP4')},
 (61663, 61768): {('U52112.4-001', 'ARHGAP4'),
  ('U52112.4-003', 'ARHGAP4'),
  ('U52112.4-008', 'ARHGAP4'),
  ('U52112.4-009', 'ARHGAP4'),
  ('U52112.4-010', 'ARHGAP4'),
  ('U52112.4-011', 'ARHGAP4'),
  ('U52112.4-012', 'ARHGAP4'),
  ('U52112.4-013', 'ARHGAP4'),
  ('U52112.4-014', 'ARHGAP4'),
  ('U52112.4-024', 'ARHGAP4')},
 (61328, 61561): {('U52112.4-001', 'ARHGAP4'),
  ('U52112.4-003', 'ARHGAP4'),
  ('U52112.4-008', 'ARHGAP4'),
  ('U52112.4-010', 'ARHGAP4'),
  ('U52112.4-011', 'ARHGAP4'),
  ('U52112.4-012', 'ARHGAP4'),
  ('U52112.4-013', 'ARHGAP4'),
  ('U52112.4-014', 'ARHGAP4'),
  ('U52112.4-024', 'ARHGAP4')},
 (61169, 61242): {('U52112.4-001', 'ARHGAP4'),
  ('U52112.4-002', 'ARHGAP4'),
  ('U52112.4-003', 'ARHGAP4'),
  ('U52112.4-008', 'ARHGAP4'),
  ('U52112.4-009', 'ARHGAP4'),
  ('U52112.4-010', 'ARHGAP4'),
  ('U52112.4-011', 'ARHGAP4'),
  ('U52112.4-012', 'ARHGAP4'),
  ('U52112.4-013', 'ARHGAP4'),
  ('U52112.4-014', 'ARHGAP4'),
  ('U52112.4-016', 'ARHGAP4'),
  ('U52112.4-024', 'ARHGAP4')},
 (60898, 61081): {('U52112.4-001', 'ARHGAP4'),
  ('U52112.4-002', 'ARHGAP4'),
  ('U52112.4-003', 'ARHGAP4'),
  ('U52112.4-008', 'ARHGAP4'),
  ('U52112.4-009', 'ARHGAP4'),
  ('U52112.4-010', 'ARHGAP4'),
  ('U52112.4-011', 'ARHGAP4'),
  ('U52112.4-012', 'ARHGAP4'),
  ('U52112.4-013', 'ARHGAP4'),
  ('U52112.4-014', 'ARHGAP4'),
  ('U52112.4-016', 'ARHGAP4'),
  ('U52112.4-024', 'ARHGAP4')},
 (60600, 60692): {('U52112.4-001', 'ARHGAP4'),
  ('U52112.4-002', 'ARHGAP4'),
  ('U52112.4-003', 'ARHGAP4'),
  ('U52112.4-011', 'ARHGAP4'),
  ('U52112.4-012', 'ARHGAP4'),
  ('U52112.4-013', 'ARHGAP4'),
  ('U52112.4-014', 'ARHGAP4'),
  ('U52112.4-016', 'ARHGAP4'),
  ('U52112.4-024', 'ARHGAP4')},
 (60227, 60326): {('U52112.4-001', 'ARHGAP4'),
  ('U52112.4-002', 'ARHGAP4'),
  ('U52112.4-003', 'ARHGAP4'),
  ('U52112.4-006', 'ARHGAP4'),
  ('U52112.4-011', 'ARHGAP4'),
  ('U52112.4-012', 'ARHGAP4'),
  ('U52112.4-013', 'ARHGAP4'),
  ('U52112.4-014', 'ARHGAP4'),
  ('U52112.4-016', 'ARHGAP4'),
  ('U52112.4-024', 'ARHGAP4')},
 (58626, 59119): {('U52112.4-014', 'ARHGAP4')},
 (86040, 86155): {('U52112.4-022', 'ARHGAP4')},
 (85533, 85631): {('U52112.4-022', 'ARHGAP4')},
 (85099, 85157): {('U52112.4-022', 'ARHGAP4')},
 (83695, 83740): {('U52112.4-022', 'ARHGAP4')},
 (83472, 83587): {('U52112.4-022', 'ARHGAP4')},
 (83227, 83271): {('U52112.4-022', 'ARHGAP4')},
 (71783, 71965): {('U52112.4-001', 'ARHGAP4'),
  ('U52112.4-002', 'ARHGAP4'),
  ('U52112.4-003', 'ARHGAP4'),
  ('U52112.4-004', 'ARHGAP4'),
  ('U52112.4-011', 'ARHGAP4'),
  ('U52112.4-015', 'ARHGAP4'),
  ('U52112.4-020', 'ARHGAP4'),
  ('U52112.4-022', 'ARHGAP4')},
 (64367, 64757): {('U52112.4-022', 'ARHGAP4')},
 (72521, 72556): {('U52112.4-021', 'ARHGAP4')},
 (71569, 71965): {('U52112.4-021', 'ARHGAP4')},
 (60600, 60815): {('U52112.4-006', 'ARHGAP4')},
 (58596, 59119): {('U52112.4-006', 'ARHGAP4')},
 (77973, 78017): {('U52112.4-019', 'ARHGAP4')},
 (71872, 71965): {('U52112.4-019', 'ARHGAP4')},
 (79484, 79576): {('U52112.4-017', 'ARHGAP4')},
 (71865, 71965): {('U52112.4-017', 'ARHGAP4')},
 (77293, 77401): {('U52112.4-002', 'ARHGAP4'),
  ('U52112.4-003', 'ARHGAP4'),
  ('U52112.4-011', 'ARHGAP4'),
  ('U52112.4-012', 'ARHGAP4'),
  ('U52112.4-013', 'ARHGAP4')},
 (64375, 64757): {('U52112.4-012', 'ARHGAP4'), ('U52112.4-013', 'ARHGAP4')},
 (58611, 59119): {('U52112.4-011', 'ARHGAP4'),
  ('U52112.4-012', 'ARHGAP4'),
  ('U52112.4-013', 'ARHGAP4')},
 (58534, 59119): {('U52112.4-002', 'ARHGAP4'), ('U52112.4-003', 'ARHGAP4')},
 (63857, 63894): {('U52112.4-016', 'ARHGAP4')},
 (61328, 61768): {('U52112.4-016', 'ARHGAP4')},
 (58572, 59119): {('U52112.4-016', 'ARHGAP4')},
 (63857, 63942): {('U52112.4-010', 'ARHGAP4')},
 (62079, 62258): {('U52112.4-009', 'ARHGAP4')},
 (60625, 60692): {('U52112.4-009', 'ARHGAP4')},
 (72521, 72560): {('U52112.4-020', 'ARHGAP4')},
 (70724, 70843): {('U52112.4-001', 'ARHGAP4'), ('U52112.4-020', 'ARHGAP4')},
 (70097, 70210): {('U52112.4-020', 'ARHGAP4')},
 (77293, 77417): {('U52112.4-001', 'ARHGAP4')},
 (58533, 59119): {('U52112.4-001', 'ARHGAP4')},
 (77293, 77411): {('U52112.4-024', 'ARHGAP4')},
 (58524, 59119): {('U52112.4-024', 'ARHGAP4')},
 (61169, 61676): {('U52112.4-007', 'ARHGAP4')},
 (60976, 61081): {('U52112.4-007', 'ARHGAP4')},
 (77747, 77914): {('U52112.4-015', 'ARHGAP4')},
 (69999, 70210): {('U52112.4-015', 'ARHGAP4')},
 (61857, 62028): {('U52112.4-008', 'ARHGAP4')},
 (60665, 60692): {('U52112.4-008', 'ARHGAP4')},
 (73280, 73404): {('U52112.4-004', 'ARHGAP4')},
 (70119, 70210): {('U52112.4-004', 'ARHGAP4')},
 (64935, 65410): {('U52112.4-023', 'ARHGAP4')},
 (64681, 64757): {('U52112.4-023', 'ARHGAP4')},
 (542747, 542902): {('XX-FW83563B9.4-002', 'ATP6AP1')},
 (543097, 545706): {('XX-FW83563B9.4-002', 'ATP6AP1')},
 (545879, 545953): {('XX-FW83563B9.4-001', 'ATP6AP1'),
  ('XX-FW83563B9.4-002', 'ATP6AP1'),
  ('XX-FW83563B9.4-003', 'ATP6AP1'),
  ('XX-FW83563B9.4-004', 'ATP6AP1'),
  ('XX-FW83563B9.4-006', 'ATP6AP1')},
 (546315, 546508): {('XX-FW83563B9.4-001', 'ATP6AP1'),
  ('XX-FW83563B9.4-002', 'ATP6AP1'),
  ('XX-FW83563B9.4-004', 'ATP6AP1')},
 (546980, 547020): {('XX-FW83563B9.4-001', 'ATP6AP1'),
  ('XX-FW83563B9.4-002', 'ATP6AP1'),
  ('XX-FW83563B9.4-004', 'ATP6AP1')},
 (547684, 547769): {('XX-FW83563B9.4-001', 'ATP6AP1'),
  ('XX-FW83563B9.4-002', 'ATP6AP1'),
  ('XX-FW83563B9.4-004', 'ATP6AP1')},
 (548257, 548495): {('XX-FW83563B9.4-001', 'ATP6AP1'),
  ('XX-FW83563B9.4-002', 'ATP6AP1'),
  ('XX-FW83563B9.4-003', 'ATP6AP1'),
  ('XX-FW83563B9.4-004', 'ATP6AP1'),
  ('XX-FW83563B9.4-006', 'ATP6AP1')},
 (542694, 542902): {('XX-FW83563B9.4-003', 'ATP6AP1')},
 (543097, 543223): {('XX-FW83563B9.4-001', 'ATP6AP1'),
  ('XX-FW83563B9.4-003', 'ATP6AP1'),
  ('XX-FW83563B9.4-006', 'ATP6AP1')},
 (546315, 546425): {('XX-FW83563B9.4-003', 'ATP6AP1')},
 (542894, 543223): {('XX-FW83563B9.4-004', 'ATP6AP1')},
 (542790, 542902): {('XX-FW83563B9.4-006', 'ATP6AP1')},
 (547709, 547769): {('XX-FW83563B9.4-006', 'ATP6AP1')},
 (542687, 542902): {('XX-FW83563B9.4-001', 'ATP6AP1')},
 (56271, 56327): {('U52112.2-003', 'AVPR2')},
 (56689, 57938): {('U52112.2-003', 'AVPR2')},
 (53688, 54049): {('U52112.2-002', 'AVPR2')},
 (56131, 56327): {('U52112.2-001', 'AVPR2'), ('U52112.2-002', 'AVPR2')},
 (57176, 57573): {('U52112.2-002', 'AVPR2')},
 (57680, 58322): {('U52112.2-002', 'AVPR2')},
 (55892, 55928): {('U52112.2-001', 'AVPR2')},
 (56689, 57573): {('U52112.2-001', 'AVPR2')},
 (57680, 58323): {('U52112.2-001', 'AVPR2')}}

Determining the exons completely covered by coding sequence

From the list cds_gtf_rows, build the dictionary cds_inclusion_dict of the CDS features:

  • key: tuple (start, end) that represents a feature of type CDS
  • value: set of the transcripts for which the feature is a fragment of coding sequence (each transcript is represented by the tuple (transcript_id, gene_id), so that the transcript is tied to the gene it belongs to)

Initialize the empty dictionary.


In [18]:
cds_inclusion_dict = dict()

Iterate over the list cds_gtf_rows to fill the dictionary.


In [19]:
for row in cds_gtf_rows:
    transcript_id = re.findall(r'transcript_id\s+"([^"]+)";', row.rstrip().split('\t')[8])[0]
    gene_id = re.findall(r'gene_id\s+"([^"]+)";', row.rstrip().split('\t')[8])[0]

    cds_start = int(row.rstrip().split('\t')[3])
    cds_end = int(row.rstrip().split('\t')[4])
        
    cds_set = cds_inclusion_dict.get((cds_start, cds_end), set())
    cds_set.add((transcript_id, gene_id))
    cds_inclusion_dict.update([((cds_start, cds_end), cds_set)])

In [20]:
#cds_inclusion_dict

From the list cds_gtf_rows, build the dictionary frame_dict, which gives access, for a given CDS feature, to the frame value with respect to every transcript for which the feature is a fragment of coding sequence:

  • key: tuple (start, end) that represents a feature of type CDS
  • value: nested dictionary (dict1 in the code):
    • key: gene_id (HUGO name)
    • value: nested dictionary (dict2 in the code):
      • key: transcript_id
      • value: frame

Initialize the empty dictionary.


In [21]:
frame_dict = dict()

Iterate over the list cds_gtf_rows to fill the dictionary.


In [22]:
for row in cds_gtf_rows:
    transcript_id = re.findall(r'transcript_id\s+"([^"]+)";', row.rstrip().split('\t')[8])[0]
    gene_id = re.findall(r'gene_id\s+"([^"]+)";', row.rstrip().split('\t')[8])[0]

    cds_start = int(row.rstrip().split('\t')[3])
    cds_end = int(row.rstrip().split('\t')[4])
    frame = int(row.rstrip().split('\t')[7])
        
    dict1 = frame_dict.get((cds_start, cds_end),dict())
    dict2 = dict1.get(gene_id, dict())
    dict2.update([(transcript_id, frame)])
    dict1.update([(gene_id, dict2)])
    frame_dict.update([((cds_start, cds_end), dict1)])
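The three-level structure (feature, then gene, then transcript) may be easier to see on a small example. The sketch below builds the same shape with chained dict.setdefault() calls, using the three CDS records discussed in the notes on codon subdivision; the names cds_records and frame_lookup are hypothetical.

```python
# (start, end, gene_id, transcript_id, frame) for the three CDS records discussed earlier.
cds_records = [
    (70312, 70440, 'ARHGAP4', 'U52112.4-005', 0),
    (72521, 72683, 'ARHGAP4', 'U52112.4-019', 1),
    (72761, 72965, 'ARHGAP4', 'U52112.4-003', 2),
]

frame_lookup = {}
for start, end, gene_id, transcript_id, frame in cds_records:
    # feature -> gene -> transcript -> frame
    frame_lookup.setdefault((start, end), {}).setdefault(gene_id, {})[transcript_id] = frame

print(frame_lookup[(72521, 72683)]['ARHGAP4']['U52112.4-019'])  # 1
```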

In [23]:
#frame_dict

The keys of cds_inclusion_dict that do not appear in exon_inclusion_dict represent CDS features that do not completely cover an exon, and must therefore be discarded.

It is therefore enough to delete from cds_inclusion_dict the keys that belong to the difference between the key set of cds_inclusion_dict and the key set of exon_inclusion_dict.


In [24]:
key_to_discard = set(cds_inclusion_dict).difference(set(exon_inclusion_dict))

u_list = [cds_inclusion_dict.pop(del_key) for del_key in key_to_discard]

The assignment to the list u_list is only a way to suppress useless output.

At this point, given a key (start, end) in exon_inclusion_dict, the corresponding value is the set of the transcripts that include the exon (start, end). The same key in cds_inclusion_dict has as its value the set of the transcripts for which the CDS feature (start, end) is the fragment of coding sequence that completely covers the exon (start, end). The intersection of these two sets gives the set of the transcripts for which the exon (start, end) is completely covered by coding sequence.

For each key (start, end), the values of cds_inclusion_dict must therefore be updated with the result of the intersection between the set associated with the key (start, end) in cds_inclusion_dict and the set associated with the same key in exon_inclusion_dict.


In [25]:
u_list= [cds_inclusion_dict.update([(key, exon_inclusion_dict[key].intersection(cds_inclusion_dict[key]))]) for key in cds_inclusion_dict]
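The two steps (discarding keys that are not exons, then intersecting the transcript sets) can be illustrated together on toy data. The sketch below builds a new dictionary with a comprehension instead of modifying the existing one; exon_toy, cds_toy and covered are hypothetical names, and the intervals and identifiers are made up.

```python
# Made-up toy dictionaries with the same shape as exon_inclusion_dict and cds_inclusion_dict.
exon_toy = {
    (1, 10): {('t1', 'g'), ('t2', 'g')},
    (20, 30): {('t1', 'g')},
}
cds_toy = {
    (1, 10): {('t1', 'g')},   # same interval as an exon: kept
    (22, 30): {('t1', 'g')},  # not an exon interval: discarded
}

# Keep only the keys present in both dictionaries and intersect their transcript sets.
covered = {
    key: exon_toy[key] & cds_toy[key]
    for key in cds_toy.keys() & exon_toy.keys()
}

print(covered)  # {(1, 10): {('t1', 'g')}}
```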

In [26]:
#cds_inclusion_dict

The keys of cds_inclusion_dict now give all the tuples (start, end) that represent exons completely covered by coding sequence, and the corresponding values are the sets of the transcripts for which the exon (start, end) is completely covered by coding sequence.

It is therefore enough to iterate over cds_inclusion_dict to produce the output list exon_coverage_list.

For each key (start, end), the corresponding set of transcripts (a set of tuples (transcript_id, gene_id)) is retrieved.

For each transcript in the set, the frame value is retrieved from frame_dict and the strand of the gene from strand_dict. The feature sequence is sliced out of the genomic reference and passed through reverse_complement_in_case(); the codon subdivision string codon_string is then obtained from it with codon_separating(). The tuple (transcript_id, gene_id, start, end, codon_string) is appended to the list exon_coverage_list.


In [27]:
exon_coverage_list = []

for feature in cds_inclusion_dict:
    for transcript_tuple in cds_inclusion_dict[feature]:
        transcript_id = transcript_tuple[0]
        gene_id = transcript_tuple[1]

        frame = frame_dict[feature][gene_id][transcript_id]
        strand = strand_dict[gene_id]
        
        feature_sequence = reverse_complement_in_case(genomic_reference[feature[0]-1: feature[1]], strand)       
        exon_coverage_list.append((transcript_id, gene_id, feature[0], feature[1], codon_separating(feature_sequence, frame)))
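The slice used in the loop above converts GTF coordinates, which are 1-based with both ends included, into a Python slice, which is 0-based with the end excluded: only the start needs to be shifted by one. A small sketch on a made-up reference string:

```python
reference = 'acgtacgtac'   # made-up reference, positions 1..10
start, end = 3, 6          # 1-based, both ends included (GTF convention)

# Shift only the start: the exclusive slice end coincides with the inclusive 1-based end.
feature_sequence = reference[start - 1:end]

print(feature_sequence)                         # gtac
print(len(feature_sequence) == end - start + 1)  # True
```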

In [28]:
exon_coverage_list


Out[28]:
[('U52112.4-005', 'ARHGAP4', 71783, 71788, 'gag aag'),
 ('U52112.4-001',
  'ARHGAP4',
  70312,
  70440,
  'cgg cag gcc aag ttc atg gag cac aaa ctc aag tgc aca aag gcg cgc aac gag tac ctg ctt agc ctg gct agt gtc aac gct gct gtc agt aac tac tac ctg cat gac gtc ttg gac ctc atg gac'),
 ('U52112.4-020',
  'ARHGAP4',
  70312,
  70440,
  'cgg cag gcc aag ttc atg gag cac aaa ctc aag tgc aca aag gcg cgc aac gag tac ctg ctt agc ctg gct agt gtc aac gct gct gtc agt aac tac tac ctg cat gac gtc ttg gac ctc atg gac'),
 ('U52112.4-011',
  'ARHGAP4',
  70312,
  70440,
  'cgg cag gcc aag ttc atg gag cac aaa ctc aag tgc aca aag gcg cgc aac gag tac ctg ctt agc ctg gct agt gtc aac gct gct gtc agt aac tac tac ctg cat gac gtc ttg gac ctc atg gac'),
 ('U52112.4-003',
  'ARHGAP4',
  70312,
  70440,
  'cgg cag gcc aag ttc atg gag cac aaa ctc aag tgc aca aag gcg cgc aac gag tac ctg ctt agc ctg gct agt gtc aac gct gct gtc agt aac tac tac ctg cat gac gtc ttg gac ctc atg gac'),
 ('U52112.4-005',
  'ARHGAP4',
  70312,
  70440,
  'cgg cag gcc aag ttc atg gag cac aaa ctc aag tgc aca aag gcg cgc aac gag tac ctg ctt agc ctg gct agt gtc aac gct gct gtc agt aac tac tac ctg cat gac gtc ttg gac ctc atg gac'),
 ('U52112.4-001',
  'ARHGAP4',
  69989,
  70210,
  'tgc tgt gac aca ggg ttc cac ctg gcc ctg ggg cag gtg ctc cgg agc tac acg gcc gct gag agc cgc acc caa gcc tcc caa gtg cag ggc ctg ggc agc ctg gaa gaa gct gtg gag gcc ctg gat cct cca ggg gac aaa gcc aag gtt ctc gag gtg cat gct acc gtc ttc tgt ccc ccg ctg cgc ttt gac tac cac ccc cat gat ggg gat gag'),
 ('U52112.4-011',
  'ARHGAP4',
  69989,
  70210,
  'tgc tgt gac aca ggg ttc cac ctg gcc ctg ggg cag gtg ctc cgg agc tac acg gcc gct gag agc cgc acc caa gcc tcc caa gtg cag ggc ctg ggc agc ctg gaa gaa gct gtg gag gcc ctg gat cct cca ggg gac aaa gcc aag gtt ctc gag gtg cat gct acc gtc ttc tgt ccc ccg ctg cgc ttt gac tac cac ccc cat gat ggg gat gag'),
 ('U52112.4-003',
  'ARHGAP4',
  69989,
  70210,
  'tgc tgt gac aca ggg ttc cac ctg gcc ctg ggg cag gtg ctc cgg agc tac acg gcc gct gag agc cgc acc caa gcc tcc caa gtg cag ggc ctg ggc agc ctg gaa gaa gct gtg gag gcc ctg gat cct cca ggg gac aaa gcc aag gtt ctc gag gtg cat gct acc gtc ttc tgt ccc ccg ctg cgc ttt gac tac cac ccc cat gat ggg gat gag'),
 ('U52112.4-005',
  'ARHGAP4',
  69989,
  70210,
  'tgc tgt gac aca ggg ttc cac ctg gcc ctg ggg cag gtg ctc cgg agc tac acg gcc gct gag agc cgc acc caa gcc tcc caa gtg cag ggc ctg ggc agc ctg gaa gaa gct gtg gag gcc ctg gat cct cca ggg gac aaa gcc aag gtt ctc gag gtg cat gct acc gtc ttc tgt ccc ccg ctg cgc ttt gac tac cac ccc cat gat ggg gat gag'),
 ('U52112.4-001',
  'ARHGAP4',
  64935,
  65036,
  'gtg gct gag atc tgc gtt gaa atg gag ctg cgg gac gag att ctg ccc aga gcc cag aac atc cag agc cgc ctg gac cga cag acc att gag aca gag gag'),
 ('U52112.4-011',
  'ARHGAP4',
  64935,
  65036,
  'gtg gct gag atc tgc gtt gaa atg gag ctg cgg gac gag att ctg ccc aga gcc cag aac atc cag agc cgc ctg gac cga cag acc att gag aca gag gag'),
 ('U52112.4-003',
  'ARHGAP4',
  64935,
  65036,
  'gtg gct gag atc tgc gtt gaa atg gag ctg cgg gac gag att ctg ccc aga gcc cag aac atc cag agc cgc ctg gac cga cag acc att gag aca gag gag'),
 ('U52112.4-005',
  'ARHGAP4',
  64935,
  65036,
  'gtg gct gag atc tgc gtt gaa atg gag ctg cgg gac gag att ctg ccc aga gcc cag aac atc cag agc cgc ctg gac cga cag acc att gag aca gag gag'),
 ('U52112.4-024',
  'ARHGAP4',
  64935,
  65036,
  'gtg gct gag atc tgc gtt gaa atg gag ctg cgg gac gag att ctg ccc aga gcc cag aac atc cag agc cgc ctg gac cga cag acc att gag aca gag gag'),
 ('U52112.4-005',
  'ARHGAP4',
  64566,
  64673,
  'acc agc ccc tcc acc gag tcc ctc aag tcc acc agc tca gac cca ggc agc cgg cag gcg ggc cgg agg cgc ggc cag cag cag gag acc gaa acc ttc tac ctc acg'),
 ('U52112.4-005',
  'ARHGAP4',
  64385,
  64459,
  'aag ctc cag gag tat ctg agt gga cgg agc atc ctc gcc aag ctg cag gcc aag cac gag aag ctg cag gag gcc'),
 ('U52112.4-001',
  'ARHGAP4',
  72521,
  72683,
  'ga agg agc cgt ccc tcc tgt cgc cct tgc act gct ggg cgg tgc tgc tgc agc aca cgc ggc agc aga gcc ggg aga gcg cgg ccc tga gtg agg tgc tgg ccg ggc ccc tgg ccc agc gcc tga gtc aca ttg cag agg acg tgg ggc gcc tgg tca aga ag'),
 ('U52112.4-011',
  'ARHGAP4',
  72521,
  72683,
  'ga agg agc cgt ccc tcc tgt cgc cct tgc act gct ggg cgg tgc tgc tgc agc aca cgc ggc agc aga gcc ggg aga gcg cgg ccc tga gtg agg tgc tgg ccg ggc ccc tgg ccc agc gcc tga gtc aca ttg cag agg acg tgg ggc gcc tgg tca aga ag'),
 ('U52112.4-019',
  'ARHGAP4',
  72521,
  72683,
  'ga agg agc cgt ccc tcc tgt cgc cct tgc act gct ggg cgg tgc tgc tgc agc aca cgc ggc agc aga gcc ggg aga gcg cgg ccc tga gtg agg tgc tgg ccg ggc ccc tgg ccc agc gcc tga gtc aca ttg cag agg acg tgg ggc gcc tgg tca aga ag'),
 ('U52112.4-003',
  'ARHGAP4',
  72521,
  72683,
  'ga agg agc cgt ccc tcc tgt cgc cct tgc act gct ggg cgg tgc tgc tgc agc aca cgc ggc agc aga gcc ggg aga gcg cgg ccc tga gtg agg tgc tgg ccg ggc ccc tgg ccc agc gcc tga gtc aca ttg cag agg acg tgg ggc gcc tgg tca aga ag'),
 ('U52112.4-017',
  'ARHGAP4',
  72521,
  72683,
  'ga agg agc cgt ccc tcc tgt cgc cct tgc act gct ggg cgg tgc tgc tgc agc aca cgc ggc agc aga gcc ggg aga gcg cgg ccc tga gtg agg tgc tgg ccg ggc ccc tgg ccc agc gcc tga gtc aca ttg cag agg acg tgg ggc gcc tgg tca aga ag'),
 ('U52112.4-024',
  'ARHGAP4',
  72521,
  72683,
  'ga agg agc cgt ccc tcc tgt cgc cct tgc act gct ggg cgg tgc tgc tgc agc aca cgc ggc agc aga gcc ggg aga gcg cgg ccc tga gtg agg tgc tgg ccg ggc ccc tgg ccc agc gcc tga gtc aca ttg cag agg acg tgg ggc gcc tgg tca aga ag'),
 ('U52112.4-001',
  'ARHGAP4',
  72253,
  72315,
  'agc agg gat ctg gag cag cag ctg cag gat gag ctc ctg gag gtg gtc tca gag ctc cag acg'),
 ('U52112.4-019',
  'ARHGAP4',
  72253,
  72315,
  'agc agg gat ctg gag cag cag ctg cag gat gag ctc ctg gag gtg gtc tca gag ctc cag acg'),
 ('U52112.4-003',
  'ARHGAP4',
  72253,
  72315,
  'agc agg gat ctg gag cag cag ctg cag gat gag ctc ctg gag gtg gtc tca gag ctc cag acg'),
 ('U52112.4-017',
  'ARHGAP4',
  72253,
  72315,
  'agc agg gat ctg gag cag cag ctg cag gat gag ctc ctg gag gtg gtc tca gag ctc cag acg'),
 ('U52112.4-024',
  'ARHGAP4',
  72253,
  72315,
  'agc agg gat ctg gag cag cag ctg cag gat gag ctc ctg gag gtg gtc tca gag ctc cag acg'),
 ('U52112.4-019',
  'ARHGAP4',
  71872,
  71965,
  'gcc aag aag acg tac cag gca tat cac atg gag agc gtg aat gcc gag gcc aag ctc cgg gag gcc gag cgg cag gag gag aag cgg gca ggc c'),
 ('U52112.4-017',
  'ARHGAP4',
  71865,
  71965,
  'gcc aag aag acg tac cag gca tat cac atg gag agc gtg aat gcc gag gcc aag ctc cgg gag gcc gag cgg cag gag gag aag cgg gca ggc cgg agt gt'),
 ('U52112.4-010',
  'ARHGAP4',
  63857,
  63942,
  'g aaa att cca gaa gag ccg cca gcc ccg ccc cag ctc cca gta taa cca gag act ctt tgg ggg aga cat gga gaa gtt tat cca g'),
 ('U52112.4-001',
  'ARHGAP4',
  62286,
  62346,
  'agc tca ggc cag cct gtg ccc ctg gtg gtg gag agc tgc att cgc ttc atc aac ctc aat g'),
 ('U52112.4-011',
  'ARHGAP4',
  62286,
  62346,
  'agc tca ggc cag cct gtg ccc ctg gtg gtg gag agc tgc att cgc ttc atc aac ctc aat g'),
 ('U52112.4-003',
  'ARHGAP4',
  62286,
  62346,
  'agc tca ggc cag cct gtg ccc ctg gtg gtg gag agc tgc att cgc ttc atc aac ctc aat g'),
 ('U52112.4-010',
  'ARHGAP4',
  62286,
  62346,
  'agc tca ggc cag cct gtg ccc ctg gtg gtg gag agc tgc att cgc ttc atc aac ctc aat g'),
 ('U52112.4-024',
  'ARHGAP4',
  62286,
  62346,
  'agc tca ggc cag cct gtg ccc ctg gtg gtg gag agc tgc att cgc ttc atc aac ctc aat g'),
 ('U52112.4-001',
  'ARHGAP4',
  61857,
  61991,
  'g gga gga ccc act ggt gga ggg ctg cac tgc cca tga cct gga ctc ggt ggc cgg ggt gct gaa gct cta ctt ccg gag cct gga gcc ccc act ctt ccc ccc aga cct gtt cgg cga gct gct ggc ttc ttc gg'),
 ('U52112.4-011',
  'ARHGAP4',
  61857,
  61991,
  'g gga gga ccc act ggt gga ggg ctg cac tgc cca tga cct gga ctc ggt ggc cgg ggt gct gaa gct cta ctt ccg gag cct gga gcc ccc act ctt ccc ccc aga cct gtt cgg cga gct gct ggc ttc ttc gg'),
 ('U52112.4-003',
  'ARHGAP4',
  61857,
  61991,
  'g gga gga ccc act ggt gga ggg ctg cac tgc cca tga cct gga ctc ggt ggc cgg ggt gct gaa gct cta ctt ccg gag cct gga gcc ccc act ctt ccc ccc aga cct gtt cgg cga gct gct ggc ttc ttc gg'),
 ('U52112.4-010',
  'ARHGAP4',
  61857,
  61991,
  'g gga gga ccc act ggt gga ggg ctg cac tgc cca tga cct gga ctc ggt ggc cgg ggt gct gaa gct cta ctt ccg gag cct gga gcc ccc act ctt ccc ccc aga cct gtt cgg cga gct gct ggc ttc ttc gg'),
 ('U52112.4-024',
  'ARHGAP4',
  61857,
  61991,
  'g gga gga ccc act ggt gga ggg ctg cac tgc cca tga cct gga ctc ggt ggc cgg ggt gct gaa gct cta ctt ccg gag cct gga gcc ccc act ctt ccc ccc aga cct gtt cgg cga gct gct ggc ttc ttc gg'),
 ('U52112.4-001',
  'ARHGAP4',
  61663,
  61768,
  'a gct gga ggc cac agc gga gag ggt gga gca cgt gag ccg cct gct gtg gcg gct gcc cgc gcc ggt gct ggt ggt tct gcg cta cct ctt cac ctt cct caa cca'),
 ('U52112.4-011',
  'ARHGAP4',
  61663,
  61768,
  'a gct gga ggc cac agc gga gag ggt gga gca cgt gag ccg cct gct gtg gcg gct gcc cgc gcc ggt gct ggt ggt tct gcg cta cct ctt cac ctt cct caa cca'),
 ('U52112.4-003',
  'ARHGAP4',
  61663,
  61768,
  'a gct gga ggc cac agc gga gag ggt gga gca cgt gag ccg cct gct gtg gcg gct gcc cgc gcc ggt gct ggt ggt tct gcg cta cct ctt cac ctt cct caa cca'),
 ('U52112.4-010',
  'ARHGAP4',
  61663,
  61768,
  'a gct gga ggc cac agc gga gag ggt gga gca cgt gag ccg cct gct gtg gcg gct gcc cgc gcc ggt gct ggt ggt tct gcg cta cct ctt cac ctt cct caa cca'),
 ('U52112.4-024',
  'ARHGAP4',
  61663,
  61768,
  'a gct gga ggc cac agc gga gag ggt gga gca cgt gag ccg cct gct gtg gcg gct gcc cgc gcc ggt gct ggt ggt tct gcg cta cct ctt cac ctt cct caa cca'),
 ('U52112.4-001',
  'ARHGAP4',
  61328,
  61561,
  'cc tgg ccc agt aca gcg atg aga aca tga tgg acc cct aca acc tgg ccg tgt gct tcg ggc cca cgc tgc tac cgg tgc ccg ctg ggc agg acc cgg tgg cgc tgc agg gcc ggg tga acc agc tgg tgc aga cgc tca tag tgc agc ccg atc ggg tct tcc cgc ccc tga cct cgc tgc ctg gcc ccg tct acg aga agt gca tgg cac cgc ctt ccg cca gct gcc tgg g'),
 ('U52112.4-011',
  'ARHGAP4',
  61328,
  61561,
  'cc tgg ccc agt aca gcg atg aga aca tga tgg acc cct aca acc tgg ccg tgt gct tcg ggc cca cgc tgc tac cgg tgc ccg ctg ggc agg acc cgg tgg cgc tgc agg gcc ggg tga acc agc tgg tgc aga cgc tca tag tgc agc ccg atc ggg tct tcc cgc ccc tga cct cgc tgc ctg gcc ccg tct acg aga agt gca tgg cac cgc ctt ccg cca gct gcc tgg g'),
 ('U52112.4-003',
  'ARHGAP4',
  61328,
  61561,
  'cc tgg ccc agt aca gcg atg aga aca tga tgg acc cct aca acc tgg ccg tgt gct tcg ggc cca cgc tgc tac cgg tgc ccg ctg ggc agg acc cgg tgg cgc tgc agg gcc ggg tga acc agc tgg tgc aga cgc tca tag tgc agc ccg atc ggg tct tcc cgc ccc tga cct cgc tgc ctg gcc ccg tct acg aga agt gca tgg cac cgc ctt ccg cca gct gcc tgg g'),
 ('U52112.4-010',
  'ARHGAP4',
  61328,
  61561,
  'cc tgg ccc agt aca gcg atg aga aca tga tgg acc cct aca acc tgg ccg tgt gct tcg ggc cca cgc tgc tac cgg tgc ccg ctg ggc agg acc cgg tgg cgc tgc agg gcc ggg tga acc agc tgg tgc aga cgc tca tag tgc agc ccg atc ggg tct tcc cgc ccc tga cct cgc tgc ctg gcc ccg tct acg aga agt gca tgg cac cgc ctt ccg cca gct gcc tgg g'),
 ('U52112.4-024',
  'ARHGAP4',
  61328,
  61561,
  'cc tgg ccc agt aca gcg atg aga aca tga tgg acc cct aca acc tgg ccg tgt gct tcg ggc cca cgc tgc tac cgg tgc ccg ctg ggc agg acc cgg tgg cgc tgc agg gcc ggg tga acc agc tgg tgc aga cgc tca tag tgc agc ccg atc ggg tct tcc cgc ccc tga cct cgc tgc ctg gcc ccg tct acg aga agt gca tgg cac cgc ctt ccg cca gct gcc tgg g'),
 ('U52112.4-001',
  'ARHGAP4',
  61169,
  61242,
  'gg acg ccc agc tgg aga gcc tgg ggg cgg aca atg agc cgg agc tgg aag ccg aga tgc ccg cac agg agg atg'),
 ('U52112.4-011',
  'ARHGAP4',
  61169,
  61242,
  'gg acg ccc agc tgg aga gcc tgg ggg cgg aca atg agc cgg agc tgg aag ccg aga tgc ccg cac agg agg atg'),
 ('U52112.4-003',
  'ARHGAP4',
  61169,
  61242,
  'gg acg ccc agc tgg aga gcc tgg ggg cgg aca atg agc cgg agc tgg aag ccg aga tgc ccg cac agg agg atg'),
 ('U52112.4-010',
  'ARHGAP4',
  61169,
  61242,
  'gg acg ccc agc tgg aga gcc tgg ggg cgg aca atg agc cgg agc tgg aag ccg aga tgc ccg cac agg agg atg'),
 ('U52112.4-024',
  'ARHGAP4',
  61169,
  61242,
  'gg acg ccc agc tgg aga gcc tgg ggg cgg aca atg agc cgg agc tgg aag ccg aga tgc ccg cac agg agg atg'),
 ('U52112.4-001',
  'ARHGAP4',
  60898,
  61081,
  'a cct gga ggg ggt cgt gga ggc tgt ggc ctg ctt tgc cta cac ggg ccg cac agc cca gga gct gag ctt ccg gcg ggg gga cgt act gcg gct gca cga gag ggc ctc gag cga ctg gtg gcg ggg gga gca caa cgg cat gcg ggg cct cat ccc cca caa gta tat cac gct gcc cgc cgg'),
 ('U52112.4-011',
  'ARHGAP4',
  60898,
  61081,
  'a cct gga ggg ggt cgt gga ggc tgt ggc ctg ctt tgc cta cac ggg ccg cac agc cca gga gct gag ctt ccg gcg ggg gga cgt act gcg gct gca cga gag ggc ctc gag cga ctg gtg gcg ggg gga gca caa cgg cat gcg ggg cct cat ccc cca caa gta tat cac gct gcc cgc cgg'),
 ('U52112.4-003',
  'ARHGAP4',
  60898,
  61081,
  'a cct gga ggg ggt cgt gga ggc tgt ggc ctg ctt tgc cta cac ggg ccg cac agc cca gga gct gag ctt ccg gcg ggg gga cgt act gcg gct gca cga gag ggc ctc gag cga ctg gtg gcg ggg gga gca caa cgg cat gcg ggg cct cat ccc cca caa gta tat cac gct gcc cgc cgg'),
 ('U52112.4-010',
  'ARHGAP4',
  60898,
  61081,
  'a cct gga ggg ggt cgt gga ggc tgt ggc ctg ctt tgc cta cac ggg ccg cac agc cca gga gct gag ctt ccg gcg ggg gga cgt act gcg gct gca cga gag ggc ctc gag cga ctg gtg gcg ggg gga gca caa cgg cat gcg ggg cct cat ccc cca caa gta tat cac gct gcc cgc cgg'),
 ('U52112.4-024',
  'ARHGAP4',
  60898,
  61081,
  'a cct gga ggg ggt cgt gga ggc tgt ggc ctg ctt tgc cta cac ggg ccg cac agc cca gga gct gag ctt ccg gcg ggg gga cgt act gcg gct gca cga gag ggc ctc gag cga ctg gtg gcg ggg gga gca caa cgg cat gcg ggg cct cat ccc cca caa gta tat cac gct gcc cgc cgg'),
 ('U52112.4-020',
  'ARHGAP4',
  72521,
  72560,
  'ga gtc aca ttg cag agg acg tgg ggc gcc tgg tca aga ag'),
 ('U52112.4-020',
  'ARHGAP4',
  71783,
  71965,
  'gcc aag aag acg tac cag gca tat cac atg gag agc gtg aat gcc gag gcc aag ctc cgg gag gcc gag cgg cag gag gag aag cgg gca ggc cgg agt gtc ccc acc acc acc gct ggt gcc act gag gca ggg ccc ctc cgc aag agc tcc ctc aag aag gga ggg agg ctg gtg gag aag'),
 ('U52112.4-001',
  'ARHGAP4',
  71783,
  71965,
  'gcc aag aag acg tac cag gca tat cac atg gag agc gtg aat gcc gag gcc aag ctc cgg gag gcc gag cgg cag gag gag aag cgg gca ggc cgg agt gtc ccc acc acc acc gct ggt gcc act gag gca ggg ccc ctc cgc aag agc tcc ctc aag aag gga ggg agg ctg gtg gag aag'),
 ('U52112.4-011',
  'ARHGAP4',
  71783,
  71965,
  'gcc aag aag acg tac cag gca tat cac atg gag agc gtg aat gcc gag gcc aag ctc cgg gag gcc gag cgg cag gag gag aag cgg gca ggc cgg agt gtc ccc acc acc acc gct ggt gcc act gag gca ggg ccc ctc cgc aag agc tcc ctc aag aag gga ggg agg ctg gtg gag aag'),
 ('U52112.4-003',
  'ARHGAP4',
  71783,
  71965,
  'gcc aag aag acg tac cag gca tat cac atg gag agc gtg aat gcc gag gcc aag ctc cgg gag gcc gag cgg cag gag gag aag cgg gca ggc cgg agt gtc ccc acc acc acc gct ggt gcc act gag gca ggg ccc ctc cgc aag agc tcc ctc aag aag gga ggg agg ctg gtg gag aag'),
 ('U52112.4-001',
  'ARHGAP4',
  70724,
  70843,
  'ctc tgg ccc ccg cag agg cct gtg gcc gct tcc agc tgt gca cct gtg tgc tgg ctc caa gct ggg ttt ctc gtg cac cct cca tgg tgg ggt gcc atg tgc gca cct tcc act cat cag'),
 ('U52112.4-020',
  'ARHGAP4',
  70724,
  70843,
  'ctc tgg ccc ccg cag agg cct gtg gcc gct tcc agc tgt gca cct gtg tgc tgg ctc caa gct ggg ttt ctc gtg cac cct cca tgg tgg ggt gcc atg tgc gca cct tcc act cat cag'),
 ('U52112.4-020',
  'ARHGAP4',
  70097,
  70210,
  'tgc tgt gac aca ggg ttc cac ctg gcc ctg ggg cag gtg ctc cgg agc tac acg gcc gct gag agc cgc acc caa gcc tcc caa gtg cag ggc ctg ggc agc ctg gaa gaa gct'),
 ('U52112.4-001',
  'ARHGAP4',
  72761,
  72965,
  'a gat gcg ctg gca gct gag cga gca gct gcg ctg cct gga gct gca ggg cga gct gcg gcg gga gtt gct gca gga gct ggc aga gtt cat gcg gcg ccg cgc tga ggt gga gct gga ata ctc ccg ggg cct gga aaa gct ggc cga gcg ctt ctc cag ccg tgg agg ccg cct ggg gag cag ccg gga gca cca aag ctt ccg'),
 ('U52112.4-011',
  'ARHGAP4',
  72761,
  72965,
  'a gat gcg ctg gca gct gag cga gca gct gcg ctg cct gga gct gca ggg cga gct gcg gcg gga gtt gct gca gga gct ggc aga gtt cat gcg gcg ccg cgc tga ggt gga gct gga ata ctc ccg ggg cct gga aaa gct ggc cga gcg ctt ctc cag ccg tgg agg ccg cct ggg gag cag ccg gga gca cca aag ctt ccg'),
 ('U52112.4-003',
  'ARHGAP4',
  72761,
  72965,
  'a gat gcg ctg gca gct gag cga gca gct gcg ctg cct gga gct gca ggg cga gct gcg gcg gga gtt gct gca gga gct ggc aga gtt cat gcg gcg ccg cgc tga ggt gga gct gga ata ctc ccg ggg cct gga aaa gct ggc cga gcg ctt ctc cag ccg tgg agg ccg cct ggg gag cag ccg gga gca cca aag ctt ccg'),
 ('U52112.4-024',
  'ARHGAP4',
  72761,
  72965,
  'a gat gcg ctg gca gct gag cga gca gct gcg ctg cct gga gct gca ggg cga gct gcg gcg gga gtt gct gca gga gct ggc aga gtt cat gcg gcg ccg cgc tga ggt gga gct gga ata ctc ccg ggg cct gga aaa gct ggc cga gcg ctt ctc cag ccg tgg agg ccg cct ggg gag cag ccg gga gca cca aag ctt ccg'),
 ('U52112.4-001',
  'ARHGAP4',
  64566,
  64757,
  'gtg aac aag act ctg aag gcg aca ctg cag gcc ctg ctg gag gtg gtg gcc tcg gat gac ggg gat gtg ctt gat tcc ttc cag acc agc ccc tcc acc gag tcc ctc aag tcc acc agc tca gac cca ggc agc cgg cag gcg ggc cgg agg cgc ggc cag cag cag gag acc gaa acc ttc tac ctc acg'),
 ('U52112.4-011',
  'ARHGAP4',
  64566,
  64757,
  'gtg aac aag act ctg aag gcg aca ctg cag gcc ctg ctg gag gtg gtg gcc tcg gat gac ggg gat gtg ctt gat tcc ttc cag acc agc ccc tcc acc gag tcc ctc aag tcc acc agc tca gac cca ggc agc cgg cag gcg ggc cgg agg cgc ggc cag cag cag gag acc gaa acc ttc tac ctc acg'),
 ('U52112.4-003',
  'ARHGAP4',
  64566,
  64757,
  'gtg aac aag act ctg aag gcg aca ctg cag gcc ctg ctg gag gtg gtg gcc tcg gat gac ggg gat gtg ctt gat tcc ttc cag acc agc ccc tcc acc gag tcc ctc aag tcc acc agc tca gac cca ggc agc cgg cag gcg ggc cgg agg cgc ggc cag cag cag gag acc gaa acc ttc tac ctc acg'),
 ('U52112.4-024',
  'ARHGAP4',
  64566,
  64757,
  'gtg aac aag act ctg aag gcg aca ctg cag gcc ctg ctg gag gtg gtg gcc tcg gat gac ggg gat gtg ctt gat tcc ttc cag acc agc ccc tcc acc gag tcc ctc aag tcc acc agc tca gac cca ggc agc cgg cag gcg ggc cgg agg cgc ggc cag cag cag gag acc gaa acc ttc tac ctc acg'),
 ('U52112.4-001',
  'ARHGAP4',
  64375,
  64459,
  'aag ctc cag gag tat ctg agt gga cgg agc atc ctc gcc aag ctg cag gcc aag cac gag aag ctg cag gag gcc ctt cag cga g'),
 ('U52112.4-011',
  'ARHGAP4',
  64375,
  64459,
  'aag ctc cag gag tat ctg agt gga cgg agc atc ctc gcc aag ctg cag gcc aag cac gag aag ctg cag gag gcc ctt cag cga g'),
 ('U52112.4-003',
  'ARHGAP4',
  64375,
  64459,
  'aag ctc cag gag tat ctg agt gga cgg agc atc ctc gcc aag ctg cag gcc aag cac gag aag ctg cag gag gcc ctt cag cga g'),
 ('U52112.4-024',
  'ARHGAP4',
  64375,
  64459,
  'aag ctc cag gag tat ctg agt gga cgg agc atc ctc gcc aag ctg cag gcc aag cac gag aag ctg cag gag gcc ctt cag cga g'),
 ('U52112.4-001',
  'ARHGAP4',
  64181,
  64208,
  'g tga caa gga gga gca gga ggt gtc ttg'),
 ('U52112.4-011',
  'ARHGAP4',
  64181,
  64208,
  'g tga caa gga gga gca gga ggt gtc ttg'),
 ('U52112.4-003',
  'ARHGAP4',
  64181,
  64208,
  'g tga caa gga gga gca gga ggt gtc ttg'),
 ('U52112.4-024',
  'ARHGAP4',
  64181,
  64208,
  'g tga caa gga gga gca gga ggt gtc ttg'),
 ('U52112.4-001',
  'ARHGAP4',
  63857,
  63959,
  'ga ccc agt aca cac aga gaa aat tcc aga aga gcc gcc agc ccc gcc cca gct ccc agt ata acc aga gac tct ttg ggg gag aca tgg aga agt tta tcc ag'),
 ('U52112.4-011',
  'ARHGAP4',
  63857,
  63959,
  'ga ccc agt aca cac aga gaa aat tcc aga aga gcc gcc agc ccc gcc cca gct ccc agt ata acc aga gac tct ttg ggg gag aca tgg aga agt tta tcc ag'),
 ('U52112.4-003',
  'ARHGAP4',
  63857,
  63959,
  'ga ccc agt aca cac aga gaa aat tcc aga aga gcc gcc agc ccc gcc cca gct ccc agt ata acc aga gac tct ttg ggg gag aca tgg aga agt tta tcc ag'),
 ('U52112.4-024',
  'ARHGAP4',
  63857,
  63959,
  'ga ccc agt aca cac aga gaa aat tcc aga aga gcc gcc agc ccc gcc cca gct ccc agt ata acc aga gac tct ttg ggg gag aca tgg aga agt tta tcc ag'),
 ('U52112.4-001',
  'ARHGAP4',
  62079,
  62156,
  'g cct gca gca tga agg cat ctt ccg ggt atc ggg tgc cca gct ccg ggt ctc aga gat ccg tga tgc ctt cga gag ag'),
 ('U52112.4-011',
  'ARHGAP4',
  62079,
  62156,
  'g cct gca gca tga agg cat ctt ccg ggt atc ggg tgc cca gct ccg ggt ctc aga gat ccg tga tgc ctt cga gag ag'),
 ('U52112.4-003',
  'ARHGAP4',
  62079,
  62156,
  'g cct gca gca tga agg cat ctt ccg ggt atc ggg tgc cca gct ccg ggt ctc aga gat ccg tga tgc ctt cga gag ag'),
 ('U52112.4-024',
  'ARHGAP4',
  62079,
  62156,
  'g cct gca gca tga agg cat ctt ccg ggt atc ggg tgc cca gct ccg ggt ctc aga gat ccg tga tgc ctt cga gag ag'),
 ('U52112.4-001',
  'ARHGAP4',
  60600,
  60692,
  'ga cgg aga agc agg tgg tgg gcg cag ggc tgc aga ctg cag ggg agt ctg gga gca gtc ccg agg gcc tcc tgg cat cgg agc tgg tcc acc g'),
 ('U52112.4-011',
  'ARHGAP4',
  60600,
  60692,
  'ga cgg aga agc agg tgg tgg gcg cag ggc tgc aga ctg cag ggg agt ctg gga gca gtc ccg agg gcc tcc tgg cat cgg agc tgg tcc acc g'),
 ('U52112.4-003',
  'ARHGAP4',
  60600,
  60692,
  'ga cgg aga agc agg tgg tgg gcg cag ggc tgc aga ctg cag ggg agt ctg gga gca gtc ccg agg gcc tcc tgg cat cgg agc tgg tcc acc g'),
 ('U52112.4-024',
  'ARHGAP4',
  60600,
  60692,
  'ga cgg aga agc agg tgg tgg gcg cag ggc tgc aga ctg cag ggg agt ctg gga gca gtc ccg agg gcc tcc tgg cat cgg agc tgg tcc acc g'),
 ('U52112.4-001',
  'ARHGAP4',
  60227,
  60326,
  'gc cag agc cat gca cct cac ctg agg cca tgg gac cct ctg gac aca gac gac gct gct tgg tcc cag cct ccc cag agc aac acg tgg agg tgg ata ag'),
 ('U52112.4-011',
  'ARHGAP4',
  60227,
  60326,
  'gc cag agc cat gca cct cac ctg agg cca tgg gac cct ctg gac aca gac gac gct gct tgg tcc cag cct ccc cag agc aac acg tgg agg tgg ata ag'),
 ('U52112.4-003',
  'ARHGAP4',
  60227,
  60326,
  'gc cag agc cat gca cct cac ctg agg cca tgg gac cct ctg gac aca gac gac gct gct tgg tcc cag cct ccc cag agc aac acg tgg agg tgg ata ag'),
 ('U52112.4-024',
  'ARHGAP4',
  60227,
  60326,
  'gc cag agc cat gca cct cac ctg agg cca tgg gac cct ctg gac aca gac gac gct gct tgg tcc cag cct ccc cag agc aac acg tgg agg tgg ata ag'),
 ('U52112.2-001',
  'AVPR2',
  56689,
  57573,
  'c tgt gcc tgg gca tcc ctc tct gcc cag cct gcc cag caa cag cag cca gga gag gcc act gga cac ccg gga ccc gct gct agc ccg ggc gga gct ggc gct gct ctc cat agt ctt tgt ggc tgt ggc cct gag caa tgg cct ggt gct ggc ggc cct agc tcg gcg ggg ccg gcg ggg cca ctg ggc acc cat aca cgt ctt cat tgg cca ctt gtg cct ggc cga cct ggc cgt ggc tct gtt cca agt gct gcc cca gct ggc ctg gaa ggc cac cga ccg ctt ccg tgg gcc aga tgc cct gtg tcg ggc cgt gaa gta tct gca gat ggt ggg cat gta tgc ctc ctc cta cat gat cct ggc cat gac gct gga ccg cca ccg tgc cat ctg ccg tcc cat gct ggc gta ccg cca tgg aag tgg ggc tca ctg gaa ccg gcc ggt gct agt ggc ttg ggc ctt ctc gct cct tct cag cct gcc cca gct ctt cat ctt cgc cca gcg caa cgt gga agg tgg cag cgg ggt cac tga ctg ctg ggc ctg ctt tgc gga gcc ctg ggg ccg tcg cac cta tgt cac ctg gat tgc cct gat ggt gtt cgt ggc acc tac cct ggg tat cgc cgc ctg cca ggt gct cat ctt ccg gga gat tca tgc cag tct ggt gcc agg gcc atc aga gag gcc tgg ggg gcg ccg cag ggg acg ccg gac agg cag ccc cgg tga ggg agc cca cgt gtc agc agc tgt ggc caa gac tgt gag gat gac gct agt gat tgt ggt cgt cta tgt gct gtg ctg ggc acc ctt ctt cct ggt gca gct gtg ggc cgc gtg gga ccc gga ggc acc tct gga ag')]
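
The codon strings in the list above start with a partial group of one or two bases when the frame is 1 or 2 (for example 'g aaa att ...' or 'ga agg agc ...'), and may end with an incomplete codon. A minimal sketch of the two functions required by the exercise, which could produce strings of this shape, is given below. It assumes that the frame is the number of bases preceding the first complete codon (the GTF convention) and that sequences only contain the letters a, c, g, t in either case; the function bodies are illustrative and not necessarily the notebook's own solution.

```python
def reverse_complement_in_case(sequence, strand):
    """Return the reverse complement of sequence if strand is '-',
    otherwise return the sequence unchanged."""
    if strand != '-':
        return sequence
    # Assumption: only a/c/g/t (upper or lower case) occur in the sequence;
    # any other character is left as it is by translate().
    complement = str.maketrans('acgtACGT', 'tgcaTGCA')
    return sequence.translate(complement)[::-1]


def codon_separating(sequence, frame):
    """Split sequence into codons, honouring the frame, and join the
    pieces with a single space."""
    if frame not in (0, 1, 2):
        raise ValueError('frame must be 0, 1 or 2')
    pieces = []
    if frame:
        # The first `frame` bases complete a codon started in the previous exon.
        pieces.append(sequence[:frame])
    # Remaining bases in groups of three; the last group may be shorter.
    pieces.extend(sequence[i:i + 3] for i in range(frame, len(sequence), 3))
    return ' '.join(pieces)


print(codon_separating('gaaaattcca', 1))
print(reverse_complement_in_case('aacg', '-'))
```

Keeping the leading and trailing incomplete codons in the output, rather than discarding them, means that the full exon sequence can still be recovered by removing the spaces.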
