Exercise 4

EMBL (http://www.ebi.ac.uk/cgi-bin/sva/sva.pl/) is a nucleotide sequence database developed by EMBL-EBI (European Bioinformatics Institute, European Molecular Biology Laboratory), in which each nucleotide sequence is stored, together with other information, in a text file (EMBL entry) in a format called the EMBL format.

The EMBL format consists of records that begin with a two-letter uppercase code specifying the content of the record. The only records that do not begin with the two-letter code are those containing the nucleotide sequence.

Given a file in EMBL format containing the nucleotide sequence (sequence of bases) of an mRNA (a transcript expressed by a gene), produce:

  • the (nucleotide) sequence of the CDS in FASTA format
  • the frequency distribution of the codons (stop codon excluded), listed by decreasing frequency
  • the frequency distribution of the amino acids of the protein, listed by decreasing frequency

In addition, taking as input a file containing the genetic code, validate the protein contained in the EMBL file against the translation of the CDS sequence obtained with the genetic code.


Input parameters:

  • name of the file in EMBL format
  • name of the genetic code file

The genetic code file is organized in records of comma-separated fields, in which the first field is the symbol of an amino acid and the remaining fields are the codons that encode that amino acid. The last record, whose first field is s, lists the stop codons.

A,gct,gcc,gca,gcg
C,tgt,tgc
D,gat,gac
E,gaa,gag
F,ttt,ttc
G,ggt,ggc,gga,ggg
H,cat,cac
I,att,atc,ata
K,aaa,aag
L,tta,ttg,ctt,ctc,cta,ctg
M,atg
N,aat,aac
P,cct,ccc,cca,ccg
Q,caa,cag
R,cgt,cgc,cga,cgg,aga,agg
S,tct,tcc,tca,tcg,agt,agc
T,act,acc,aca,acg
V,gtt,gtc,gta,gtg
W,tgg
Y,tat,tac
s,tga,taa,tag


Where to find the information needed to solve the exercise:

  • The record that begins with ID

     ID   M10051; SV 1; linear; mRNA; STD; HUM; 4723 BP.

contains the unique identifier of the sequence (M10051) and the organism (HUM). The presence of the word mRNA indicates that the file refers to the nucleotide sequence of a gene transcript.

  • The record

     FT   CDS             139..4287

contains the start and end (1-based) of the coding sequence (CDS) on the nucleotide sequence.

  • The records that begin with FT contain the features of the nucleotide sequence. In particular, all the records of the section:

     FT                   /translation="MGTGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLT
     FT                   RLHELENCSVIEGHLQILLMFKTRPEDFRDLSFPKLIMITDYLLLFRVYGLESLKDLFP
     FT                   NLTVIRGSRLFFNYALVIFEMVHLKELGLYNLMNITRGSVRIEKNNELCYLATIDWSRI
     FT                   LDSVEDNHIVLNKDDNEECGDICPGTAKGKTNCPATVINGQFVERCWTHSHCQKVCPTI
     FT                   [...]
     FT                   DGGSSLGFKRSYEEHIPYTHMNGGKKNGRILTLPRSNPS"

contain the sequence of the protein expressed by the gene.

  • The record that begins with SQ:

      SQ   Sequence 4723 BP; 1068 A; 1298 C; 1311 G; 1046 T; 0 other;

introduces the nucleotide sequence section, which ends with the record // (end of file). Each record containing the nucleotide sequence begins with a run of leading spaces and contains a chunk of sequence up to 60 bases long. The integer at the end of the record gives the total length of the sequence up to and including that record. The chunk in each record is further split into smaller chunks of 10 bases.

SQ   Sequence 4723 BP; 1068 A; 1298 C; 1311 G; 1046 T; 0 other;
     ggggggctgc gcggccgggt cggtgcgcac acgagaagga cgcgcggccc ccagcgctct        60
     tgggggccgc ctcggagcat gacccccgcg ggccagcgcc gcgcgcctga tccgaggaga       120
     ccccgcgctc ccgcagccat gggcaccggg ggccggcggg gggcggcggc cgcgccgctg       180
     ctggtggcgg tggccgcgct gctactgggc gccgcgggcc acctgtaccc cggagaggtg       240
     tgtcccggca tggatatccg gaacaacctc actaggttgc atgagctgga gaattgctct       300
     gtcatcgaag gacacttgca gatactcttg atgttcaaaa cgaggcccga agatttccga       360
     gacctcagtt tccccaaact catcatgatc actgattact tgctgctctt ccgggtctat       420
     gggctcgaga gcctgaagga cctgttcccc aacctcacgg tcatccgggg atcacgactg       480
     [...]
     tttttcgttc cccccacccg cccccagcag atggaaagaa agcacctgtt tttacaaatt      4620
     cttttttttt tttttttttt tttttttttg ctggtgtctg agcttcagta taaaagacaa      4680
     aacttcctgt ttgtggaaca aaatttcgaa agaaaaaacc aaa                        4723
//
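
As a rough illustration of how the 1-based, inclusive coordinates of the FT CDS record map onto a Python slice, here is a minimal sketch. The names sample_entry, cds_start, cds_end and cds_length are hypothetical, and the short string only stands in for a full EMBL entry.

```python
import re

# Hypothetical stand-in for the text of an EMBL entry (only the CDS record matters here)
sample_entry = 'FT   sig_peptide     137..219\nFT   CDS             139..4287\n'

# Capture the two integers of the CDS location with a regular expression
match = re.search(r'^FT\s+CDS\s+(\d+)\.\.(\d+)', sample_entry, re.M)
cds_start, cds_end = int(match.group(1)), int(match.group(2))

# A 1-based inclusive range [start, end] corresponds to the 0-based slice [start-1:end]
cds_length = cds_end - (cds_start - 1)
print(cds_start, cds_end, cds_length)  # 139 4287 4149
```

With the full nucleotide sequence in a string seq (a hypothetical name), the CDS would then be seq[cds_start - 1:cds_end], which is 4149 bases long.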

NOTE:

  • the amino acid alphabet is {ACDEFGHIKLMNPQRSTVWY}
  • the nucleotide sequence reported in the EMBL entry is on the alphabet {a,c,g,t} even though it represents the primary sequence of an mRNA. To obtain the sequence on the alphabet {a,c,g,u}, replace every symbol t with the symbol u.

Requirements:

  • the header of the CDS sequence in FASTA format must contain the unique identifier of the sequence, the organism it refers to, and the length, in the following format:

      >M10051-HUM; len=4149

  • sequences in FASTA format must be produced in lines of 80 characters

  • do not perform the substitution from t to u.

  • a function format_fasta() must be defined that takes as arguments a FASTA header and a nucleotide/protein sequence, and returns the sequence in FASTA format with the sequence split into lines of 80 characters.

  • use only regular expressions to extract the information (except for extracting the information from the genetic code file)


Output variables:

  • cds_sequence_fasta: nucleotide sequence in FASTA format
  • codon_frequency: list of tuples (codon, frequency) listed by decreasing frequency
  • ammino_frequency: list of tuples (amino_acid, frequency) listed by decreasing frequency
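
The shape of the two frequency lists can be illustrated with a minimal sketch on a made-up sequence. It assumes that collections.Counter is an acceptable way to count; toy_cds, toy_codons and toy_codon_frequency are hypothetical names, not part of the exercise.

```python
import re
from collections import Counter

# Hypothetical toy CDS: atg gct gct tgg taa (the last codon is a stop codon)
toy_cds = 'atggctgcttggtaa'

# Split into codons and drop the final stop codon
toy_codons = re.findall(r'\w{3}', toy_cds)[:-1]

# most_common() returns (item, count) tuples sorted by decreasing count
toy_codon_frequency = Counter(toy_codons).most_common()
print(toy_codon_frequency)
```

The most frequent codon, gct with count 2, comes first. The same pattern applied to the characters of a protein string would give the amino acid list.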

Solution

Definition of the function format_fasta()


In [1]:
def format_fasta(header, sequence):
    # Header line followed by the sequence split into lines of at most 80 characters
    return header + '\n' + '\n'.join(re.findall(r'\w{1,80}', sequence))

NOTE: assume that the header passed to the function does not end with the newline symbol \n but does start with the symbol >.
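
A quick way to check the behaviour of format_fasta() is to call it on a made-up sequence whose length is not a multiple of 80. The header and sequence below are hypothetical; the function is repeated (with a raw string for the pattern) only so that the example runs on its own.

```python
import re

def format_fasta(header, sequence):
    # Definition from the cell above, repeated so that the example is self-contained
    return header + '\n' + '\n'.join(re.findall(r'\w{1,80}', sequence))

# Hypothetical header and a made-up sequence of 170 bases
toy_fasta = format_fasta('>toy-seq; len=170', 'acgt' * 42 + 'ac')
toy_lines = toy_fasta.split('\n')

# One header line followed by sequence lines of 80, 80 and 10 characters
print([len(line) for line in toy_lines[1:]])  # [80, 80, 10]
```

Because the pattern \w{1,80} is greedy, every line except possibly the last is exactly 80 characters long.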

Input parameters


In [2]:
genetic_code_name = './genetic-code.txt'
input_file_name = './M10051.txt'

Import the re module to use regular expressions (RE).


In [3]:
import re

Read the genetic code file into a list of strings file_str_list


In [4]:
with open(genetic_code_name, 'r') as genetic_file:
    file_str_list = genetic_file.readlines()

In [5]:
file_str_list


Out[5]:
['A,gct,gcc,gca,gcg\n',
 'C,tgt,tgc\n',
 'D,gat,gac\n',
 'E,gaa,gag\n',
 'F,ttt,ttc\n',
 'G,ggt,ggc,gga,ggg\n',
 'H,cat,cac\n',
 'I,att,atc,ata\n',
 'K,aaa,aag\n',
 'L,tta,ttg,ctt,ctc,cta,ctg\n',
 'M,atg\n',
 'N,aat,aac\n',
 'P,cct,ccc,cca,ccg\n',
 'Q,caa,cag\n',
 'R,cgt,cgc,cga,cgg,aga,agg\n',
 'S,tct,tcc,tca,tcg,agt,agc\n',
 'T,act,acc,aca,acg\n',
 'V,gtt,gtc,gta,gtg\n',
 'W,tgg\n',
 'Y,tat,tac\n',
 's,tga,taa,tag']

Construction of the genetic code dictionary in genetic_code_dict

From file_str_list, obtain the dictionary genetic_code_dict in which the keys are the codons and the values are the corresponding amino acids.

First build the list of (key, value) tuples and then use the dict() function to build the dictionary.


In [6]:
key_value_tuple_list = [(codon, row.rstrip().split(',')[0]) for row in file_str_list for codon in row.rstrip().split(',')[1:]]

In [7]:
key_value_tuple_list


Out[7]:
[('gct', 'A'),
 ('gcc', 'A'),
 ('gca', 'A'),
 ('gcg', 'A'),
 ('tgt', 'C'),
 ('tgc', 'C'),
 ('gat', 'D'),
 ('gac', 'D'),
 ('gaa', 'E'),
 ('gag', 'E'),
 ('ttt', 'F'),
 ('ttc', 'F'),
 ('ggt', 'G'),
 ('ggc', 'G'),
 ('gga', 'G'),
 ('ggg', 'G'),
 ('cat', 'H'),
 ('cac', 'H'),
 ('att', 'I'),
 ('atc', 'I'),
 ('ata', 'I'),
 ('aaa', 'K'),
 ('aag', 'K'),
 ('tta', 'L'),
 ('ttg', 'L'),
 ('ctt', 'L'),
 ('ctc', 'L'),
 ('cta', 'L'),
 ('ctg', 'L'),
 ('atg', 'M'),
 ('aat', 'N'),
 ('aac', 'N'),
 ('cct', 'P'),
 ('ccc', 'P'),
 ('cca', 'P'),
 ('ccg', 'P'),
 ('caa', 'Q'),
 ('cag', 'Q'),
 ('cgt', 'R'),
 ('cgc', 'R'),
 ('cga', 'R'),
 ('cgg', 'R'),
 ('aga', 'R'),
 ('agg', 'R'),
 ('tct', 'S'),
 ('tcc', 'S'),
 ('tca', 'S'),
 ('tcg', 'S'),
 ('agt', 'S'),
 ('agc', 'S'),
 ('act', 'T'),
 ('acc', 'T'),
 ('aca', 'T'),
 ('acg', 'T'),
 ('gtt', 'V'),
 ('gtc', 'V'),
 ('gta', 'V'),
 ('gtg', 'V'),
 ('tgg', 'W'),
 ('tat', 'Y'),
 ('tac', 'Y'),
 ('tga', 's'),
 ('taa', 's'),
 ('tag', 's')]

Build the dictionary with the dict() function.


In [8]:
genetic_code_dict = dict(key_value_tuple_list)

In [9]:
genetic_code_dict


Out[9]:
{'gct': 'A',
 'gcc': 'A',
 'gca': 'A',
 'gcg': 'A',
 'tgt': 'C',
 'tgc': 'C',
 'gat': 'D',
 'gac': 'D',
 'gaa': 'E',
 'gag': 'E',
 'ttt': 'F',
 'ttc': 'F',
 'ggt': 'G',
 'ggc': 'G',
 'gga': 'G',
 'ggg': 'G',
 'cat': 'H',
 'cac': 'H',
 'att': 'I',
 'atc': 'I',
 'ata': 'I',
 'aaa': 'K',
 'aag': 'K',
 'tta': 'L',
 'ttg': 'L',
 'ctt': 'L',
 'ctc': 'L',
 'cta': 'L',
 'ctg': 'L',
 'atg': 'M',
 'aat': 'N',
 'aac': 'N',
 'cct': 'P',
 'ccc': 'P',
 'cca': 'P',
 'ccg': 'P',
 'caa': 'Q',
 'cag': 'Q',
 'cgt': 'R',
 'cgc': 'R',
 'cga': 'R',
 'cgg': 'R',
 'aga': 'R',
 'agg': 'R',
 'tct': 'S',
 'tcc': 'S',
 'tca': 'S',
 'tcg': 'S',
 'agt': 'S',
 'agc': 'S',
 'act': 'T',
 'acc': 'T',
 'aca': 'T',
 'acg': 'T',
 'gtt': 'V',
 'gtc': 'V',
 'gta': 'V',
 'gtg': 'V',
 'tgg': 'W',
 'tat': 'Y',
 'tac': 'Y',
 'tga': 's',
 'taa': 's',
 'tag': 's'}
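
Once the dictionary is available, translating a nucleotide sequence reduces to one lookup per codon. The sketch below is one possible way to do it, not necessarily the approach taken in the rest of the solution; toy_code_dict is a hypothetical four-entry subset of genetic_code_dict and toy_cds is a made-up sequence.

```python
import re

# Hypothetical subset of genetic_code_dict, enough for the toy sequence below
toy_code_dict = {'atg': 'M', 'ggc': 'G', 'acc': 'T', 'taa': 's'}

# Made-up coding sequence: start codon, two codons, stop codon
toy_cds = 'atgggcacctaa'

# Split into codons, drop the final stop codon, translate each remaining codon
toy_codons = re.findall(r'\w{3}', toy_cds)[:-1]
toy_protein = ''.join(toy_code_dict[codon] for codon in toy_codons)
print(toy_protein)  # MGT
```

Validation would then amount to comparing such a translated string with the protein taken from the /translation qualifier of the EMBL entry.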

Read the file (EMBL entry) into a single string file_str


In [10]:
with open(input_file_name,'r') as input_file:
    file_str = input_file.read()

In [11]:
file_str


Out[11]:
'ID   M10051; SV 1; linear; mRNA; STD; HUM; 4723 BP.\nXX\nAC   M10051;\nXX\nDT   02-JUL-1986 (Rel. 09, Created)\nDT   14-NOV-2006 (Rel. 89, Last updated, Version 7)\nXX\nDE   Human insulin receptor mRNA, complete cds.\nXX\nKW   insulin receptor; tyrosine kinase.\nXX\nOS   Homo sapiens (human)\nOC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;\nOC   Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae;\nOC   Homo.\nXX\nRN   [1]\nRP   1-4723\nRX   DOI; 10.1016/0092-8674(85)90334-4.\nRX   PUBMED; 2859121.\nRA   Ebina Y., Ellis L., Jarnagin K., Edery M., Graf L., Clauser E., Ou J.-H.,\nRA   Masiarz F., Kan Y.W., Goldfine I.D., Roth R.A., Rutter W.J.;\nRT   "The human insulin receptor cDNA: the structural basis for\nRT   hormone-activated transmembrane signalling";\nRL   Cell 40(4):747-758(1985).\nXX\nDR   MD5; e4e6ebf2e723a500c1dd62385c279351.\nDR   Ensembl-Gn; ENSG00000171105; homo_sapiens.\nDR   Ensembl-Tr; ENST00000302850; homo_sapiens.\nDR   Ensembl-Tr; ENST00000341500; homo_sapiens.\nDR   EuropePMC; PMC2739203; 19682364.\nDR   EuropePMC; PMC3164640; 21909271.\nDR   EuropePMC; PMC452597; 15146055.\nXX\nCC   [1] suggests that the insulin receptor may be the cellular homolog\nCC   of the v-ros transforming (oncogene) protein.  [1] notes\nCC   similarities between the insulin receptor and several growth factor\nCC   receptors and oncogenes.  Insulin receptor is a heterodimer\nCC   consisting of 2 alpha and 2 beta subunits.  Beta-prime may be a\nCC   cleavage product produced upon binding of insulin.  [1] suggests\nCC   that translation may begin at the \'atg\' start codon at positions\nCC   79-81 with protein cleavage occurring after position 120 to yield\nCC   the signal peptide.  [1] gives illustrations of the various domains\nCC   present in the protein.  A draft entry and sequence for [1] in\nCC   computer-readable form were kindly provided by K. 
Jarnagin\nCC   (30-JUL-1985).\nXX\nFH   Key             Location/Qualifiers\nFH\nFT   source          1..4723\nFT                   /organism="Homo sapiens"\nFT                   /map="19p13.3-p13.2"\nFT                   /mol_type="mRNA"\nFT                   /db_xref="taxon:9606"\nFT   sig_peptide     137..219\nFT                   /note="insulin receptor signal peptide"\nFT   CDS             139..4287\nFT                   /codon_start=1\nFT                   /gene="INSR"\nFT                   /note="insulin receptor precursor"\nFT                   /db_xref="GOA:P06213"\nFT                   /db_xref="H-InvDB:HIT000194074.15"\nFT                   /db_xref="HGNC:HGNC:6091"\nFT                   /db_xref="InterPro:IPR000494"\nFT                   /db_xref="InterPro:IPR000719"\nFT                   /db_xref="InterPro:IPR001245"\nFT                   /db_xref="InterPro:IPR002011"\nFT                   /db_xref="InterPro:IPR003961"\nFT                   /db_xref="InterPro:IPR006211"\nFT                   /db_xref="InterPro:IPR006212"\nFT                   /db_xref="InterPro:IPR008266"\nFT                   /db_xref="InterPro:IPR009030"\nFT                   /db_xref="InterPro:IPR011009"\nFT                   /db_xref="InterPro:IPR013783"\nFT                   /db_xref="InterPro:IPR016246"\nFT                   /db_xref="InterPro:IPR017441"\nFT                   /db_xref="InterPro:IPR020635"\nFT                   /db_xref="InterPro:IPR032675"\nFT                   /db_xref="PDB:1GAG"\nFT                   /db_xref="PDB:1I44"\nFT                   /db_xref="PDB:1IR3"\nFT                   /db_xref="PDB:1IRK"\nFT                   /db_xref="PDB:1P14"\nFT                   /db_xref="PDB:1RQQ"\nFT                   /db_xref="PDB:2AUH"\nFT                   /db_xref="PDB:2B4S"\nFT                   /db_xref="PDB:2HR7"\nFT                   /db_xref="PDB:2MFR"\nFT                   /db_xref="PDB:2Z8C"\nFT                   /db_xref="PDB:3BU3"\nFT                   
/db_xref="PDB:3BU5"\nFT                   /db_xref="PDB:3BU6"\nFT                   /db_xref="PDB:3EKK"\nFT                   /db_xref="PDB:3EKN"\nFT                   /db_xref="PDB:3ETA"\nFT                   /db_xref="PDB:3W11"\nFT                   /db_xref="PDB:3W12"\nFT                   /db_xref="PDB:3W13"\nFT                   /db_xref="PDB:3W14"\nFT                   /db_xref="PDB:4IBM"\nFT                   /db_xref="PDB:4OGA"\nFT                   /db_xref="PDB:4XLV"\nFT                   /db_xref="PDB:4XSS"\nFT                   /db_xref="PDB:4XST"\nFT                   /db_xref="PDB:4ZXB"\nFT                   /db_xref="PDB:5E1S"\nFT                   /db_xref="PDB:5HHW"\nFT                   /db_xref="UniProtKB/Swiss-Prot:P06213"\nFT                   /protein_id="AAA59174.1"\nFT                   /translation="MGTGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLT\nFT                   RLHELENCSVIEGHLQILLMFKTRPEDFRDLSFPKLIMITDYLLLFRVYGLESLKDLFP\nFT                   NLTVIRGSRLFFNYALVIFEMVHLKELGLYNLMNITRGSVRIEKNNELCYLATIDWSRI\nFT                   LDSVEDNHIVLNKDDNEECGDICPGTAKGKTNCPATVINGQFVERCWTHSHCQKVCPTI\nFT                   CKSHGCTAEGLCCHSECLGNCSQPDDPTKCVACRNFYLDGRCVETCPPPYYHFQDWRCV\nFT                   NFSFCQDLHHKCKNSRRQGCHQYVIHNNKCIPECPSGYTMNSSNLLCTPCLGPCPKVCH\nFT                   LLEGEKTIDSVTSAQELRGCTVINGSLIINIRGGNNLAAELEANLGLIEEISGYLKIRR\nFT                   SYALVSLSFFRKLRLIRGETLEIGNYSFYALDNQNLRQLWDWSKHNLTTTQGKLFFHYN\nFT                   PKLCLSEIHKMEEVSGTKGRQERNDIALKTNGDKASCENELLKFSYIRTSFDKILLRWE\nFT                   PYWPPDFRDLLGFMLFYKEAPYQNVTEFDGQDACGSNSWTVVDIDPPLRSNDPKSQNHP\nFT                   GWLMRGLKPWTQYAIFVKTLVTFSDERRTYGAKSDIIYVQTDATNPSVPLDPISVSNSS\nFT                   SQIILKWKPPSDPNGNITHYLVFWERQAEDSELFELDYCLKGLKLPSRTWSPPFESEDS\nFT                   QKHNQSEYEDSAGECCSCPKTDSQILKELEESSFRKTFEDYLHNVVFVPRKTSSGTGAE\nFT                   DPRPSRKRRSLGDVGNVTVAVPTVAAFPNTSSTSVPTSPEEHRPFEKVVNKESLVISGL\nFT                   
RHFTGYRIELQACNQDTPEERCSVAAYVSARTMPEAKADDIVGPVTHEIFENNVVHLMW\nFT                   QEPKEPNGLIVLYEVSYRRYGDEELHLCVSRKHFALERGCRLRGLSPGNYSVRIRATSL\nFT                   AGNGSWTEPTYFYVTDYLDVPSNIAKIIIGPLIFVFLFSVVIGSIYLFLRKRQPDGPLG\nFT                   PLYASSNPEYLSASDVFPCSVYVPDEWEVSREKITLLRELGQGSFGMVYEGNARDIIKG\nFT                   EAETRVAVKTVNESASLRERIEFLNEASVMKGFTCHHVVRLLGVVSKGQPTLVVMELMA\nFT                   HGDLKSYLRSLRPEAENNPGRPPPTLQEMIQMAAEIADGMAYLNAKKFVHRDLAARNCM\nFT                   VAHDFTVKIGDFGMTRDIYETDYYRKGGKGLLPVRWMAPESLKDGVFTTSSDMWSFGVV\nFT                   LWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDNCPERVTDLMRMCWQFNPKMRPTFL\nFT                   EIVNLLKDDLHPSFPEVSFFHSEENKAPESEELEMEFEDMENVPLDRSSHCQREEAGGR\nFT                   DGGSSLGFKRSYEEHIPYTHMNGGKKNGRILTLPRSNPS"\nFT   mat_peptide     220..2424\nFT                   /gene="INSR"\nFT                   /note="insulin receptor alpha subunit"\nFT   mat_peptide     2425..4284\nFT                   /gene="INSR"\nFT                   /note="insulin receptor beta subunit"\nFT   mat_peptide     2425..2469\nFT                   /partial\nFT                   /gene="INSR"\nFT                   /note="insulin receptor beta-prime subunit"\nXX\nSQ   Sequence 4723 BP; 1068 A; 1298 C; 1311 G; 1046 T; 0 other;\n     ggggggctgc gcggccgggt cggtgcgcac acgagaagga cgcgcggccc ccagcgctct        60\n     tgggggccgc ctcggagcat gacccccgcg ggccagcgcc gcgcgcctga tccgaggaga       120\n     ccccgcgctc ccgcagccat gggcaccggg ggccggcggg gggcggcggc cgcgccgctg       180\n     ctggtggcgg tggccgcgct gctactgggc gccgcgggcc acctgtaccc cggagaggtg       240\n     tgtcccggca tggatatccg gaacaacctc actaggttgc atgagctgga gaattgctct       300\n     gtcatcgaag gacacttgca gatactcttg atgttcaaaa cgaggcccga agatttccga       360\n     gacctcagtt tccccaaact catcatgatc actgattact tgctgctctt ccgggtctat       420\n     gggctcgaga gcctgaagga cctgttcccc aacctcacgg tcatccgggg atcacgactg       480\n     ttctttaact acgcgctggt catcttcgag atggttcacc tcaaggaact cggcctctac    
   540\n     aacctgatga acatcacccg gggttctgtc cgcatcgaga agaacaatga gctctgttac       600\n     ttggccacta tcgactggtc ccgtatcctg gattccgtgg aggataatca catcgtgttg       660\n     aacaaagatg acaacgagga gtgtggagac atctgtccgg gtaccgcgaa gggcaagacc       720\n     aactgccccg ccaccgtcat caacgggcag tttgtcgaac gatgttggac tcatagtcac       780\n     tgccagaaag tttgcccgac catctgtaag tcacacggct gcaccgccga aggcctctgt       840\n     tgccacagcg agtgcctggg caactgttct cagcccgacg accccaccaa gtgcgtggcc       900\n     tgccgcaact tctacctgga cggcaggtgt gtggagacct gcccgccccc gtactaccac       960\n     ttccaggact ggcgctgtgt gaacttcagc ttctgccagg acctgcacca caaatgcaag      1020\n     aactcgcgga ggcagggctg ccaccaatac gtcattcaca acaacaagtg catccctgag      1080\n     tgtccctccg ggtacacgat gaattccagc aacttgctgt gcaccccatg cctgggtccc      1140\n     tgtcccaagg tgtgccacct cctagaaggc gagaagacca tcgactcggt gacgtctgcc      1200\n     caggagctcc gaggatgcac cgtcatcaac gggagtctga tcatcaacat tcgaggaggc      1260\n     aacaatctgg cagctgagct agaagccaac ctcggcctca ttgaagaaat ttcagggtat      1320\n     ctaaaaatcc gccgatccta cgctctggtg tcactttcct tcttccggaa gttacgtctg      1380\n     attcgaggag agaccttgga aattgggaac tactccttct atgccttgga caaccagaac      1440\n     ctaaggcagc tctgggactg gagcaaacac aacctcacca ccactcaggg gaaactcttc      1500\n     ttccactata accccaaact ctgcttgtca gaaatccaca agatggaaga agtttcagga      1560\n     accaaggggc gccaggagag aaacgacatt gccctgaaga ccaatgggga caaggcatcc      1620\n     tgtgaaaatg agttacttaa attttcttac attcggacat cttttgacaa gatcttgctg      1680\n     agatgggagc cgtactggcc ccccgacttc cgagacctct tggggttcat gctgttctac      1740\n     aaagaggccc cttatcagaa tgtgacggag ttcgatgggc aggatgcgtg tggttccaac      1800\n     agttggacgg tggtagacat tgacccaccc ctgaggtcca acgaccccaa atcacagaac      1860\n     cacccagggt ggctgatgcg gggtctcaag ccctggaccc agtatgccat ctttgtgaag      1920\n     accctggtca ccttttcgga tgaacgccgg acctatgggg ccaagagtga catcatttat      1980\n     gtccagacag 
atgccaccaa cccctctgtg cccctggatc caatctcagt gtctaactca      2040\n     tcatcccaga ttattctgaa gtggaaacca ccctccgacc ccaatggcaa catcacccac      2100\n     tacctggttt tctgggagag gcaggcggaa gacagtgagc tgttcgagct ggattattgc      2160\n     ctcaaagggc tgaagctgcc ctcgaggacc tggtctccac cattcgagtc tgaagattct      2220\n     cagaagcaca accagagtga gtatgaggat tcggccggcg aatgctgctc ctgtccaaag      2280\n     acagactctc agatcctgaa ggagctggag gagtcctcgt ttaggaagac gtttgaggat      2340\n     tacctgcaca acgtggtttt cgtccccaga aaaacctctt caggcactgg tgccgaggac      2400\n     cctaggccat ctcggaaacg caggtccctt ggcgatgttg ggaatgtgac ggtggccgtg      2460\n     cccacggtgg cagctttccc caacacttcc tcgaccagcg tgcccacgag tccggaggag      2520\n     cacaggcctt ttgagaaggt ggtgaacaag gagtcgctgg tcatctccgg cttgcgacac      2580\n     ttcacgggct atcgcatcga gctgcaggct tgcaaccagg acacccctga ggaacggtgc      2640\n     agtgtggcag cctacgtcag tgcgaggacc atgcctgaag ccaaggctga tgacattgtt      2700\n     ggccctgtga cgcatgaaat ctttgagaac aacgtcgtcc acttgatgtg gcaggagccg      2760\n     aaggagccca atggtctgat cgtgctgtat gaagtgagtt atcggcgata tggtgatgag      2820\n     gagctgcatc tctgcgtctc ccgcaagcac ttcgctctgg aacggggctg caggctgcgt      2880\n     gggctgtcac cggggaacta cagcgtgcga atccgggcca cctcccttgc gggcaacggc      2940\n     tcttggacgg aacccaccta tttctacgtg acagactatt tagacgtccc gtcaaatatt      3000\n     gcaaaaatta tcatcggccc cctcatcttt gtctttctct tcagtgttgt gattggaagt      3060\n     atttatctat tcctgagaaa gaggcagcca gatgggccgc tgggaccgct ttacgcttct      3120\n     tcaaaccctg agtatctcag tgccagtgat gtgtttccat gctctgtgta cgtgccggac      3180\n     gagtgggagg tgtctcgaga gaagatcacc ctccttcgag agctggggca gggctccttc      3240\n     ggcatggtgt atgagggcaa tgccagggac atcatcaagg gtgaggcaga gacccgcgtg      3300\n     gcggtgaaga cggtcaacga gtcagccagt ctccgagagc ggattgagtt cctcaatgag      3360\n     gcctcggtca tgaagggctt cacctgccat cacgtggtgc gcctcctggg agtggtgtcc      3420\n     aagggccagc ccacgctggt ggtgatggag 
ctgatggctc acggagacct gaagagctac      3480\n     ctccgttctc tgcggccaga ggctgagaat aatcctggcc gccctccccc tacccttcaa      3540\n     gagatgattc agatggcggc agagattgct gacgggatgg cctacctgaa cgccaagaag      3600\n     tttgtgcatc gggacctggc agcgagaaac tgcatggtcg cccatgattt tactgtcaaa      3660\n     attggagact ttggaatgac cagagacatc tatgaaacgg attactaccg gaaagggggc      3720\n     aagggtctgc tccctgtacg gtggatggca ccggagtccc tgaaggatgg ggtcttcacc      3780\n     acttcttctg acatgtggtc ctttggcgtg gtcctttggg aaatcaccag cttggcagaa      3840\n     cagccttacc aaggcctgtc taatgaacag gtgttgaaat ttgtcatgga tggagggtat      3900\n     ctggatcaac ccgacaactg tccagagaga gtcactgacc tcatgcgcat gtgctggcaa      3960\n     ttcaacccca agatgaggcc aaccttcctg gagattgtca acctgctcaa ggacgacctg      4020\n     caccccagct ttccagaggt gtcgttcttc cacagcgagg agaacaaggc tcccgagagt      4080\n     gaggagctgg agatggagtt tgaggacatg gagaatgtgc ccctggaccg ttcctcgcac      4140\n     tgtcagaggg aggaggcggg gggccgggat ggagggtcct cgctgggttt caagcggagc      4200\n     tacgaggaac acatccctta cacacacatg aacggaggca agaaaaacgg gcggattctg      4260\n     accttgcctc ggtccaatcc ttcctaacag tgcctaccgt ggcgggggcg ggcaggggtt      4320\n     cccattttcg ctttcctctg gtttgaaagc ctctggaaaa ctcaggattc tcacgactct      4380\n     accatgtcca gtggagttca gagatcgttc ctatacattt ctgttcatct taaggtggac      4440\n     tcgtttggtt accaatttaa ctagtcctgc agaggattta actgtgaacc tggagggcaa      4500\n     ggggtttcca cagttgctgc tcctttgggg caacgacggt ttcaaaccag gattttgtgt      4560\n     tttttcgttc cccccacccg cccccagcag atggaaagaa agcacctgtt tttacaaatt      4620\n     cttttttttt tttttttttt tttttttttg ctggtgtctg agcttcagta taaaagacaa      4680\n     aacttcctgt ttgtggaaca aaatttcgaa agaaaaaacc aaa                        4723\n//\n'

Extraction of the unique identifier and the organism of the EMBL entry

Extract from the ID record the unique identifier and the organism into the variables identifier and organism.

ID   M10051; SV 1; linear; mRNA; STD; HUM; 4723 BP.

Extraction of the organism.


In [12]:
s = re.search(r'([\w\s]+;){5}\s+(\w+);', file_str, re.M)
organism = s.group(2)

In [13]:
organism


Out[13]:
'HUM'

Extraction of the identifier.


In [14]:
s = re.search(r'^ID\s+(\w+);', file_str, re.M)
identifier = s.group(1)

In [15]:
identifier


Out[15]:
'M10051'
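
The two searches can also be folded into a single pattern anchored on the ID record, which avoids relying on the ID record being the first line that matches. This is only an alternative sketch; id_record, sample_identifier, sample_organism, cds_len and sample_header are hypothetical names.

```python
import re

# The ID record shown above, used here as a stand-alone sample
id_record = 'ID   M10051; SV 1; linear; mRNA; STD; HUM; 4723 BP.'

# Group 1: identifier (first field); group 2: organism (sixth field)
m = re.search(r'^ID\s+(\w+);(?:[^;]+;){4}\s+(\w+);', id_record, re.M)
sample_identifier, sample_organism = m.group(1), m.group(2)

# Compose the FASTA header in the format required by the exercise
cds_len = 4287 - 139 + 1  # from the CDS coordinates 139..4287
sample_header = '>' + sample_identifier + '-' + sample_organism + '; len=' + str(cds_len)
print(sample_header)  # >M10051-HUM; len=4149
```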

Extraction of the nucleotide sequence of the mRNA

Extract into the list seq_row_list the records of the nucleotide sequence, excluding only the final integer (keeping the spaces before the integer and the leading spaces beyond the first two, which are matched by the pattern).

tgggggccgc ctcggagcat gacccccgcg ggccagcgcc gcgcgcctga tccgaggaga       120

In [16]:
seq_row_list = re.findall(r'^\W{2}(\D+)\d+', file_str, re.M)

In [17]:
seq_row_list


Out[17]:
['   ggggggctgc gcggccgggt cggtgcgcac acgagaagga cgcgcggccc ccagcgctct        ',
 '   tgggggccgc ctcggagcat gacccccgcg ggccagcgcc gcgcgcctga tccgaggaga       ',
 '   ccccgcgctc ccgcagccat gggcaccggg ggccggcggg gggcggcggc cgcgccgctg       ',
 '   ctggtggcgg tggccgcgct gctactgggc gccgcgggcc acctgtaccc cggagaggtg       ',
 '   tgtcccggca tggatatccg gaacaacctc actaggttgc atgagctgga gaattgctct       ',
 '   gtcatcgaag gacacttgca gatactcttg atgttcaaaa cgaggcccga agatttccga       ',
 '   gacctcagtt tccccaaact catcatgatc actgattact tgctgctctt ccgggtctat       ',
 '   gggctcgaga gcctgaagga cctgttcccc aacctcacgg tcatccgggg atcacgactg       ',
 '   ttctttaact acgcgctggt catcttcgag atggttcacc tcaaggaact cggcctctac       ',
 '   aacctgatga acatcacccg gggttctgtc cgcatcgaga agaacaatga gctctgttac       ',
 '   ttggccacta tcgactggtc ccgtatcctg gattccgtgg aggataatca catcgtgttg       ',
 '   aacaaagatg acaacgagga gtgtggagac atctgtccgg gtaccgcgaa gggcaagacc       ',
 '   aactgccccg ccaccgtcat caacgggcag tttgtcgaac gatgttggac tcatagtcac       ',
 '   tgccagaaag tttgcccgac catctgtaag tcacacggct gcaccgccga aggcctctgt       ',
 '   tgccacagcg agtgcctggg caactgttct cagcccgacg accccaccaa gtgcgtggcc       ',
 '   tgccgcaact tctacctgga cggcaggtgt gtggagacct gcccgccccc gtactaccac       ',
 '   ttccaggact ggcgctgtgt gaacttcagc ttctgccagg acctgcacca caaatgcaag      ',
 '   aactcgcgga ggcagggctg ccaccaatac gtcattcaca acaacaagtg catccctgag      ',
 '   tgtccctccg ggtacacgat gaattccagc aacttgctgt gcaccccatg cctgggtccc      ',
 '   tgtcccaagg tgtgccacct cctagaaggc gagaagacca tcgactcggt gacgtctgcc      ',
 '   caggagctcc gaggatgcac cgtcatcaac gggagtctga tcatcaacat tcgaggaggc      ',
 '   aacaatctgg cagctgagct agaagccaac ctcggcctca ttgaagaaat ttcagggtat      ',
 '   ctaaaaatcc gccgatccta cgctctggtg tcactttcct tcttccggaa gttacgtctg      ',
 '   attcgaggag agaccttgga aattgggaac tactccttct atgccttgga caaccagaac      ',
 '   ctaaggcagc tctgggactg gagcaaacac aacctcacca ccactcaggg gaaactcttc      ',
 '   ttccactata accccaaact ctgcttgtca gaaatccaca agatggaaga agtttcagga      ',
 '   accaaggggc gccaggagag aaacgacatt gccctgaaga ccaatgggga caaggcatcc      ',
 '   tgtgaaaatg agttacttaa attttcttac attcggacat cttttgacaa gatcttgctg      ',
 '   agatgggagc cgtactggcc ccccgacttc cgagacctct tggggttcat gctgttctac      ',
 '   aaagaggccc cttatcagaa tgtgacggag ttcgatgggc aggatgcgtg tggttccaac      ',
 '   agttggacgg tggtagacat tgacccaccc ctgaggtcca acgaccccaa atcacagaac      ',
 '   cacccagggt ggctgatgcg gggtctcaag ccctggaccc agtatgccat ctttgtgaag      ',
 '   accctggtca ccttttcgga tgaacgccgg acctatgggg ccaagagtga catcatttat      ',
 '   gtccagacag atgccaccaa cccctctgtg cccctggatc caatctcagt gtctaactca      ',
 '   tcatcccaga ttattctgaa gtggaaacca ccctccgacc ccaatggcaa catcacccac      ',
 '   tacctggttt tctgggagag gcaggcggaa gacagtgagc tgttcgagct ggattattgc      ',
 '   ctcaaagggc tgaagctgcc ctcgaggacc tggtctccac cattcgagtc tgaagattct      ',
 '   cagaagcaca accagagtga gtatgaggat tcggccggcg aatgctgctc ctgtccaaag      ',
 '   acagactctc agatcctgaa ggagctggag gagtcctcgt ttaggaagac gtttgaggat      ',
 '   tacctgcaca acgtggtttt cgtccccaga aaaacctctt caggcactgg tgccgaggac      ',
 '   cctaggccat ctcggaaacg caggtccctt ggcgatgttg ggaatgtgac ggtggccgtg      ',
 '   cccacggtgg cagctttccc caacacttcc tcgaccagcg tgcccacgag tccggaggag      ',
 '   cacaggcctt ttgagaaggt ggtgaacaag gagtcgctgg tcatctccgg cttgcgacac      ',
 '   ttcacgggct atcgcatcga gctgcaggct tgcaaccagg acacccctga ggaacggtgc      ',
 '   agtgtggcag cctacgtcag tgcgaggacc atgcctgaag ccaaggctga tgacattgtt      ',
 '   ggccctgtga cgcatgaaat ctttgagaac aacgtcgtcc acttgatgtg gcaggagccg      ',
 '   aaggagccca atggtctgat cgtgctgtat gaagtgagtt atcggcgata tggtgatgag      ',
 '   gagctgcatc tctgcgtctc ccgcaagcac ttcgctctgg aacggggctg caggctgcgt      ',
 '   gggctgtcac cggggaacta cagcgtgcga atccgggcca cctcccttgc gggcaacggc      ',
 '   tcttggacgg aacccaccta tttctacgtg acagactatt tagacgtccc gtcaaatatt      ',
 '   gcaaaaatta tcatcggccc cctcatcttt gtctttctct tcagtgttgt gattggaagt      ',
 '   atttatctat tcctgagaaa gaggcagcca gatgggccgc tgggaccgct ttacgcttct      ',
 '   tcaaaccctg agtatctcag tgccagtgat gtgtttccat gctctgtgta cgtgccggac      ',
 '   gagtgggagg tgtctcgaga gaagatcacc ctccttcgag agctggggca gggctccttc      ',
 '   ggcatggtgt atgagggcaa tgccagggac atcatcaagg gtgaggcaga gacccgcgtg      ',
 '   gcggtgaaga cggtcaacga gtcagccagt ctccgagagc ggattgagtt cctcaatgag      ',
 '   gcctcggtca tgaagggctt cacctgccat cacgtggtgc gcctcctggg agtggtgtcc      ',
 '   aagggccagc ccacgctggt ggtgatggag ctgatggctc acggagacct gaagagctac      ',
 '   ctccgttctc tgcggccaga ggctgagaat aatcctggcc gccctccccc tacccttcaa      ',
 '   gagatgattc agatggcggc agagattgct gacgggatgg cctacctgaa cgccaagaag      ',
 '   tttgtgcatc gggacctggc agcgagaaac tgcatggtcg cccatgattt tactgtcaaa      ',
 '   attggagact ttggaatgac cagagacatc tatgaaacgg attactaccg gaaagggggc      ',
 '   aagggtctgc tccctgtacg gtggatggca ccggagtccc tgaaggatgg ggtcttcacc      ',
 '   acttcttctg acatgtggtc ctttggcgtg gtcctttggg aaatcaccag cttggcagaa      ',
 '   cagccttacc aaggcctgtc taatgaacag gtgttgaaat ttgtcatgga tggagggtat      ',
 '   ctggatcaac ccgacaactg tccagagaga gtcactgacc tcatgcgcat gtgctggcaa      ',
 '   ttcaacccca agatgaggcc aaccttcctg gagattgtca acctgctcaa ggacgacctg      ',
 '   caccccagct ttccagaggt gtcgttcttc cacagcgagg agaacaaggc tcccgagagt      ',
 '   gaggagctgg agatggagtt tgaggacatg gagaatgtgc ccctggaccg ttcctcgcac      ',
 '   tgtcagaggg aggaggcggg gggccgggat ggagggtcct cgctgggttt caagcggagc      ',
 '   tacgaggaac acatccctta cacacacatg aacggaggca agaaaaacgg gcggattctg      ',
 '   accttgcctc ggtccaatcc ttcctaacag tgcctaccgt ggcgggggcg ggcaggggtt      ',
 '   cccattttcg ctttcctctg gtttgaaagc ctctggaaaa ctcaggattc tcacgactct      ',
 '   accatgtcca gtggagttca gagatcgttc ctatacattt ctgttcatct taaggtggac      ',
 '   tcgtttggtt accaatttaa ctagtcctgc agaggattta actgtgaacc tggagggcaa      ',
 '   ggggtttcca cagttgctgc tcctttgggg caacgacggt ttcaaaccag gattttgtgt      ',
 '   tttttcgttc cccccacccg cccccagcag atggaaagaa agcacctgtt tttacaaatt      ',
 '   cttttttttt tttttttttt tttttttttg ctggtgtctg agcttcagta taaaagacaa      ',
 '   aacttcctgt ttgtggaaca aaatttcgaa agaaaaaacc aaa                        ']
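
To see why the pattern selects only the sequence records, it can be tried on a small hypothetical excerpt (sample_text below is made up). \W{2} requires two non-word characters at the start of the line, so records starting with a two-letter code are rejected; (\D+) captures everything up to the first digit; \d+ requires the final integer, so the terminator // is rejected as well.

```python
import re

# Hypothetical three-line excerpt: an FT record, a sequence record, the terminator
sample_text = (
    'FT   CDS             139..4287\n'
    '     ggggggctgc gcggccgggt        20\n'
    '//\n'
)

# Same pattern as in the cell above
sample_rows = re.findall(r'^\W{2}(\D+)\d+', sample_text, re.M)
print(sample_rows)
```

Only the middle line is returned, with three of its five leading spaces and all the spaces before the integer still attached.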

Extract from seq_row_list the list seq_chunk_list containing the chunks (of length at most 10) of the nucleotide sequence.

NOTE: the element seq_chunk_list[i] is a nested list and contains the chunks (six, except possibly for the last record) of the i-th record of seq_row_list.


In [18]:
seq_chunk_list = [re.findall(r'\w+', row) for row in seq_row_list]
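
A nested list of chunks like this one can be flattened into a single string with a double comprehension. The sketch below uses a hypothetical two-record stand-in (sample_row_list) rather than the real data, and is only one way to carry out the concatenation.

```python
import re

# Hypothetical two-record stand-in for seq_row_list
sample_row_list = ['   ggggggctgc gcggccgggt        ', '   tgggggccgc ctcgg       ']

# Same extraction as above: one list of chunks per record
sample_chunk_list = [re.findall(r'\w+', row) for row in sample_row_list]

# Flatten the nested list and concatenate the chunks into one string
sample_sequence = ''.join(chunk for chunks in sample_chunk_list for chunk in chunks)
print(sample_sequence, len(sample_sequence))
```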

In [19]:
seq_chunk_list


Out[19]:
[['ggggggctgc',
  'gcggccgggt',
  'cggtgcgcac',
  'acgagaagga',
  'cgcgcggccc',
  'ccagcgctct'],
 ['tgggggccgc',
  'ctcggagcat',
  'gacccccgcg',
  'ggccagcgcc',
  'gcgcgcctga',
  'tccgaggaga'],
 ['ccccgcgctc',
  'ccgcagccat',
  'gggcaccggg',
  'ggccggcggg',
  'gggcggcggc',
  'cgcgccgctg'],
 ['ctggtggcgg',
  'tggccgcgct',
  'gctactgggc',
  'gccgcgggcc',
  'acctgtaccc',
  'cggagaggtg'],
 ['tgtcccggca',
  'tggatatccg',
  'gaacaacctc',
  'actaggttgc',
  'atgagctgga',
  'gaattgctct'],
 ['gtcatcgaag',
  'gacacttgca',
  'gatactcttg',
  'atgttcaaaa',
  'cgaggcccga',
  'agatttccga'],
 ['gacctcagtt',
  'tccccaaact',
  'catcatgatc',
  'actgattact',
  'tgctgctctt',
  'ccgggtctat'],
 ['gggctcgaga',
  'gcctgaagga',
  'cctgttcccc',
  'aacctcacgg',
  'tcatccgggg',
  'atcacgactg'],
 ['ttctttaact',
  'acgcgctggt',
  'catcttcgag',
  'atggttcacc',
  'tcaaggaact',
  'cggcctctac'],
 ['aacctgatga',
  'acatcacccg',
  'gggttctgtc',
  'cgcatcgaga',
  'agaacaatga',
  'gctctgttac'],
 ['ttggccacta',
  'tcgactggtc',
  'ccgtatcctg',
  'gattccgtgg',
  'aggataatca',
  'catcgtgttg'],
 ['aacaaagatg',
  'acaacgagga',
  'gtgtggagac',
  'atctgtccgg',
  'gtaccgcgaa',
  'gggcaagacc'],
 ['aactgccccg',
  'ccaccgtcat',
  'caacgggcag',
  'tttgtcgaac',
  'gatgttggac',
  'tcatagtcac'],
 ['tgccagaaag',
  'tttgcccgac',
  'catctgtaag',
  'tcacacggct',
  'gcaccgccga',
  'aggcctctgt'],
 ['tgccacagcg',
  'agtgcctggg',
  'caactgttct',
  'cagcccgacg',
  'accccaccaa',
  'gtgcgtggcc'],
 ['tgccgcaact',
  'tctacctgga',
  'cggcaggtgt',
  'gtggagacct',
  'gcccgccccc',
  'gtactaccac'],
 ['ttccaggact',
  'ggcgctgtgt',
  'gaacttcagc',
  'ttctgccagg',
  'acctgcacca',
  'caaatgcaag'],
 ['aactcgcgga',
  'ggcagggctg',
  'ccaccaatac',
  'gtcattcaca',
  'acaacaagtg',
  'catccctgag'],
 ['tgtccctccg',
  'ggtacacgat',
  'gaattccagc',
  'aacttgctgt',
  'gcaccccatg',
  'cctgggtccc'],
 ['tgtcccaagg',
  'tgtgccacct',
  'cctagaaggc',
  'gagaagacca',
  'tcgactcggt',
  'gacgtctgcc'],
 ['caggagctcc',
  'gaggatgcac',
  'cgtcatcaac',
  'gggagtctga',
  'tcatcaacat',
  'tcgaggaggc'],
 ['aacaatctgg',
  'cagctgagct',
  'agaagccaac',
  'ctcggcctca',
  'ttgaagaaat',
  'ttcagggtat'],
 ['ctaaaaatcc',
  'gccgatccta',
  'cgctctggtg',
  'tcactttcct',
  'tcttccggaa',
  'gttacgtctg'],
 ['attcgaggag',
  'agaccttgga',
  'aattgggaac',
  'tactccttct',
  'atgccttgga',
  'caaccagaac'],
 ['ctaaggcagc',
  'tctgggactg',
  'gagcaaacac',
  'aacctcacca',
  'ccactcaggg',
  'gaaactcttc'],
 ['ttccactata',
  'accccaaact',
  'ctgcttgtca',
  'gaaatccaca',
  'agatggaaga',
  'agtttcagga'],
 ['accaaggggc',
  'gccaggagag',
  'aaacgacatt',
  'gccctgaaga',
  'ccaatgggga',
  'caaggcatcc'],
 ['tgtgaaaatg',
  'agttacttaa',
  'attttcttac',
  'attcggacat',
  'cttttgacaa',
  'gatcttgctg'],
 ['agatgggagc',
  'cgtactggcc',
  'ccccgacttc',
  'cgagacctct',
  'tggggttcat',
  'gctgttctac'],
 ['aaagaggccc',
  'cttatcagaa',
  'tgtgacggag',
  'ttcgatgggc',
  'aggatgcgtg',
  'tggttccaac'],
 ['agttggacgg',
  'tggtagacat',
  'tgacccaccc',
  'ctgaggtcca',
  'acgaccccaa',
  'atcacagaac'],
 ['cacccagggt',
  'ggctgatgcg',
  'gggtctcaag',
  'ccctggaccc',
  'agtatgccat',
  'ctttgtgaag'],
 ['accctggtca',
  'ccttttcgga',
  'tgaacgccgg',
  'acctatgggg',
  'ccaagagtga',
  'catcatttat'],
 ['gtccagacag',
  'atgccaccaa',
  'cccctctgtg',
  'cccctggatc',
  'caatctcagt',
  'gtctaactca'],
 ['tcatcccaga',
  'ttattctgaa',
  'gtggaaacca',
  'ccctccgacc',
  'ccaatggcaa',
  'catcacccac'],
 ['tacctggttt',
  'tctgggagag',
  'gcaggcggaa',
  'gacagtgagc',
  'tgttcgagct',
  'ggattattgc'],
 ['ctcaaagggc',
  'tgaagctgcc',
  'ctcgaggacc',
  'tggtctccac',
  'cattcgagtc',
  'tgaagattct'],
 ['cagaagcaca',
  'accagagtga',
  'gtatgaggat',
  'tcggccggcg',
  'aatgctgctc',
  'ctgtccaaag'],
 ['acagactctc',
  'agatcctgaa',
  'ggagctggag',
  'gagtcctcgt',
  'ttaggaagac',
  'gtttgaggat'],
 ['tacctgcaca',
  'acgtggtttt',
  'cgtccccaga',
  'aaaacctctt',
  'caggcactgg',
  'tgccgaggac'],
 ['cctaggccat',
  'ctcggaaacg',
  'caggtccctt',
  'ggcgatgttg',
  'ggaatgtgac',
  'ggtggccgtg'],
 ['cccacggtgg',
  'cagctttccc',
  'caacacttcc',
  'tcgaccagcg',
  'tgcccacgag',
  'tccggaggag'],
 ['cacaggcctt',
  'ttgagaaggt',
  'ggtgaacaag',
  'gagtcgctgg',
  'tcatctccgg',
  'cttgcgacac'],
 ['ttcacgggct',
  'atcgcatcga',
  'gctgcaggct',
  'tgcaaccagg',
  'acacccctga',
  'ggaacggtgc'],
 ['agtgtggcag',
  'cctacgtcag',
  'tgcgaggacc',
  'atgcctgaag',
  'ccaaggctga',
  'tgacattgtt'],
 ['ggccctgtga',
  'cgcatgaaat',
  'ctttgagaac',
  'aacgtcgtcc',
  'acttgatgtg',
  'gcaggagccg'],
 ['aaggagccca',
  'atggtctgat',
  'cgtgctgtat',
  'gaagtgagtt',
  'atcggcgata',
  'tggtgatgag'],
 ['gagctgcatc',
  'tctgcgtctc',
  'ccgcaagcac',
  'ttcgctctgg',
  'aacggggctg',
  'caggctgcgt'],
 ['gggctgtcac',
  'cggggaacta',
  'cagcgtgcga',
  'atccgggcca',
  'cctcccttgc',
  'gggcaacggc'],
 ['tcttggacgg',
  'aacccaccta',
  'tttctacgtg',
  'acagactatt',
  'tagacgtccc',
  'gtcaaatatt'],
 ['gcaaaaatta',
  'tcatcggccc',
  'cctcatcttt',
  'gtctttctct',
  'tcagtgttgt',
  'gattggaagt'],
 ['atttatctat',
  'tcctgagaaa',
  'gaggcagcca',
  'gatgggccgc',
  'tgggaccgct',
  'ttacgcttct'],
 ['tcaaaccctg',
  'agtatctcag',
  'tgccagtgat',
  'gtgtttccat',
  'gctctgtgta',
  'cgtgccggac'],
 ['gagtgggagg',
  'tgtctcgaga',
  'gaagatcacc',
  'ctccttcgag',
  'agctggggca',
  'gggctccttc'],
 ['ggcatggtgt',
  'atgagggcaa',
  'tgccagggac',
  'atcatcaagg',
  'gtgaggcaga',
  'gacccgcgtg'],
 ['gcggtgaaga',
  'cggtcaacga',
  'gtcagccagt',
  'ctccgagagc',
  'ggattgagtt',
  'cctcaatgag'],
 ['gcctcggtca',
  'tgaagggctt',
  'cacctgccat',
  'cacgtggtgc',
  'gcctcctggg',
  'agtggtgtcc'],
 ['aagggccagc',
  'ccacgctggt',
  'ggtgatggag',
  'ctgatggctc',
  'acggagacct',
  'gaagagctac'],
 ['ctccgttctc',
  'tgcggccaga',
  'ggctgagaat',
  'aatcctggcc',
  'gccctccccc',
  'tacccttcaa'],
 ['gagatgattc',
  'agatggcggc',
  'agagattgct',
  'gacgggatgg',
  'cctacctgaa',
  'cgccaagaag'],
 ['tttgtgcatc',
  'gggacctggc',
  'agcgagaaac',
  'tgcatggtcg',
  'cccatgattt',
  'tactgtcaaa'],
 ['attggagact',
  'ttggaatgac',
  'cagagacatc',
  'tatgaaacgg',
  'attactaccg',
  'gaaagggggc'],
 ['aagggtctgc',
  'tccctgtacg',
  'gtggatggca',
  'ccggagtccc',
  'tgaaggatgg',
  'ggtcttcacc'],
 ['acttcttctg',
  'acatgtggtc',
  'ctttggcgtg',
  'gtcctttggg',
  'aaatcaccag',
  'cttggcagaa'],
 ['cagccttacc',
  'aaggcctgtc',
  'taatgaacag',
  'gtgttgaaat',
  'ttgtcatgga',
  'tggagggtat'],
 ['ctggatcaac',
  'ccgacaactg',
  'tccagagaga',
  'gtcactgacc',
  'tcatgcgcat',
  'gtgctggcaa'],
 ['ttcaacccca',
  'agatgaggcc',
  'aaccttcctg',
  'gagattgtca',
  'acctgctcaa',
  'ggacgacctg'],
 ['caccccagct',
  'ttccagaggt',
  'gtcgttcttc',
  'cacagcgagg',
  'agaacaaggc',
  'tcccgagagt'],
 ['gaggagctgg',
  'agatggagtt',
  'tgaggacatg',
  'gagaatgtgc',
  'ccctggaccg',
  'ttcctcgcac'],
 ['tgtcagaggg',
  'aggaggcggg',
  'gggccgggat',
  'ggagggtcct',
  'cgctgggttt',
  'caagcggagc'],
 ['tacgaggaac',
  'acatccctta',
  'cacacacatg',
  'aacggaggca',
  'agaaaaacgg',
  'gcggattctg'],
 ['accttgcctc',
  'ggtccaatcc',
  'ttcctaacag',
  'tgcctaccgt',
  'ggcgggggcg',
  'ggcaggggtt'],
 ['cccattttcg',
  'ctttcctctg',
  'gtttgaaagc',
  'ctctggaaaa',
  'ctcaggattc',
  'tcacgactct'],
 ['accatgtcca',
  'gtggagttca',
  'gagatcgttc',
  'ctatacattt',
  'ctgttcatct',
  'taaggtggac'],
 ['tcgtttggtt',
  'accaatttaa',
  'ctagtcctgc',
  'agaggattta',
  'actgtgaacc',
  'tggagggcaa'],
 ['ggggtttcca',
  'cagttgctgc',
  'tcctttgggg',
  'caacgacggt',
  'ttcaaaccag',
  'gattttgtgt'],
 ['tttttcgttc',
  'cccccacccg',
  'cccccagcag',
  'atggaaagaa',
  'agcacctgtt',
  'tttacaaatt'],
 ['cttttttttt',
  'tttttttttt',
  'tttttttttg',
  'ctggtgtctg',
  'agcttcagta',
  'taaaagacaa'],
 ['aacttcctgt', 'ttgtggaaca', 'aaatttcgaa', 'agaaaaaacc', 'aaa']]
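As a small self-contained illustration of this step, the sketch below applies the same idea to two rows copied from the tail of seq_row_list. The names sample_rows and sample_chunks are hypothetical, and the pattern [a-z]+ is an assumption of the sketch (the notebook uses \w+): it would simply ignore digits if a row still carried a position counter.

```python
import re

# Two rows copied from the tail of seq_row_list shown above.
sample_rows = [
    '   cttttttttt tttttttttt tttttttttg ctggtgtctg agcttcagta taaaagacaa      ',
    '   aacttcctgt ttgtggaaca aaatttcgaa agaaaaaacc aaa                        ',
]

# [a-z]+ keeps only runs of lowercase letters; this is an assumption of
# the sketch, the notebook cell itself uses \w+.
sample_chunks = [re.findall(r'[a-z]+', row) for row in sample_rows]

# Number of chunks per row: a full row has six, the last row has fewer.
chunk_counts = [len(chunks) for chunks in sample_chunks]
print(chunk_counts)
print(sample_chunks[1])
```

The last row yields fewer than six chunks, and its final chunk is shorter than 10 bases, which is why the exercise speaks of chunks of length at most 10.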

Concatenate the chunks of the list seq_chunk_list to obtain the nucleotide sequence in the variable nucleotide_sequence.


In [20]:
nucleotide_sequence = ''.join([''.join(list_six_chunks) for list_six_chunks in seq_chunk_list])

In [21]:
nucleotide_sequence


Out[21]:
'ggggggctgcgcggccgggtcggtgcgcacacgagaaggacgcgcggcccccagcgctcttgggggccgcctcggagcatgacccccgcgggccagcgccgcgcgcctgatccgaggagaccccgcgctcccgcagccatgggcaccgggggccggcggggggcggcggccgcgccgctgctggtggcggtggccgcgctgctactgggcgccgcgggccacctgtaccccggagaggtgtgtcccggcatggatatccggaacaacctcactaggttgcatgagctggagaattgctctgtcatcgaaggacacttgcagatactcttgatgttcaaaacgaggcccgaagatttccgagacctcagtttccccaaactcatcatgatcactgattacttgctgctcttccgggtctatgggctcgagagcctgaaggacctgttccccaacctcacggtcatccggggatcacgactgttctttaactacgcgctggtcatcttcgagatggttcacctcaaggaactcggcctctacaacctgatgaacatcacccggggttctgtccgcatcgagaagaacaatgagctctgttacttggccactatcgactggtcccgtatcctggattccgtggaggataatcacatcgtgttgaacaaagatgacaacgaggagtgtggagacatctgtccgggtaccgcgaagggcaagaccaactgccccgccaccgtcatcaacgggcagtttgtcgaacgatgttggactcatagtcactgccagaaagtttgcccgaccatctgtaagtcacacggctgcaccgccgaaggcctctgttgccacagcgagtgcctgggcaactgttctcagcccgacgaccccaccaagtgcgtggcctgccgcaacttctacctggacggcaggtgtgtggagacctgcccgcccccgtactaccacttccaggactggcgctgtgtgaacttcagcttctgccaggacctgcaccacaaatgcaagaactcgcggaggcagggctgccaccaatacgtcattcacaacaacaagtgcatccctgagtgtccctccgggtacacgatgaattccagcaacttgctgtgcaccccatgcctgggtccctgtcccaaggtgtgccacctcctagaaggcgagaagaccatcgactcggtgacgtctgcccaggagctccgaggatgcaccgtcatcaacgggagtctgatcatcaacattcgaggaggcaacaatctggcagctgagctagaagccaacctcggcctcattgaagaaatttcagggtatctaaaaatccgccgatcctacgctctggtgtcactttccttcttccggaagttacgtctgattcgaggagagaccttggaaattgggaactactccttctatgccttggacaaccagaacctaaggcagctctgggactggagcaaacacaacctcaccaccactcaggggaaactcttcttccactataaccccaaactctgcttgtcagaaatccacaagatggaagaagtttcaggaaccaaggggcgccaggagagaaacgacattgccctgaagaccaatggggacaaggcatcctgtgaaaatgagttacttaaattttcttacattcggacatcttttgacaagatcttgctgagatgggagccgtactggccccccgacttccgagacctcttggggttcatgctgttctacaaagaggccccttatcagaatgtgacggagttcgatgggcaggatgcgtgtggttccaacagttggacggtggtagacattgacccacccctgaggtccaacgaccccaaatcacagaaccacccagggtggctgatgcggggtctcaagccctggacccagtatgccatctttgtgaagaccctggtcaccttttcggatgaacgccggacctatggggccaagagtgacatcatttatgtccagacagatgccacca
acccctctgtgcccctggatccaatctcagtgtctaactcatcatcccagattattctgaagtggaaaccaccctccgaccccaatggcaacatcacccactacctggttttctgggagaggcaggcggaagacagtgagctgttcgagctggattattgcctcaaagggctgaagctgccctcgaggacctggtctccaccattcgagtctgaagattctcagaagcacaaccagagtgagtatgaggattcggccggcgaatgctgctcctgtccaaagacagactctcagatcctgaaggagctggaggagtcctcgtttaggaagacgtttgaggattacctgcacaacgtggttttcgtccccagaaaaacctcttcaggcactggtgccgaggaccctaggccatctcggaaacgcaggtcccttggcgatgttgggaatgtgacggtggccgtgcccacggtggcagctttccccaacacttcctcgaccagcgtgcccacgagtccggaggagcacaggccttttgagaaggtggtgaacaaggagtcgctggtcatctccggcttgcgacacttcacgggctatcgcatcgagctgcaggcttgcaaccaggacacccctgaggaacggtgcagtgtggcagcctacgtcagtgcgaggaccatgcctgaagccaaggctgatgacattgttggccctgtgacgcatgaaatctttgagaacaacgtcgtccacttgatgtggcaggagccgaaggagcccaatggtctgatcgtgctgtatgaagtgagttatcggcgatatggtgatgaggagctgcatctctgcgtctcccgcaagcacttcgctctggaacggggctgcaggctgcgtgggctgtcaccggggaactacagcgtgcgaatccgggccacctcccttgcgggcaacggctcttggacggaacccacctatttctacgtgacagactatttagacgtcccgtcaaatattgcaaaaattatcatcggccccctcatctttgtctttctcttcagtgttgtgattggaagtatttatctattcctgagaaagaggcagccagatgggccgctgggaccgctttacgcttcttcaaaccctgagtatctcagtgccagtgatgtgtttccatgctctgtgtacgtgccggacgagtgggaggtgtctcgagagaagatcaccctccttcgagagctggggcagggctccttcggcatggtgtatgagggcaatgccagggacatcatcaagggtgaggcagagacccgcgtggcggtgaagacggtcaacgagtcagccagtctccgagagcggattgagttcctcaatgaggcctcggtcatgaagggcttcacctgccatcacgtggtgcgcctcctgggagtggtgtccaagggccagcccacgctggtggtgatggagctgatggctcacggagacctgaagagctacctccgttctctgcggccagaggctgagaataatcctggccgccctccccctacccttcaagagatgattcagatggcggcagagattgctgacgggatggcctacctgaacgccaagaagtttgtgcatcgggacctggcagcgagaaactgcatggtcgcccatgattttactgtcaaaattggagactttggaatgaccagagacatctatgaaacggattactaccggaaagggggcaagggtctgctccctgtacggtggatggcaccggagtccctgaaggatggggtcttcaccacttcttctgacatgtggtcctttggcgtggtcctttgggaaatcaccagcttggcagaacagccttaccaaggcctgtctaatgaacaggtgttgaaatttgtcatggatggagggtatctggatcaacccgacaactgtccagagagagtcactgacctcatgcgcatgtgctggcaattcaaccccaagatgaggccaaccttcctggagattgtc
aacctgctcaaggacgacctgcaccccagctttccagaggtgtcgttcttccacagcgaggagaacaaggctcccgagagtgaggagctggagatggagtttgaggacatggagaatgtgcccctggaccgttcctcgcactgtcagagggaggaggcggggggccgggatggagggtcctcgctgggtttcaagcggagctacgaggaacacatcccttacacacacatgaacggaggcaagaaaaacgggcggattctgaccttgcctcggtccaatccttcctaacagtgcctaccgtggcgggggcgggcaggggttcccattttcgctttcctctggtttgaaagcctctggaaaactcaggattctcacgactctaccatgtccagtggagttcagagatcgttcctatacatttctgttcatcttaaggtggactcgtttggttaccaatttaactagtcctgcagaggatttaactgtgaacctggagggcaaggggtttccacagttgctgctcctttggggcaacgacggtttcaaaccaggattttgtgttttttcgttccccccacccgcccccagcagatggaaagaaagcacctgtttttacaaattcttttttttttttttttttttttttttttgctggtgtctgagcttcagtataaaagacaaaacttcctgtttgtggaacaaaatttcgaaagaaaaaaccaaa'
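The nested join above can also be written by flattening the list of lists first. The following is only a sketch on toy data (toy_chunk_list is a hypothetical name) showing that both forms give the same string.

```python
from itertools import chain

# Toy nested list shaped like seq_chunk_list: one inner list per record.
toy_chunk_list = [['ggggggctgc', 'gcggccgggt'], ['aacttcctgt', 'aaa']]

# Form used in the notebook: join each inner list, then join the results.
nested_join = ''.join([''.join(chunks) for chunks in toy_chunk_list])

# Alternative: flatten the inner lists lazily and join once.
flat_join = ''.join(chain.from_iterable(toy_chunk_list))

print(nested_join == flat_join)
```

Either form is fine for a sequence of this size; the flattened version just avoids building the intermediate per-record strings.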

Extracting the protein sequence

Extract into the variable protein_prefix the prefix of the protein found in the record containing the word /translation:

FT                   /translation="MGTGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLT

In [22]:
s = re.search(r'^FT\s+/translation="(\w+)$', file_str, re.M)
protein_prefix = s.group(1)

In [23]:
protein_prefix


Out[23]:
'MGTGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLT'
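re.search returns None when nothing matches, so calling group(1) on the result fails with an AttributeError if the file has no /translation record. A minimal sketch of a guarded version follows; the function name extract_translation_prefix and the text sample_embl are hypothetical and serve only as illustration.

```python
import re

def extract_translation_prefix(text):
    """Return the protein prefix on the /translation record, or raise ValueError."""
    match = re.search(r'^FT\s+/translation="(\w+)$', text, re.M)
    if match is None:
        raise ValueError('no /translation record found')
    return match.group(1)

# Tiny EMBL-like excerpt built from records shown in this exercise.
sample_embl = (
    'FT   CDS             139..4287\n'
    'FT                   /translation="MGTGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLT\n'
    'FT                   DGGSSLGFKRSYEEHIPYTHMNGGKKNGRILTLPRSNPS"\n'
)

sample_prefix = extract_translation_prefix(sample_embl)
print(sample_prefix)
```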

Extract into the list protein_row_list the remaining records (including the last one) that contain the protein sequence.

FT                   RLHELENCSVIEGHLQILLMFKTRPEDFRDLSFPKLIMITDYLLLFRVYGLESLKDLFP

Note that the last record ends with a closing double quote:

FT                   DGGSSLGFKRSYEEHIPYTHMNGGKKNGRILTLPRSNPS"

In [24]:
protein_row_list = re.findall(r'^FT\s+([ACDEFGHIKLMNPQRSTVWY]+)"?$', file_str, re.M)

In [25]:
protein_row_list


Out[25]:
['RLHELENCSVIEGHLQILLMFKTRPEDFRDLSFPKLIMITDYLLLFRVYGLESLKDLFP',
 'NLTVIRGSRLFFNYALVIFEMVHLKELGLYNLMNITRGSVRIEKNNELCYLATIDWSRI',
 'LDSVEDNHIVLNKDDNEECGDICPGTAKGKTNCPATVINGQFVERCWTHSHCQKVCPTI',
 'CKSHGCTAEGLCCHSECLGNCSQPDDPTKCVACRNFYLDGRCVETCPPPYYHFQDWRCV',
 'NFSFCQDLHHKCKNSRRQGCHQYVIHNNKCIPECPSGYTMNSSNLLCTPCLGPCPKVCH',
 'LLEGEKTIDSVTSAQELRGCTVINGSLIINIRGGNNLAAELEANLGLIEEISGYLKIRR',
 'SYALVSLSFFRKLRLIRGETLEIGNYSFYALDNQNLRQLWDWSKHNLTTTQGKLFFHYN',
 'PKLCLSEIHKMEEVSGTKGRQERNDIALKTNGDKASCENELLKFSYIRTSFDKILLRWE',
 'PYWPPDFRDLLGFMLFYKEAPYQNVTEFDGQDACGSNSWTVVDIDPPLRSNDPKSQNHP',
 'GWLMRGLKPWTQYAIFVKTLVTFSDERRTYGAKSDIIYVQTDATNPSVPLDPISVSNSS',
 'SQIILKWKPPSDPNGNITHYLVFWERQAEDSELFELDYCLKGLKLPSRTWSPPFESEDS',
 'QKHNQSEYEDSAGECCSCPKTDSQILKELEESSFRKTFEDYLHNVVFVPRKTSSGTGAE',
 'DPRPSRKRRSLGDVGNVTVAVPTVAAFPNTSSTSVPTSPEEHRPFEKVVNKESLVISGL',
 'RHFTGYRIELQACNQDTPEERCSVAAYVSARTMPEAKADDIVGPVTHEIFENNVVHLMW',
 'QEPKEPNGLIVLYEVSYRRYGDEELHLCVSRKHFALERGCRLRGLSPGNYSVRIRATSL',
 'AGNGSWTEPTYFYVTDYLDVPSNIAKIIIGPLIFVFLFSVVIGSIYLFLRKRQPDGPLG',
 'PLYASSNPEYLSASDVFPCSVYVPDEWEVSREKITLLRELGQGSFGMVYEGNARDIIKG',
 'EAETRVAVKTVNESASLRERIEFLNEASVMKGFTCHHVVRLLGVVSKGQPTLVVMELMA',
 'HGDLKSYLRSLRPEAENNPGRPPPTLQEMIQMAAEIADGMAYLNAKKFVHRDLAARNCM',
 'VAHDFTVKIGDFGMTRDIYETDYYRKGGKGLLPVRWMAPESLKDGVFTTSSDMWSFGVV',
 'LWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDNCPERVTDLMRMCWQFNPKMRPTFL',
 'EIVNLLKDDLHPSFPEVSFFHSEENKAPESEELEMEFEDMENVPLDRSSHCQREEAGGR',
 'DGGSSLGFKRSYEEHIPYTHMNGGKKNGRILTLPRSNPS']

Prepend the prefix found above to the list, then concatenate all the blocks of protein_row_list into the variable protein_sequence to obtain the protein sequence.


In [26]:
# Wrap the prefix in a list: assigning the bare string to the slice
# would insert one list element per character.
protein_row_list[:0] = [protein_prefix]
protein_sequence = ''.join(protein_row_list)

In [27]:
protein_sequence


Out[27]:
'MGTGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLTRLHELENCSVIEGHLQILLMFKTRPEDFRDLSFPKLIMITDYLLLFRVYGLESLKDLFPNLTVIRGSRLFFNYALVIFEMVHLKELGLYNLMNITRGSVRIEKNNELCYLATIDWSRILDSVEDNHIVLNKDDNEECGDICPGTAKGKTNCPATVINGQFVERCWTHSHCQKVCPTICKSHGCTAEGLCCHSECLGNCSQPDDPTKCVACRNFYLDGRCVETCPPPYYHFQDWRCVNFSFCQDLHHKCKNSRRQGCHQYVIHNNKCIPECPSGYTMNSSNLLCTPCLGPCPKVCHLLEGEKTIDSVTSAQELRGCTVINGSLIINIRGGNNLAAELEANLGLIEEISGYLKIRRSYALVSLSFFRKLRLIRGETLEIGNYSFYALDNQNLRQLWDWSKHNLTTTQGKLFFHYNPKLCLSEIHKMEEVSGTKGRQERNDIALKTNGDKASCENELLKFSYIRTSFDKILLRWEPYWPPDFRDLLGFMLFYKEAPYQNVTEFDGQDACGSNSWTVVDIDPPLRSNDPKSQNHPGWLMRGLKPWTQYAIFVKTLVTFSDERRTYGAKSDIIYVQTDATNPSVPLDPISVSNSSSQIILKWKPPSDPNGNITHYLVFWERQAEDSELFELDYCLKGLKLPSRTWSPPFESEDSQKHNQSEYEDSAGECCSCPKTDSQILKELEESSFRKTFEDYLHNVVFVPRKTSSGTGAEDPRPSRKRRSLGDVGNVTVAVPTVAAFPNTSSTSVPTSPEEHRPFEKVVNKESLVISGLRHFTGYRIELQACNQDTPEERCSVAAYVSARTMPEAKADDIVGPVTHEIFENNVVHLMWQEPKEPNGLIVLYEVSYRRYGDEELHLCVSRKHFALERGCRLRGLSPGNYSVRIRATSLAGNGSWTEPTYFYVTDYLDVPSNIAKIIIGPLIFVFLFSVVIGSIYLFLRKRQPDGPLGPLYASSNPEYLSASDVFPCSVYVPDEWEVSREKITLLRELGQGSFGMVYEGNARDIIKGEAETRVAVKTVNESASLRERIEFLNEASVMKGFTCHHVVRLLGVVSKGQPTLVVMELMAHGDLKSYLRSLRPEAENNPGRPPPTLQEMIQMAAEIADGMAYLNAKKFVHRDLAARNCMVAHDFTVKIGDFGMTRDIYETDYYRKGGKGLLPVRWMAPESLKDGVFTTSSDMWSFGVVLWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDNCPERVTDLMRMCWQFNPKMRPTFLEIVNLLKDDLHPSFPEVSFFHSEENKAPESEELEMEFEDMENVPLDRSSHCQREEAGGRDGGSSLGFKRSYEEHIPYTHMNGGKKNGRILTLPRSNPS'
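As an alternative to the two-step approach (prefix first, then the remaining rows), the whole qualifier can be captured in one pass and cleaned afterwards. This is a sketch under the assumption that the translation is enclosed in a single pair of double quotes and that continuation records start with FT followed by whitespace; the names used are hypothetical.

```python
import re

# Three pieces of the protein as they appear on consecutive records.
piece_1 = 'MGTGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLT'
piece_2 = 'RLHELENCSVIEGHLQILLMFKTRPEDFRDLSFPKLIMITDYLLLFRVYGLESLKDLFP'
piece_3 = 'DGGSSLGFKRSYEEHIPYTHMNGGKKNGRILTLPRSNPS'

sample_embl = (
    'FT   CDS             139..4287\n'
    'FT                   /translation="' + piece_1 + '\n'
    'FT                   ' + piece_2 + '\n'
    'FT                   ' + piece_3 + '"\n'
)

# Capture everything between the opening and the closing double quote.
match = re.search(r'/translation="([^"]+)"', sample_embl)
raw_translation = match.group(1)

# Drop the line breaks together with the FT prefix of each continuation record.
sample_protein = re.sub(r'\nFT\s+', '', raw_translation)
print(sample_protein)
```

The result is the plain concatenation of the three pieces, with no FT markers or whitespace left inside it.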

Producing the coding sequence (CDS) in FASTA format

Extract from the record

FT   CDS             139..4287

the start and the end of the CDS on the nucleotide sequence of the mRNA.


In [28]:
s = re.search(r'^FT\s+CDS\s+(\d+)\.\.(\d+)$', file_str, re.M)
cds_start = s.group(1)
cds_end = s.group(2)

In [29]:
cds_start


Out[29]:
'139'

In [30]:
cds_end


Out[30]:
'4287'
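The two values come back as strings, so they have to be converted before being used as indices. A possible sketch that returns integers directly and fails clearly when the location is not a plain start..end range is shown below; parse_cds_location is a hypothetical helper, not part of the exercise.

```python
import re

def parse_cds_location(text):
    """Return (start, end) as integers for a simple 'start..end' CDS record."""
    match = re.search(r'^FT\s+CDS\s+(\d+)\.\.(\d+)$', text, re.M)
    if match is None:
        raise ValueError('no simple start..end CDS location found')
    return int(match.group(1)), int(match.group(2))

sample_start, sample_end = parse_cds_location('FT   CDS             139..4287\n')
print(sample_start, sample_end)
```

With this pattern, a location written in any other form (for example one wrapped in an operator) does not match and raises ValueError instead of returning wrong coordinates.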

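Obtain the coding sequence (CDS). EMBL coordinates are 1-based and inclusive, so the start is decremented by one for the Python slice while the end is used as-is.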


In [31]:
cds_sequence = nucleotide_sequence[int(cds_start)-1:int(cds_end)]

In [32]:
cds_sequence


Out[32]:
'atgggcaccgggggccggcggggggcggcggccgcgccgctgctggtggcggtggccgcgctgctactgggcgccgcgggccacctgtaccccggagaggtgtgtcccggcatggatatccggaacaacctcactaggttgcatgagctggagaattgctctgtcatcgaaggacacttgcagatactcttgatgttcaaaacgaggcccgaagatttccgagacctcagtttccccaaactcatcatgatcactgattacttgctgctcttccgggtctatgggctcgagagcctgaaggacctgttccccaacctcacggtcatccggggatcacgactgttctttaactacgcgctggtcatcttcgagatggttcacctcaaggaactcggcctctacaacctgatgaacatcacccggggttctgtccgcatcgagaagaacaatgagctctgttacttggccactatcgactggtcccgtatcctggattccgtggaggataatcacatcgtgttgaacaaagatgacaacgaggagtgtggagacatctgtccgggtaccgcgaagggcaagaccaactgccccgccaccgtcatcaacgggcagtttgtcgaacgatgttggactcatagtcactgccagaaagtttgcccgaccatctgtaagtcacacggctgcaccgccgaaggcctctgttgccacagcgagtgcctgggcaactgttctcagcccgacgaccccaccaagtgcgtggcctgccgcaacttctacctggacggcaggtgtgtggagacctgcccgcccccgtactaccacttccaggactggcgctgtgtgaacttcagcttctgccaggacctgcaccacaaatgcaagaactcgcggaggcagggctgccaccaatacgtcattcacaacaacaagtgcatccctgagtgtccctccgggtacacgatgaattccagcaacttgctgtgcaccccatgcctgggtccctgtcccaaggtgtgccacctcctagaaggcgagaagaccatcgactcggtgacgtctgcccaggagctccgaggatgcaccgtcatcaacgggagtctgatcatcaacattcgaggaggcaacaatctggcagctgagctagaagccaacctcggcctcattgaagaaatttcagggtatctaaaaatccgccgatcctacgctctggtgtcactttccttcttccggaagttacgtctgattcgaggagagaccttggaaattgggaactactccttctatgccttggacaaccagaacctaaggcagctctgggactggagcaaacacaacctcaccaccactcaggggaaactcttcttccactataaccccaaactctgcttgtcagaaatccacaagatggaagaagtttcaggaaccaaggggcgccaggagagaaacgacattgccctgaagaccaatggggacaaggcatcctgtgaaaatgagttacttaaattttcttacattcggacatcttttgacaagatcttgctgagatgggagccgtactggccccccgacttccgagacctcttggggttcatgctgttctacaaagaggccccttatcagaatgtgacggagttcgatgggcaggatgcgtgtggttccaacagttggacggtggtagacattgacccacccctgaggtccaacgaccccaaatcacagaaccacccagggtggctgatgcggggtctcaagccctggacccagtatgccatctttgtgaagaccctggtcaccttttcggatgaacgccggacctatggggccaagagtgacatcatttatgtccagacagatgccaccaacccctctgtgcccctggatccaatctcagtgtctaactcatcatcccagattattctgaagtggaaaccaccctccgaccccaatggcaacatcacccactacctggttttctgggagaggcaggcggaagacagtg
agctgttcgagctggattattgcctcaaagggctgaagctgccctcgaggacctggtctccaccattcgagtctgaagattctcagaagcacaaccagagtgagtatgaggattcggccggcgaatgctgctcctgtccaaagacagactctcagatcctgaaggagctggaggagtcctcgtttaggaagacgtttgaggattacctgcacaacgtggttttcgtccccagaaaaacctcttcaggcactggtgccgaggaccctaggccatctcggaaacgcaggtcccttggcgatgttgggaatgtgacggtggccgtgcccacggtggcagctttccccaacacttcctcgaccagcgtgcccacgagtccggaggagcacaggccttttgagaaggtggtgaacaaggagtcgctggtcatctccggcttgcgacacttcacgggctatcgcatcgagctgcaggcttgcaaccaggacacccctgaggaacggtgcagtgtggcagcctacgtcagtgcgaggaccatgcctgaagccaaggctgatgacattgttggccctgtgacgcatgaaatctttgagaacaacgtcgtccacttgatgtggcaggagccgaaggagcccaatggtctgatcgtgctgtatgaagtgagttatcggcgatatggtgatgaggagctgcatctctgcgtctcccgcaagcacttcgctctggaacggggctgcaggctgcgtgggctgtcaccggggaactacagcgtgcgaatccgggccacctcccttgcgggcaacggctcttggacggaacccacctatttctacgtgacagactatttagacgtcccgtcaaatattgcaaaaattatcatcggccccctcatctttgtctttctcttcagtgttgtgattggaagtatttatctattcctgagaaagaggcagccagatgggccgctgggaccgctttacgcttcttcaaaccctgagtatctcagtgccagtgatgtgtttccatgctctgtgtacgtgccggacgagtgggaggtgtctcgagagaagatcaccctccttcgagagctggggcagggctccttcggcatggtgtatgagggcaatgccagggacatcatcaagggtgaggcagagacccgcgtggcggtgaagacggtcaacgagtcagccagtctccgagagcggattgagttcctcaatgaggcctcggtcatgaagggcttcacctgccatcacgtggtgcgcctcctgggagtggtgtccaagggccagcccacgctggtggtgatggagctgatggctcacggagacctgaagagctacctccgttctctgcggccagaggctgagaataatcctggccgccctccccctacccttcaagagatgattcagatggcggcagagattgctgacgggatggcctacctgaacgccaagaagtttgtgcatcgggacctggcagcgagaaactgcatggtcgcccatgattttactgtcaaaattggagactttggaatgaccagagacatctatgaaacggattactaccggaaagggggcaagggtctgctccctgtacggtggatggcaccggagtccctgaaggatggggtcttcaccacttcttctgacatgtggtcctttggcgtggtcctttgggaaatcaccagcttggcagaacagccttaccaaggcctgtctaatgaacaggtgttgaaatttgtcatggatggagggtatctggatcaacccgacaactgtccagagagagtcactgacctcatgcgcatgtgctggcaattcaaccccaagatgaggccaaccttcctggagattgtcaacctgctcaaggacgacctgcaccccagctttccagaggtgtcgttcttccacagcgaggagaacaaggctcccgagagtgaggagctggagatggagtttgaggacatggagaatgtgcccctggaccgttcctcg
cactgtcagagggaggaggcggggggccgggatggagggtcctcgctgggtttcaagcggagctacgaggaacacatcccttacacacacatgaacggaggcaagaaaaacgggcggattctgaccttgcctcggtccaatccttcctaa'
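The conversion from 1-based inclusive coordinates to a Python slice can be checked on a toy sequence. The sketch below uses a hypothetical helper slice_feature and verifies that the extracted length equals end - start + 1, which for 139..4287 gives 4287 - 139 + 1 = 4149.

```python
def slice_feature(sequence, start, end):
    """Return the subsequence for 1-based, inclusive coordinates start..end."""
    # Python slices are 0-based with an exclusive end: shift only the start.
    return sequence[start - 1:end]

toy_sequence = 'ccatgaaataagg'
toy_cds = slice_feature(toy_sequence, 3, 11)

print(toy_cds)
print(len(toy_cds) == 11 - 3 + 1)
```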

Produce in the variable cds_sequence_fasta the CDS sequence in FASTA format with the following header:

>M10051-HUM; len = 4149

In [33]:
header = '>' + identifier + '-' + organism + '; len = ' + str(len(cds_sequence))
cds_sequence_fasta = format_fasta(header, cds_sequence)

In [34]:
print(cds_sequence_fasta)


>M10051-HUM; len = 4149
atgggcaccgggggccggcggggggcggcggccgcgccgctgctggtggcggtggccgcgctgctactgggcgccgcggg
ccacctgtaccccggagaggtgtgtcccggcatggatatccggaacaacctcactaggttgcatgagctggagaattgct
ctgtcatcgaaggacacttgcagatactcttgatgttcaaaacgaggcccgaagatttccgagacctcagtttccccaaa
ctcatcatgatcactgattacttgctgctcttccgggtctatgggctcgagagcctgaaggacctgttccccaacctcac
ggtcatccggggatcacgactgttctttaactacgcgctggtcatcttcgagatggttcacctcaaggaactcggcctct
acaacctgatgaacatcacccggggttctgtccgcatcgagaagaacaatgagctctgttacttggccactatcgactgg
tcccgtatcctggattccgtggaggataatcacatcgtgttgaacaaagatgacaacgaggagtgtggagacatctgtcc
gggtaccgcgaagggcaagaccaactgccccgccaccgtcatcaacgggcagtttgtcgaacgatgttggactcatagtc
actgccagaaagtttgcccgaccatctgtaagtcacacggctgcaccgccgaaggcctctgttgccacagcgagtgcctg
ggcaactgttctcagcccgacgaccccaccaagtgcgtggcctgccgcaacttctacctggacggcaggtgtgtggagac
ctgcccgcccccgtactaccacttccaggactggcgctgtgtgaacttcagcttctgccaggacctgcaccacaaatgca
agaactcgcggaggcagggctgccaccaatacgtcattcacaacaacaagtgcatccctgagtgtccctccgggtacacg
atgaattccagcaacttgctgtgcaccccatgcctgggtccctgtcccaaggtgtgccacctcctagaaggcgagaagac
catcgactcggtgacgtctgcccaggagctccgaggatgcaccgtcatcaacgggagtctgatcatcaacattcgaggag
gcaacaatctggcagctgagctagaagccaacctcggcctcattgaagaaatttcagggtatctaaaaatccgccgatcc
tacgctctggtgtcactttccttcttccggaagttacgtctgattcgaggagagaccttggaaattgggaactactcctt
ctatgccttggacaaccagaacctaaggcagctctgggactggagcaaacacaacctcaccaccactcaggggaaactct
tcttccactataaccccaaactctgcttgtcagaaatccacaagatggaagaagtttcaggaaccaaggggcgccaggag
agaaacgacattgccctgaagaccaatggggacaaggcatcctgtgaaaatgagttacttaaattttcttacattcggac
atcttttgacaagatcttgctgagatgggagccgtactggccccccgacttccgagacctcttggggttcatgctgttct
acaaagaggccccttatcagaatgtgacggagttcgatgggcaggatgcgtgtggttccaacagttggacggtggtagac
attgacccacccctgaggtccaacgaccccaaatcacagaaccacccagggtggctgatgcggggtctcaagccctggac
ccagtatgccatctttgtgaagaccctggtcaccttttcggatgaacgccggacctatggggccaagagtgacatcattt
atgtccagacagatgccaccaacccctctgtgcccctggatccaatctcagtgtctaactcatcatcccagattattctg
aagtggaaaccaccctccgaccccaatggcaacatcacccactacctggttttctgggagaggcaggcggaagacagtga
gctgttcgagctggattattgcctcaaagggctgaagctgccctcgaggacctggtctccaccattcgagtctgaagatt
ctcagaagcacaaccagagtgagtatgaggattcggccggcgaatgctgctcctgtccaaagacagactctcagatcctg
aaggagctggaggagtcctcgtttaggaagacgtttgaggattacctgcacaacgtggttttcgtccccagaaaaacctc
ttcaggcactggtgccgaggaccctaggccatctcggaaacgcaggtcccttggcgatgttgggaatgtgacggtggccg
tgcccacggtggcagctttccccaacacttcctcgaccagcgtgcccacgagtccggaggagcacaggccttttgagaag
gtggtgaacaaggagtcgctggtcatctccggcttgcgacacttcacgggctatcgcatcgagctgcaggcttgcaacca
ggacacccctgaggaacggtgcagtgtggcagcctacgtcagtgcgaggaccatgcctgaagccaaggctgatgacattg
ttggccctgtgacgcatgaaatctttgagaacaacgtcgtccacttgatgtggcaggagccgaaggagcccaatggtctg
atcgtgctgtatgaagtgagttatcggcgatatggtgatgaggagctgcatctctgcgtctcccgcaagcacttcgctct
ggaacggggctgcaggctgcgtgggctgtcaccggggaactacagcgtgcgaatccgggccacctcccttgcgggcaacg
gctcttggacggaacccacctatttctacgtgacagactatttagacgtcccgtcaaatattgcaaaaattatcatcggc
cccctcatctttgtctttctcttcagtgttgtgattggaagtatttatctattcctgagaaagaggcagccagatgggcc
gctgggaccgctttacgcttcttcaaaccctgagtatctcagtgccagtgatgtgtttccatgctctgtgtacgtgccgg
acgagtgggaggtgtctcgagagaagatcaccctccttcgagagctggggcagggctccttcggcatggtgtatgagggc
aatgccagggacatcatcaagggtgaggcagagacccgcgtggcggtgaagacggtcaacgagtcagccagtctccgaga
gcggattgagttcctcaatgaggcctcggtcatgaagggcttcacctgccatcacgtggtgcgcctcctgggagtggtgt
ccaagggccagcccacgctggtggtgatggagctgatggctcacggagacctgaagagctacctccgttctctgcggcca
gaggctgagaataatcctggccgccctccccctacccttcaagagatgattcagatggcggcagagattgctgacgggat
ggcctacctgaacgccaagaagtttgtgcatcgggacctggcagcgagaaactgcatggtcgcccatgattttactgtca
aaattggagactttggaatgaccagagacatctatgaaacggattactaccggaaagggggcaagggtctgctccctgta
cggtggatggcaccggagtccctgaaggatggggtcttcaccacttcttctgacatgtggtcctttggcgtggtcctttg
ggaaatcaccagcttggcagaacagccttaccaaggcctgtctaatgaacaggtgttgaaatttgtcatggatggagggt
atctggatcaacccgacaactgtccagagagagtcactgacctcatgcgcatgtgctggcaattcaaccccaagatgagg
ccaaccttcctggagattgtcaacctgctcaaggacgacctgcaccccagctttccagaggtgtcgttcttccacagcga
ggagaacaaggctcccgagagtgaggagctggagatggagtttgaggacatggagaatgtgcccctggaccgttcctcgc
actgtcagagggaggaggcggggggccgggatggagggtcctcgctgggtttcaagcggagctacgaggaacacatccct
tacacacacatgaacggaggcaagaaaaacgggcggattctgaccttgcctcggtccaatccttcctaa
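The function format_fasta is defined elsewhere in the notebook. For readers who want to see the idea in isolation, here is a minimal sketch of a comparable helper; wrap_fasta and its width parameter are hypothetical, and the default of 80 columns is chosen to match the line length of the output above.

```python
def wrap_fasta(header, sequence, width=80):
    """Return header and sequence as FASTA text, with sequence rows of at most width characters."""
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return '\n'.join([header] + rows)

toy_fasta = wrap_fasta('>toy; len = 10', 'acgtacgtac', width=4)
print(toy_fasta)
```

Only the last row can be shorter than the chosen width, as happens with the final 69-base row of the CDS above.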

Codon frequency distribution

Obtain in codon_list the list of codons (stop codon excluded) of the coding sequence.


In [35]:
codon_list = re.findall(r'\w{3}', cds_sequence)[:-1]

In [36]:
codon_list


Out[36]:
['atg',
 'ggc',
 'acc',
 'ggg',
 'ggc',
 'cgg',
 'cgg',
 'ggg',
 'gcg',
 'gcg',
 'gcc',
 'gcg',
 'ccg',
 'ctg',
 'ctg',
 'gtg',
 'gcg',
 'gtg',
 'gcc',
 'gcg',
 'ctg',
 'cta',
 'ctg',
 'ggc',
 'gcc',
 'gcg',
 'ggc',
 'cac',
 'ctg',
 'tac',
 'ccc',
 'gga',
 'gag',
 'gtg',
 'tgt',
 'ccc',
 'ggc',
 'atg',
 'gat',
 'atc',
 'cgg',
 'aac',
 'aac',
 'ctc',
 'act',
 'agg',
 'ttg',
 'cat',
 'gag',
 'ctg',
 'gag',
 'aat',
 'tgc',
 'tct',
 'gtc',
 'atc',
 'gaa',
 'gga',
 'cac',
 'ttg',
 'cag',
 'ata',
 'ctc',
 'ttg',
 'atg',
 'ttc',
 'aaa',
 'acg',
 'agg',
 'ccc',
 'gaa',
 'gat',
 'ttc',
 'cga',
 'gac',
 'ctc',
 'agt',
 'ttc',
 'ccc',
 'aaa',
 'ctc',
 'atc',
 'atg',
 'atc',
 'act',
 'gat',
 'tac',
 'ttg',
 'ctg',
 'ctc',
 'ttc',
 'cgg',
 'gtc',
 'tat',
 'ggg',
 'ctc',
 'gag',
 'agc',
 'ctg',
 'aag',
 'gac',
 'ctg',
 'ttc',
 'ccc',
 'aac',
 'ctc',
 'acg',
 'gtc',
 'atc',
 'cgg',
 'gga',
 'tca',
 'cga',
 'ctg',
 'ttc',
 'ttt',
 'aac',
 'tac',
 'gcg',
 'ctg',
 'gtc',
 'atc',
 'ttc',
 'gag',
 'atg',
 'gtt',
 'cac',
 'ctc',
 'aag',
 'gaa',
 'ctc',
 'ggc',
 'ctc',
 'tac',
 'aac',
 'ctg',
 'atg',
 'aac',
 'atc',
 'acc',
 'cgg',
 'ggt',
 'tct',
 'gtc',
 'cgc',
 'atc',
 'gag',
 'aag',
 'aac',
 'aat',
 'gag',
 'ctc',
 'tgt',
 'tac',
 'ttg',
 'gcc',
 'act',
 'atc',
 'gac',
 'tgg',
 'tcc',
 'cgt',
 'atc',
 'ctg',
 'gat',
 'tcc',
 'gtg',
 'gag',
 'gat',
 'aat',
 'cac',
 'atc',
 'gtg',
 'ttg',
 'aac',
 'aaa',
 'gat',
 'gac',
 'aac',
 'gag',
 'gag',
 'tgt',
 'gga',
 'gac',
 'atc',
 'tgt',
 'ccg',
 'ggt',
 'acc',
 'gcg',
 'aag',
 'ggc',
 'aag',
 'acc',
 'aac',
 'tgc',
 'ccc',
 'gcc',
 'acc',
 'gtc',
 'atc',
 'aac',
 'ggg',
 'cag',
 'ttt',
 'gtc',
 'gaa',
 'cga',
 'tgt',
 'tgg',
 'act',
 'cat',
 'agt',
 'cac',
 'tgc',
 'cag',
 'aaa',
 'gtt',
 'tgc',
 'ccg',
 'acc',
 'atc',
 'tgt',
 'aag',
 'tca',
 'cac',
 'ggc',
 'tgc',
 'acc',
 'gcc',
 'gaa',
 'ggc',
 'ctc',
 'tgt',
 'tgc',
 'cac',
 'agc',
 'gag',
 'tgc',
 'ctg',
 'ggc',
 'aac',
 'tgt',
 'tct',
 'cag',
 'ccc',
 'gac',
 'gac',
 'ccc',
 'acc',
 'aag',
 'tgc',
 'gtg',
 'gcc',
 'tgc',
 'cgc',
 'aac',
 'ttc',
 'tac',
 'ctg',
 'gac',
 'ggc',
 'agg',
 'tgt',
 'gtg',
 'gag',
 'acc',
 'tgc',
 'ccg',
 'ccc',
 'ccg',
 'tac',
 'tac',
 'cac',
 'ttc',
 'cag',
 'gac',
 'tgg',
 'cgc',
 'tgt',
 'gtg',
 'aac',
 'ttc',
 'agc',
 'ttc',
 'tgc',
 'cag',
 'gac',
 'ctg',
 'cac',
 'cac',
 'aaa',
 'tgc',
 'aag',
 'aac',
 'tcg',
 'cgg',
 'agg',
 'cag',
 'ggc',
 'tgc',
 'cac',
 'caa',
 'tac',
 'gtc',
 'att',
 'cac',
 'aac',
 'aac',
 'aag',
 'tgc',
 'atc',
 'cct',
 'gag',
 'tgt',
 'ccc',
 'tcc',
 'ggg',
 'tac',
 'acg',
 'atg',
 'aat',
 'tcc',
 'agc',
 'aac',
 'ttg',
 'ctg',
 'tgc',
 'acc',
 'cca',
 'tgc',
 'ctg',
 'ggt',
 'ccc',
 'tgt',
 'ccc',
 'aag',
 'gtg',
 'tgc',
 'cac',
 'ctc',
 'cta',
 'gaa',
 'ggc',
 'gag',
 'aag',
 'acc',
 'atc',
 'gac',
 'tcg',
 'gtg',
 'acg',
 'tct',
 'gcc',
 'cag',
 'gag',
 'ctc',
 'cga',
 'gga',
 'tgc',
 'acc',
 'gtc',
 'atc',
 'aac',
 'ggg',
 'agt',
 'ctg',
 'atc',
 'atc',
 'aac',
 'att',
 'cga',
 'gga',
 'ggc',
 'aac',
 'aat',
 'ctg',
 'gca',
 'gct',
 'gag',
 'cta',
 'gaa',
 'gcc',
 'aac',
 'ctc',
 'ggc',
 'ctc',
 'att',
 'gaa',
 'gaa',
 'att',
 'tca',
 'ggg',
 'tat',
 'cta',
 'aaa',
 'atc',
 'cgc',
 'cga',
 'tcc',
 'tac',
 'gct',
 'ctg',
 'gtg',
 'tca',
 'ctt',
 'tcc',
 'ttc',
 'ttc',
 'cgg',
 'aag',
 'tta',
 'cgt',
 'ctg',
 'att',
 'cga',
 'gga',
 'gag',
 'acc',
 'ttg',
 'gaa',
 'att',
 'ggg',
 'aac',
 'tac',
 'tcc',
 'ttc',
 'tat',
 'gcc',
 'ttg',
 'gac',
 'aac',
 'cag',
 'aac',
 'cta',
 'agg',
 'cag',
 'ctc',
 'tgg',
 'gac',
 'tgg',
 'agc',
 'aaa',
 'cac',
 'aac',
 'ctc',
 'acc',
 'acc',
 'act',
 'cag',
 'ggg',
 'aaa',
 'ctc',
 'ttc',
 'ttc',
 'cac',
 'tat',
 'aac',
 'ccc',
 'aaa',
 'ctc',
 'tgc',
 'ttg',
 'tca',
 'gaa',
 'atc',
 'cac',
 'aag',
 'atg',
 'gaa',
 'gaa',
 'gtt',
 'tca',
 'gga',
 'acc',
 'aag',
 'ggg',
 'cgc',
 'cag',
 'gag',
 'aga',
 'aac',
 'gac',
 'att',
 'gcc',
 'ctg',
 'aag',
 'acc',
 'aat',
 'ggg',
 'gac',
 'aag',
 'gca',
 'tcc',
 'tgt',
 'gaa',
 'aat',
 'gag',
 'tta',
 'ctt',
 'aaa',
 'ttt',
 'tct',
 'tac',
 'att',
 'cgg',
 'aca',
 'tct',
 'ttt',
 'gac',
 'aag',
 'atc',
 'ttg',
 'ctg',
 'aga',
 'tgg',
 'gag',
 'ccg',
 'tac',
 'tgg',
 'ccc',
 'ccc',
 'gac',
 'ttc',
 'cga',
 'gac',
 'ctc',
 'ttg',
 'ggg',
 'ttc',
 'atg',
 'ctg',
 'ttc',
 'tac',
 'aaa',
 'gag',
 'gcc',
 'cct',
 'tat',
 'cag',
 'aat',
 'gtg',
 'acg',
 'gag',
 'ttc',
 'gat',
 'ggg',
 'cag',
 'gat',
 'gcg',
 'tgt',
 'ggt',
 'tcc',
 'aac',
 'agt',
 'tgg',
 'acg',
 'gtg',
 'gta',
 'gac',
 'att',
 'gac',
 'cca',
 'ccc',
 'ctg',
 'agg',
 'tcc',
 'aac',
 'gac',
 'ccc',
 'aaa',
 'tca',
 'cag',
 'aac',
 'cac',
 'cca',
 'ggg',
 'tgg',
 'ctg',
 'atg',
 'cgg',
 'ggt',
 'ctc',
 'aag',
 'ccc',
 'tgg',
 'acc',
 'cag',
 'tat',
 'gcc',
 'atc',
 'ttt',
 'gtg',
 'aag',
 'acc',
 'ctg',
 'gtc',
 'acc',
 'ttt',
 'tcg',
 'gat',
 'gaa',
 'cgc',
 'cgg',
 'acc',
 'tat',
 'ggg',
 'gcc',
 'aag',
 'agt',
 'gac',
 'atc',
 'att',
 'tat',
 'gtc',
 'cag',
 'aca',
 'gat',
 'gcc',
 'acc',
 'aac',
 'ccc',
 'tct',
 'gtg',
 'ccc',
 'ctg',
 'gat',
 'cca',
 'atc',
 'tca',
 'gtg',
 'tct',
 'aac',
 'tca',
 'tca',
 'tcc',
 'cag',
 'att',
 'att',
 'ctg',
 'aag',
 'tgg',
 'aaa',
 'cca',
 'ccc',
 'tcc',
 'gac',
 'ccc',
 'aat',
 'ggc',
 'aac',
 'atc',
 'acc',
 'cac',
 'tac',
 'ctg',
 'gtt',
 'ttc',
 'tgg',
 'gag',
 'agg',
 'cag',
 'gcg',
 'gaa',
 'gac',
 'agt',
 'gag',
 'ctg',
 'ttc',
 'gag',
 'ctg',
 'gat',
 'tat',
 'tgc',
 'ctc',
 'aaa',
 'ggg',
 'ctg',
 'aag',
 'ctg',
 'ccc',
 'tcg',
 'agg',
 'acc',
 'tgg',
 'tct',
 'cca',
 'cca',
 'ttc',
 'gag',
 'tct',
 'gaa',
 'gat',
 'tct',
 'cag',
 'aag',
 'cac',
 'aac',
 'cag',
 'agt',
 'gag',
 'tat',
 'gag',
 'gat',
 'tcg',
 'gcc',
 'ggc',
 'gaa',
 'tgc',
 'tgc',
 'tcc',
 'tgt',
 'cca',
 'aag',
 'aca',
 'gac',
 'tct',
 'cag',
 'atc',
 'ctg',
 'aag',
 'gag',
 'ctg',
 'gag',
 'gag',
 'tcc',
 'tcg',
 'ttt',
 'agg',
 'aag',
 'acg',
 'ttt',
 'gag',
 'gat',
 'tac',
 'ctg',
 'cac',
 'aac',
 'gtg',
 'gtt',
 'ttc',
 'gtc',
 'ccc',
 'aga',
 'aaa',
 'acc',
 'tct',
 'tca',
 'ggc',
 'act',
 'ggt',
 'gcc',
 'gag',
 'gac',
 'cct',
 'agg',
 'cca',
 'tct',
 'cgg',
 'aaa',
 'cgc',
 'agg',
 'tcc',
 'ctt',
 'ggc',
 'gat',
 'gtt',
 'ggg',
 'aat',
 'gtg',
 'acg',
 'gtg',
 'gcc',
 'gtg',
 'ccc',
 'acg',
 'gtg',
 'gca',
 'gct',
 'ttc',
 'ccc',
 'aac',
 'act',
 'tcc',
 'tcg',
 'acc',
 'agc',
 'gtg',
 'ccc',
 'acg',
 'agt',
 'ccg',
 'gag',
 'gag',
 'cac',
 'agg',
 'cct',
 'ttt',
 'gag',
 'aag',
 'gtg',
 'gtg',
 'aac',
 'aag',
 'gag',
 'tcg',
 'ctg',
 'gtc',
 'atc',
 'tcc',
 'ggc',
 'ttg',
 'cga',
 'cac',
 'ttc',
 'acg',
 'ggc',
 'tat',
 'cgc',
 'atc',
 'gag',
 'ctg',
 'cag',
 'gct',
 'tgc',
 'aac',
 'cag',
 'gac',
 'acc',
 'cct',
 'gag',
 'gaa',
 'cgg',
 'tgc',
 'agt',
 'gtg',
 'gca',
 'gcc',
 'tac',
 'gtc',
 'agt',
 'gcg',
 'agg',
 'acc',
 'atg',
 'cct',
 'gaa',
 'gcc',
 'aag',
 'gct',
 'gat',
 'gac',
 'att',
 'gtt',
 'ggc',
 'cct',
 'gtg',
 'acg',
 'cat',
 'gaa',
 'atc',
 'ttt',
 'gag',
 'aac',
 'aac',
 'gtc',
 'gtc',
 'cac',
 'ttg',
 'atg',
 'tgg',
 'cag',
 'gag',
 'ccg',
 'aag',
 'gag',
 'ccc',
 'aat',
 'ggt',
 'ctg',
 'atc',
 'gtg',
 'ctg',
 'tat',
 'gaa',
 'gtg',
 'agt',
 'tat',
 'cgg',
 'cga',
 'tat',
 'ggt',
 'gat',
 'gag',
 'gag',
 'ctg',
 'cat',
 'ctc',
 'tgc',
 'gtc',
 'tcc',
 'cgc',
 'aag',
 'cac',
 'ttc',
 'gct',
 'ctg',
 'gaa',
 'cgg',
 'ggc',
 'tgc',
 'agg',
 'ctg',
 'cgt',
 'ggg',
 'ctg',
 'tca',
 'ccg',
 'ggg',
 'aac',
 'tac',
 'agc',
 'gtg',
 'cga',
 'atc',
 'cgg',
 'gcc',
 'acc',
 'tcc',
 'ctt',
 'gcg',
 'ggc',
 'aac',
 'ggc',
 'tct',
 'tgg',
 'acg',
 'gaa',
 'ccc',
 'acc',
 'tat',
 'ttc',
 'tac',
 'gtg',
 'aca',
 'gac',
 'tat',
 'tta',
 'gac',
 'gtc',
 'ccg',
 'tca',
 'aat',
 'att',
 'gca',
 'aaa',
 'att',
 'atc',
 'atc',
 'ggc',
 'ccc',
 'ctc',
 'atc',
 'ttt',
 'gtc',
 'ttt',
 'ctc',
 'ttc',
 'agt',
 'gtt',
 'gtg',
 'att',
 'gga',
 'agt',
 'att',
 'tat',
 'cta',
 'ttc',
 'ctg',
 'aga',
 'aag',
 'agg',
 'cag',
 'cca',
 'gat',
 'ggg',
 'ccg',
 'ctg',
 'gga',
 'ccg',
 'ctt',
 'tac',
 'gct',
 'tct',
 'tca',
 'aac',
 'cct',
 'gag',
 'tat',
 'ctc',
 ...]

Build the list codon_frequency of (codon, frequency) tuples, sorted by decreasing frequency.


In [37]:
from collections import Counter

codon_counter = Counter(codon_list)
codon_frequency = codon_counter.most_common()

In [38]:
codon_frequency


Out[38]:
[('gag', 77),
 ('ctg', 67),
 ('aac', 53),
 ('gtg', 49),
 ('aag', 47),
 ('gac', 44),
 ('atc', 41),
 ('acc', 39),
 ('ttc', 39),
 ('ggc', 37),
 ('ccc', 37),
 ('ctc', 35),
 ('cag', 32),
 ('cac', 31),
 ('atg', 30),
 ('tgc', 30),
 ('tac', 29),
 ('gaa', 29),
 ('ggg', 28),
 ('gcc', 28),
 ('gtc', 28),
 ('tcc', 27),
 ('gat', 26),
 ('cgg', 25),
 ('att', 23),
 ('tct', 22),
 ('aaa', 21),
 ('tat', 21),
 ('ttt', 20),
 ('tgg', 20),
 ('aat', 19),
 ('agg', 18),
 ('gga', 17),
 ('tgt', 17),
 ('ttg', 17),
 ('agt', 17),
 ('gcg', 16),
 ('acg', 16),
 ('cct', 16),
 ('tca', 15),
 ('cca', 15),
 ('ccg', 14),
 ('cga', 14),
 ('cgc', 13),
 ('agc', 12),
 ('tcg', 12),
 ('ggt', 11),
 ('gct', 11),
 ('act', 10),
 ('gca', 10),
 ('gtt', 8),
 ('ctt', 8),
 ('cat', 7),
 ('aga', 7),
 ('cta', 6),
 ('cgt', 5),
 ('caa', 5),
 ('aca', 5),
 ('tta', 3),
 ('gta', 2),
 ('ata', 1)]
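
As a minimal sketch of what Counter.most_common() returns, the following toy example uses a short, made-up codon list (toy_codons and toy_frequency are hypothetical names, not part of the exercise). Codons with the same count keep the order in which they were first encountered, which is consistent with how ties such as 'acc' and 'ttc' appear in the output above.

```python
from collections import Counter

# Toy codon list: hypothetical data, not taken from the EMBL entry.
toy_codons = ['atg', 'gag', 'ctg', 'gag', 'ctg', 'gag']

# most_common() with no argument returns every (item, count) pair,
# sorted from the most frequent to the least frequent.
toy_frequency = Counter(toy_codons).most_common()

print(toy_frequency)  # [('gag', 3), ('ctg', 2), ('atg', 1)]
```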

Amino acid frequency distribution

Store in ammino_list the list of the amino acids of the protein.


In [39]:
# Raw string, so that \w is not parsed as a (deprecated) string escape sequence
ammino_list = re.findall(r'\w', protein_sequence)

In [40]:
ammino_list


Out[40]:
['M',
 'G',
 'T',
 'G',
 'G',
 'R',
 'R',
 'G',
 'A',
 'A',
 'A',
 'A',
 'P',
 'L',
 'L',
 'V',
 'A',
 'V',
 'A',
 'A',
 'L',
 'L',
 'L',
 'G',
 'A',
 'A',
 'G',
 'H',
 'L',
 'Y',
 'P',
 'G',
 'E',
 'V',
 'C',
 'P',
 'G',
 'M',
 'D',
 'I',
 'R',
 'N',
 'N',
 'L',
 'T',
 'R',
 'L',
 'H',
 'E',
 'L',
 'E',
 'N',
 'C',
 'S',
 'V',
 'I',
 'E',
 'G',
 'H',
 'L',
 'Q',
 'I',
 'L',
 'L',
 'M',
 'F',
 'K',
 'T',
 'R',
 'P',
 'E',
 'D',
 'F',
 'R',
 'D',
 'L',
 'S',
 'F',
 'P',
 'K',
 'L',
 'I',
 'M',
 'I',
 'T',
 'D',
 'Y',
 'L',
 'L',
 'L',
 'F',
 'R',
 'V',
 'Y',
 'G',
 'L',
 'E',
 'S',
 'L',
 'K',
 'D',
 'L',
 'F',
 'P',
 'N',
 'L',
 'T',
 'V',
 'I',
 'R',
 'G',
 'S',
 'R',
 'L',
 'F',
 'F',
 'N',
 'Y',
 'A',
 'L',
 'V',
 'I',
 'F',
 'E',
 'M',
 'V',
 'H',
 'L',
 'K',
 'E',
 'L',
 'G',
 'L',
 'Y',
 'N',
 'L',
 'M',
 'N',
 'I',
 'T',
 'R',
 'G',
 'S',
 'V',
 'R',
 'I',
 'E',
 'K',
 'N',
 'N',
 'E',
 'L',
 'C',
 'Y',
 'L',
 'A',
 'T',
 'I',
 'D',
 'W',
 'S',
 'R',
 'I',
 'L',
 'D',
 'S',
 'V',
 'E',
 'D',
 'N',
 'H',
 'I',
 'V',
 'L',
 'N',
 'K',
 'D',
 'D',
 'N',
 'E',
 'E',
 'C',
 'G',
 'D',
 'I',
 'C',
 'P',
 'G',
 'T',
 'A',
 'K',
 'G',
 'K',
 'T',
 'N',
 'C',
 'P',
 'A',
 'T',
 'V',
 'I',
 'N',
 'G',
 'Q',
 'F',
 'V',
 'E',
 'R',
 'C',
 'W',
 'T',
 'H',
 'S',
 'H',
 'C',
 'Q',
 'K',
 'V',
 'C',
 'P',
 'T',
 'I',
 'C',
 'K',
 'S',
 'H',
 'G',
 'C',
 'T',
 'A',
 'E',
 'G',
 'L',
 'C',
 'C',
 'H',
 'S',
 'E',
 'C',
 'L',
 'G',
 'N',
 'C',
 'S',
 'Q',
 'P',
 'D',
 'D',
 'P',
 'T',
 'K',
 'C',
 'V',
 'A',
 'C',
 'R',
 'N',
 'F',
 'Y',
 'L',
 'D',
 'G',
 'R',
 'C',
 'V',
 'E',
 'T',
 'C',
 'P',
 'P',
 'P',
 'Y',
 'Y',
 'H',
 'F',
 'Q',
 'D',
 'W',
 'R',
 'C',
 'V',
 'N',
 'F',
 'S',
 'F',
 'C',
 'Q',
 'D',
 'L',
 'H',
 'H',
 'K',
 'C',
 'K',
 'N',
 'S',
 'R',
 'R',
 'Q',
 'G',
 'C',
 'H',
 'Q',
 'Y',
 'V',
 'I',
 'H',
 'N',
 'N',
 'K',
 'C',
 'I',
 'P',
 'E',
 'C',
 'P',
 'S',
 'G',
 'Y',
 'T',
 'M',
 'N',
 'S',
 'S',
 'N',
 'L',
 'L',
 'C',
 'T',
 'P',
 'C',
 'L',
 'G',
 'P',
 'C',
 'P',
 'K',
 'V',
 'C',
 'H',
 'L',
 'L',
 'E',
 'G',
 'E',
 'K',
 'T',
 'I',
 'D',
 'S',
 'V',
 'T',
 'S',
 'A',
 'Q',
 'E',
 'L',
 'R',
 'G',
 'C',
 'T',
 'V',
 'I',
 'N',
 'G',
 'S',
 'L',
 'I',
 'I',
 'N',
 'I',
 'R',
 'G',
 'G',
 'N',
 'N',
 'L',
 'A',
 'A',
 'E',
 'L',
 'E',
 'A',
 'N',
 'L',
 'G',
 'L',
 'I',
 'E',
 'E',
 'I',
 'S',
 'G',
 'Y',
 'L',
 'K',
 'I',
 'R',
 'R',
 'S',
 'Y',
 'A',
 'L',
 'V',
 'S',
 'L',
 'S',
 'F',
 'F',
 'R',
 'K',
 'L',
 'R',
 'L',
 'I',
 'R',
 'G',
 'E',
 'T',
 'L',
 'E',
 'I',
 'G',
 'N',
 'Y',
 'S',
 'F',
 'Y',
 'A',
 'L',
 'D',
 'N',
 'Q',
 'N',
 'L',
 'R',
 'Q',
 'L',
 'W',
 'D',
 'W',
 'S',
 'K',
 'H',
 'N',
 'L',
 'T',
 'T',
 'T',
 'Q',
 'G',
 'K',
 'L',
 'F',
 'F',
 'H',
 'Y',
 'N',
 'P',
 'K',
 'L',
 'C',
 'L',
 'S',
 'E',
 'I',
 'H',
 'K',
 'M',
 'E',
 'E',
 'V',
 'S',
 'G',
 'T',
 'K',
 'G',
 'R',
 'Q',
 'E',
 'R',
 'N',
 'D',
 'I',
 'A',
 'L',
 'K',
 'T',
 'N',
 'G',
 'D',
 'K',
 'A',
 'S',
 'C',
 'E',
 'N',
 'E',
 'L',
 'L',
 'K',
 'F',
 'S',
 'Y',
 'I',
 'R',
 'T',
 'S',
 'F',
 'D',
 'K',
 'I',
 'L',
 'L',
 'R',
 'W',
 'E',
 'P',
 'Y',
 'W',
 'P',
 'P',
 'D',
 'F',
 'R',
 'D',
 'L',
 'L',
 'G',
 'F',
 'M',
 'L',
 'F',
 'Y',
 'K',
 'E',
 'A',
 'P',
 'Y',
 'Q',
 'N',
 'V',
 'T',
 'E',
 'F',
 'D',
 'G',
 'Q',
 'D',
 'A',
 'C',
 'G',
 'S',
 'N',
 'S',
 'W',
 'T',
 'V',
 'V',
 'D',
 'I',
 'D',
 'P',
 'P',
 'L',
 'R',
 'S',
 'N',
 'D',
 'P',
 'K',
 'S',
 'Q',
 'N',
 'H',
 'P',
 'G',
 'W',
 'L',
 'M',
 'R',
 'G',
 'L',
 'K',
 'P',
 'W',
 'T',
 'Q',
 'Y',
 'A',
 'I',
 'F',
 'V',
 'K',
 'T',
 'L',
 'V',
 'T',
 'F',
 'S',
 'D',
 'E',
 'R',
 'R',
 'T',
 'Y',
 'G',
 'A',
 'K',
 'S',
 'D',
 'I',
 'I',
 'Y',
 'V',
 'Q',
 'T',
 'D',
 'A',
 'T',
 'N',
 'P',
 'S',
 'V',
 'P',
 'L',
 'D',
 'P',
 'I',
 'S',
 'V',
 'S',
 'N',
 'S',
 'S',
 'S',
 'Q',
 'I',
 'I',
 'L',
 'K',
 'W',
 'K',
 'P',
 'P',
 'S',
 'D',
 'P',
 'N',
 'G',
 'N',
 'I',
 'T',
 'H',
 'Y',
 'L',
 'V',
 'F',
 'W',
 'E',
 'R',
 'Q',
 'A',
 'E',
 'D',
 'S',
 'E',
 'L',
 'F',
 'E',
 'L',
 'D',
 'Y',
 'C',
 'L',
 'K',
 'G',
 'L',
 'K',
 'L',
 'P',
 'S',
 'R',
 'T',
 'W',
 'S',
 'P',
 'P',
 'F',
 'E',
 'S',
 'E',
 'D',
 'S',
 'Q',
 'K',
 'H',
 'N',
 'Q',
 'S',
 'E',
 'Y',
 'E',
 'D',
 'S',
 'A',
 'G',
 'E',
 'C',
 'C',
 'S',
 'C',
 'P',
 'K',
 'T',
 'D',
 'S',
 'Q',
 'I',
 'L',
 'K',
 'E',
 'L',
 'E',
 'E',
 'S',
 'S',
 'F',
 'R',
 'K',
 'T',
 'F',
 'E',
 'D',
 'Y',
 'L',
 'H',
 'N',
 'V',
 'V',
 'F',
 'V',
 'P',
 'R',
 'K',
 'T',
 'S',
 'S',
 'G',
 'T',
 'G',
 'A',
 'E',
 'D',
 'P',
 'R',
 'P',
 'S',
 'R',
 'K',
 'R',
 'R',
 'S',
 'L',
 'G',
 'D',
 'V',
 'G',
 'N',
 'V',
 'T',
 'V',
 'A',
 'V',
 'P',
 'T',
 'V',
 'A',
 'A',
 'F',
 'P',
 'N',
 'T',
 'S',
 'S',
 'T',
 'S',
 'V',
 'P',
 'T',
 'S',
 'P',
 'E',
 'E',
 'H',
 'R',
 'P',
 'F',
 'E',
 'K',
 'V',
 'V',
 'N',
 'K',
 'E',
 'S',
 'L',
 'V',
 'I',
 'S',
 'G',
 'L',
 'R',
 'H',
 'F',
 'T',
 'G',
 'Y',
 'R',
 'I',
 'E',
 'L',
 'Q',
 'A',
 'C',
 'N',
 'Q',
 'D',
 'T',
 'P',
 'E',
 'E',
 'R',
 'C',
 'S',
 'V',
 'A',
 'A',
 'Y',
 'V',
 'S',
 'A',
 'R',
 'T',
 'M',
 'P',
 'E',
 'A',
 'K',
 'A',
 'D',
 'D',
 'I',
 'V',
 'G',
 'P',
 'V',
 'T',
 'H',
 'E',
 'I',
 'F',
 'E',
 'N',
 'N',
 'V',
 'V',
 'H',
 'L',
 'M',
 'W',
 'Q',
 'E',
 'P',
 'K',
 'E',
 'P',
 'N',
 'G',
 'L',
 'I',
 'V',
 'L',
 'Y',
 'E',
 'V',
 'S',
 'Y',
 'R',
 'R',
 'Y',
 'G',
 'D',
 'E',
 'E',
 'L',
 'H',
 'L',
 'C',
 'V',
 'S',
 'R',
 'K',
 'H',
 'F',
 'A',
 'L',
 'E',
 'R',
 'G',
 'C',
 'R',
 'L',
 'R',
 'G',
 'L',
 'S',
 'P',
 'G',
 'N',
 'Y',
 'S',
 'V',
 'R',
 'I',
 'R',
 'A',
 'T',
 'S',
 'L',
 'A',
 'G',
 'N',
 'G',
 'S',
 'W',
 'T',
 'E',
 'P',
 'T',
 'Y',
 'F',
 'Y',
 'V',
 'T',
 'D',
 'Y',
 'L',
 'D',
 'V',
 'P',
 'S',
 'N',
 'I',
 'A',
 'K',
 'I',
 'I',
 'I',
 'G',
 'P',
 'L',
 'I',
 'F',
 'V',
 'F',
 'L',
 'F',
 'S',
 'V',
 'V',
 'I',
 'G',
 'S',
 'I',
 'Y',
 'L',
 'F',
 'L',
 'R',
 'K',
 'R',
 'Q',
 'P',
 'D',
 'G',
 'P',
 'L',
 'G',
 'P',
 'L',
 'Y',
 'A',
 'S',
 'S',
 'N',
 'P',
 'E',
 'Y',
 'L',
 ...]

Build the list ammino_frequency of (amino acid, frequency) tuples, sorted by decreasing frequency.


In [41]:
from collections import Counter

ammino_counter = Counter(ammino_list)
ammino_frequency = ammino_counter.most_common()

In [42]:
ammino_frequency


Out[42]:
[('L', 136),
 ('E', 106),
 ('S', 105),
 ('G', 93),
 ('V', 87),
 ('R', 82),
 ('P', 82),
 ('N', 72),
 ('T', 70),
 ('D', 70),
 ('K', 68),
 ('A', 65),
 ('I', 65),
 ('F', 59),
 ('Y', 50),
 ('C', 47),
 ('H', 38),
 ('Q', 37),
 ('M', 30),
 ('W', 20)]
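
If relative frequencies are also of interest, the absolute counts can be divided by the total number of residues. The sketch below is only an illustration on a made-up protein string (toy_protein, toy_counter and toy_relative are hypothetical names). It also relies on the fact that Counter can count the characters of a string directly, which gives the same counts as a list of single letters when the string contains only amino acid symbols.

```python
from collections import Counter

# Toy protein: hypothetical data, not the protein of the EMBL entry.
toy_protein = 'MGLLEL'

# Counter iterates over the string character by character.
toy_counter = Counter(toy_protein)
total = sum(toy_counter.values())

# (amino acid, relative frequency) tuples, by decreasing frequency.
toy_relative = [(amino, count / total)
                for amino, count in toy_counter.most_common()]

print(toy_relative[0])  # ('L', 0.5)
```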

Validation of the protein sequence

Store in the variable cds_translation the translation of the CDS obtained through the genetic code.


In [43]:
cds_translation = ''.join([genetic_code_dict[codon] for codon in codon_list])

In [44]:
cds_translation


Out[44]:
'MGTGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLTRLHELENCSVIEGHLQILLMFKTRPEDFRDLSFPKLIMITDYLLLFRVYGLESLKDLFPNLTVIRGSRLFFNYALVIFEMVHLKELGLYNLMNITRGSVRIEKNNELCYLATIDWSRILDSVEDNHIVLNKDDNEECGDICPGTAKGKTNCPATVINGQFVERCWTHSHCQKVCPTICKSHGCTAEGLCCHSECLGNCSQPDDPTKCVACRNFYLDGRCVETCPPPYYHFQDWRCVNFSFCQDLHHKCKNSRRQGCHQYVIHNNKCIPECPSGYTMNSSNLLCTPCLGPCPKVCHLLEGEKTIDSVTSAQELRGCTVINGSLIINIRGGNNLAAELEANLGLIEEISGYLKIRRSYALVSLSFFRKLRLIRGETLEIGNYSFYALDNQNLRQLWDWSKHNLTTTQGKLFFHYNPKLCLSEIHKMEEVSGTKGRQERNDIALKTNGDKASCENELLKFSYIRTSFDKILLRWEPYWPPDFRDLLGFMLFYKEAPYQNVTEFDGQDACGSNSWTVVDIDPPLRSNDPKSQNHPGWLMRGLKPWTQYAIFVKTLVTFSDERRTYGAKSDIIYVQTDATNPSVPLDPISVSNSSSQIILKWKPPSDPNGNITHYLVFWERQAEDSELFELDYCLKGLKLPSRTWSPPFESEDSQKHNQSEYEDSAGECCSCPKTDSQILKELEESSFRKTFEDYLHNVVFVPRKTSSGTGAEDPRPSRKRRSLGDVGNVTVAVPTVAAFPNTSSTSVPTSPEEHRPFEKVVNKESLVISGLRHFTGYRIELQACNQDTPEERCSVAAYVSARTMPEAKADDIVGPVTHEIFENNVVHLMWQEPKEPNGLIVLYEVSYRRYGDEELHLCVSRKHFALERGCRLRGLSPGNYSVRIRATSLAGNGSWTEPTYFYVTDYLDVPSNIAKIIIGPLIFVFLFSVVIGSIYLFLRKRQPDGPLGPLYASSNPEYLSASDVFPCSVYVPDEWEVSREKITLLRELGQGSFGMVYEGNARDIIKGEAETRVAVKTVNESASLRERIEFLNEASVMKGFTCHHVVRLLGVVSKGQPTLVVMELMAHGDLKSYLRSLRPEAENNPGRPPPTLQEMIQMAAEIADGMAYLNAKKFVHRDLAARNCMVAHDFTVKIGDFGMTRDIYETDYYRKGGKGLLPVRWMAPESLKDGVFTTSSDMWSFGVVLWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDNCPERVTDLMRMCWQFNPKMRPTFLEIVNLLKDDLHPSFPEVSFFHSEENKAPESEELEMEFEDMENVPLDRSSHCQREEAGGRDGGSSLGFKRSYEEHIPYTHMNGGKKNGRILTLPRSNPS'
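
The translation step can be sketched end to end on a toy input. The example below assumes that the dictionary maps each codon to its amino acid symbol and that the final stop codon is dropped before translating; the records follow the genetic-code file format described at the start of the exercise, while toy_records, toy_code and toy_cds are hypothetical names and data used only for illustration.

```python
# A few records in the genetic-code file format:
# amino acid symbol first, then the codons that encode it.
toy_records = ['M,atg', 'G,ggt,ggc,gga,ggg', 'E,gaa,gag', 's,tga,taa,tag']

# Invert the records into a codon -> amino acid dictionary.
toy_code = {}
for record in toy_records:
    amino, *codons = record.strip().split(',')
    for codon in codons:
        toy_code[codon] = amino

# Toy CDS (hypothetical): start codon, two codons, stop codon.
toy_cds = 'atgggcgagtaa'
toy_codons = [toy_cds[i:i + 3] for i in range(0, len(toy_cds), 3)]

# Translate every codon except the final stop codon.
toy_translation = ''.join(toy_code[codon] for codon in toy_codons[:-1])

print(toy_translation)  # MGE
```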

Verify that the protein extracted from the input file is identical to the one obtained by translating the coding sequence of the mRNA.


In [45]:
cds_translation == protein_sequence


Out[45]:
True
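
When the comparison returns False, it can help to know where the two sequences diverge. The helper below is a hypothetical addition (first_mismatch is not part of the exercise) that returns the index of the first differing position, or None when the sequences are identical.

```python
def first_mismatch(translated, annotated):
    """Return the index of the first difference, or None if equal."""
    # Compare the two sequences position by position.
    for index, (a, b) in enumerate(zip(translated, annotated)):
        if a != b:
            return index
    # Same prefix but different lengths: they diverge where the shorter ends.
    if len(translated) != len(annotated):
        return min(len(translated), len(annotated))
    return None


print(first_mismatch('MGTGG', 'MGTGG'))  # None
print(first_mismatch('MGTGG', 'MGAGG'))  # 2
```

With the variables of this notebook, the call would be first_mismatch(cds_translation, protein_sequence).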