Exercise 3

EMBL (http://www.ebi.ac.uk/cgi-bin/sva/sva.pl/) is a nucleotide sequence database developed by EMBL-EBI (European Bioinformatics Institute, European Molecular Biology Laboratory). Each nucleotide sequence is stored, together with other information, in a text file (EMBL entry) in a format known as the EMBL format.

The EMBL format consists of records that begin with a two-letter uppercase code specifying the content of the record. The only records that do not begin with a two-letter code are those containing the nucleotide sequence.
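
As a rough illustration of this record structure, the sketch below lists the two-letter codes found in a heavily abridged entry. The string toy_entry is a hypothetical excerpt built from lines shown in this exercise, not the full file.

```python
import re

# Abridged, hypothetical excerpt of an EMBL entry (most records omitted).
toy_entry = (
    'ID   M10051; SV 1; linear; mRNA; STD; HUM; 4723 BP.\n'
    'XX\n'
    'AC   M10051;\n'
    'SQ   Sequence 4723 BP; 1068 A; 1298 C; 1311 G; 1046 T; 0 other;\n'
    '     ggggggctgc gcggccgggt cggtgcgcac acgagaagga cgcgcggccc ccagcgctct        60\n'
    '//\n'
)

# A record code is two uppercase letters at the start of a line.
# Sequence records start with spaces, so they are skipped, as is the terminator //.
codes = re.findall(r'^([A-Z]{2})\b', toy_entry, re.M)
```

Here codes collects ID, XX, AC and SQ, in that order, while the sequence record and the terminator contribute nothing.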

Given a file in EMBL format containing the nucleotide sequence (sequence of bases) of an mRNA (a transcript expressed by a gene), produce:

  • the nucleotide sequence in FASTA format
  • the sequence of the protein expressed by the gene in FASTA format

Input parameters:

  • name of the file in EMBL format

Where to find the information needed to solve the exercise:

  • The record that begins with ID

     ID   M10051; SV 1; linear; mRNA; STD; HUM; 4723 BP.

contains the unique identifier of the sequence (M10051) and the organism (HUM). The word mRNA indicates that the file refers to the nucleotide sequence of a transcript expressed by a gene.

  • The records that begin with FT contain the features of the nucleotide sequence. In particular, all the records of the section:

     FT                   /translation="MGTGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLT
     FT                   RLHELENCSVIEGHLQILLMFKTRPEDFRDLSFPKLIMITDYLLLFRVYGLESLKDLFP
     FT                   NLTVIRGSRLFFNYALVIFEMVHLKELGLYNLMNITRGSVRIEKNNELCYLATIDWSRI
     FT                   LDSVEDNHIVLNKDDNEECGDICPGTAKGKTNCPATVINGQFVERCWTHSHCQKVCPTI
     FT                   [...]
     FT                   DGGSSLGFKRSYEEHIPYTHMNGGKKNGRILTLPRSNPS"

contain the sequence of the protein expressed by the gene.

  • The record that begins with SQ:

      SQ   Sequence 4723 BP; 1068 A; 1298 C; 1311 G; 1046 T; 0 other;

introduces the nucleotide sequence section, which ends with the record // (end of file). Each record containing the nucleotide sequence begins with a run of leading spaces and contains a chunk of sequence 60 bases long (the last record may be shorter). The integer at the end of the record gives the total number of bases up to and including that record. Each chunk in a record is further split into smaller chunks of 10 bases.

SQ   Sequence 4723 BP; 1068 A; 1298 C; 1311 G; 1046 T; 0 other;
     ggggggctgc gcggccgggt cggtgcgcac acgagaagga cgcgcggccc ccagcgctct        60
     tgggggccgc ctcggagcat gacccccgcg ggccagcgcc gcgcgcctga tccgaggaga       120
     ccccgcgctc ccgcagccat gggcaccggg ggccggcggg gggcggcggc cgcgccgctg       180
     ctggtggcgg tggccgcgct gctactgggc gccgcgggcc acctgtaccc cggagaggtg       240
     tgtcccggca tggatatccg gaacaacctc actaggttgc atgagctgga gaattgctct       300
     gtcatcgaag gacacttgca gatactcttg atgttcaaaa cgaggcccga agatttccga       360
     gacctcagtt tccccaaact catcatgatc actgattact tgctgctctt ccgggtctat       420
     gggctcgaga gcctgaagga cctgttcccc aacctcacgg tcatccgggg atcacgactg       480
     [...]
     tttttcgttc cccccacccg cccccagcag atggaaagaa agcacctgtt tttacaaatt      4620
     cttttttttt tttttttttt tttttttttg ctggtgtctg agcttcagta taaaagacaa      4680
     aacttcctgt ttgtggaaca aaatttcgaa agaaaaaacc aaa                        4723
//
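
The layout of the sequence records can be checked with a short sketch. The helper check_running_totals() is hypothetical (it is not part of the exercise solution), and sample_records holds only the first three sequence records of the entry above.

```python
import re

# First three sequence records copied from the entry shown above.
sample_records = (
    "     ggggggctgc gcggccgggt cggtgcgcac acgagaagga cgcgcggccc ccagcgctct        60\n"
    "     tgggggccgc ctcggagcat gacccccgcg ggccagcgcc gcgcgcctga tccgaggaga       120\n"
    "     ccccgcgctc ccgcagccat gggcaccggg ggccggcggg gggcggcggc cgcgccgctg       180\n"
)

def check_running_totals(records):
    """Return True if every trailing integer equals the number of bases seen so far."""
    total = 0
    # Group 1: the space-separated chunks; group 2: the trailing integer.
    for bases, count in re.findall(r'^\s+([a-z ]+?)\s+(\d+)$', records, re.M):
        total += len(bases.replace(' ', ''))
        if total != int(count):
            return False
    return True

totals_ok = check_running_totals(sample_records)
```

For these three records totals_ok is True, since each record adds 60 bases and the trailing integers are 60, 120 and 180.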

NOTE:

  • the amino acid alphabet is {ACDEFGHIKLMNPQRSTVWY}
  • the nucleotide sequence in the EMBL entry uses the alphabet {a,c,g,t} even though it represents the primary sequence of an mRNA. To obtain the sequence over the alphabet {a,c,g,u}, replace every t with u.
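
A minimal sketch of both notes, assuming lowercase bases as in the entry. The helper name dna_to_rna() is hypothetical, and the protein fragment is the first translation line shown above.

```python
import re

def dna_to_rna(sequence):
    # Replace every t with u; the EMBL entry stores the bases in lowercase.
    return re.sub('t', 'u', sequence)

rna_chunk = dna_to_rna('ggggggctgc')

# Check that a fragment only uses the 20-letter amino acid alphabet.
fragment = 'MGTGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLT'
is_protein = re.fullmatch('[ACDEFGHIKLMNPQRSTVWY]+', fragment) is not None
```

The first chunk of the sequence becomes ggggggcugc, and is_protein is True for the fragment.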

Requirements:

  • the header of the nucleotide sequence in FASTA format must contain the unique identifier of the sequence and the organism it refers to, in the following format:

      >M10051-HUM
  • the nucleotide sequence must be produced over the alphabet {a,c,g,u}

  • the header of the protein sequence in FASTA format must contain the unique identifier of the sequence, the organism, and the length of the protein, in the following format:

      >M10051-HUM; len = 1382
  • the sequences in FASTA format must be produced in rows of 80 characters

  • a function format_fasta() must be defined that takes as arguments a FASTA header and a nucleotide/protein sequence, and returns the sequence in FASTA format split into rows of 80 characters.

  • use only regular expressions to extract the information
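
One possible way to meet the protein requirements with regular expressions is sketched below. The string ft_block is a shortened, hypothetical FT block that keeps only the first and last translation lines, so the length it yields is not the real protein length of 1382.

```python
import re

# Shortened, hypothetical FT block: only two translation lines are kept.
ft_block = (
    'FT                   /protein_id="AAA59174.1"\n'
    'FT                   /translation="MGTGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLT\n'
    'FT                   DGGSSLGFKRSYEEHIPYTHMNGGKKNGRILTLPRSNPS"\n'
    'FT   mat_peptide     220..2424\n'
)

# Capture everything between the quotes of /translation, across lines.
raw = re.search(r'/translation="([^"]+)"', ft_block).group(1)
# Drop the newline, the FT code and the padding that separate consecutive lines.
protein = re.sub(r'\nFT\s+', '', raw)
# Header in the required format, with the length of the extracted protein.
protein_header = '>M10051-HUM; len = ' + str(len(protein))
```

Applied to the complete entry, the same two steps would be expected to give the full protein, whose length then goes into the header.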


Output variables:

  • nucleotide_sequence_fasta: nucleotide sequence in FASTA format
  • protein_sequence_fasta: protein sequence in FASTA format

Solution

Definition of the function format_fasta()


In [1]:
def format_fasta(header, sequence):
    # Split the sequence into rows of at most 80 characters (uses the re module, imported below).
    return header + '\n' + '\n'.join(re.findall(r'\w{1,80}', sequence))

NOTE: assume that the header passed to the function does not end with the newline symbol \n but does begin with the symbol >.
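
As a quick check of this behavior, the sketch below repeats the logic of format_fasta() so that it runs on its own, and applies it to a made-up 200-base sequence (toy_sequence is hypothetical, not taken from the entry).

```python
import re

def format_fasta(header, sequence):
    # Same logic as the function defined above, repeated so the example is self-contained.
    return header + '\n' + '\n'.join(re.findall(r'\w{1,80}', sequence))

toy_sequence = 'acgu' * 50   # hypothetical sequence of 200 bases
fasta = format_fasta('>M10051-HUM', toy_sequence)

lines = fasta.split('\n')
line_lengths = [len(line) for line in lines[1:]]
```

The header stays on the first line, and the 200 bases are split into rows of 80, 80 and 40 characters.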

Input parameters


In [2]:
input_file_name = './M10051.txt'

Import the re module to use regular expressions (RE).


In [3]:
import re

Read the file (EMBL entry) into a single string file_str


In [4]:
with open(input_file_name,'r') as input_file:
    file_str = input_file.read()

In [5]:
file_str


Out[5]:
'ID   M10051; SV 1; linear; mRNA; STD; HUM; 4723 BP.\nXX\nAC   M10051;\nXX\nDT   02-JUL-1986 (Rel. 09, Created)\nDT   14-NOV-2006 (Rel. 89, Last updated, Version 7)\nXX\nDE   Human insulin receptor mRNA, complete cds.\nXX\nKW   insulin receptor; tyrosine kinase.\nXX\nOS   Homo sapiens (human)\nOC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;\nOC   Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae;\nOC   Homo.\nXX\nRN   [1]\nRP   1-4723\nRX   DOI; 10.1016/0092-8674(85)90334-4.\nRX   PUBMED; 2859121.\nRA   Ebina Y., Ellis L., Jarnagin K., Edery M., Graf L., Clauser E., Ou J.-H.,\nRA   Masiarz F., Kan Y.W., Goldfine I.D., Roth R.A., Rutter W.J.;\nRT   "The human insulin receptor cDNA: the structural basis for\nRT   hormone-activated transmembrane signalling";\nRL   Cell 40(4):747-758(1985).\nXX\nDR   MD5; e4e6ebf2e723a500c1dd62385c279351.\nDR   Ensembl-Gn; ENSG00000171105; homo_sapiens.\nDR   Ensembl-Tr; ENST00000302850; homo_sapiens.\nDR   Ensembl-Tr; ENST00000341500; homo_sapiens.\nDR   EuropePMC; PMC2739203; 19682364.\nDR   EuropePMC; PMC3164640; 21909271.\nDR   EuropePMC; PMC452597; 15146055.\nXX\nCC   [1] suggests that the insulin receptor may be the cellular homolog\nCC   of the v-ros transforming (oncogene) protein.  [1] notes\nCC   similarities between the insulin receptor and several growth factor\nCC   receptors and oncogenes.  Insulin receptor is a heterodimer\nCC   consisting of 2 alpha and 2 beta subunits.  Beta-prime may be a\nCC   cleavage product produced upon binding of insulin.  [1] suggests\nCC   that translation may begin at the \'atg\' start codon at positions\nCC   79-81 with protein cleavage occurring after position 120 to yield\nCC   the signal peptide.  [1] gives illustrations of the various domains\nCC   present in the protein.  A draft entry and sequence for [1] in\nCC   computer-readable form were kindly provided by K. 
Jarnagin\nCC   (30-JUL-1985).\nXX\nFH   Key             Location/Qualifiers\nFH\nFT   source          1..4723\nFT                   /organism="Homo sapiens"\nFT                   /map="19p13.3-p13.2"\nFT                   /mol_type="mRNA"\nFT                   /db_xref="taxon:9606"\nFT   sig_peptide     137..219\nFT                   /note="insulin receptor signal peptide"\nFT   CDS             139..4287\nFT                   /codon_start=1\nFT                   /gene="INSR"\nFT                   /note="insulin receptor precursor"\nFT                   /db_xref="GOA:P06213"\nFT                   /db_xref="H-InvDB:HIT000194074.15"\nFT                   /db_xref="HGNC:HGNC:6091"\nFT                   /db_xref="InterPro:IPR000494"\nFT                   /db_xref="InterPro:IPR000719"\nFT                   /db_xref="InterPro:IPR001245"\nFT                   /db_xref="InterPro:IPR002011"\nFT                   /db_xref="InterPro:IPR003961"\nFT                   /db_xref="InterPro:IPR006211"\nFT                   /db_xref="InterPro:IPR006212"\nFT                   /db_xref="InterPro:IPR008266"\nFT                   /db_xref="InterPro:IPR009030"\nFT                   /db_xref="InterPro:IPR011009"\nFT                   /db_xref="InterPro:IPR013783"\nFT                   /db_xref="InterPro:IPR016246"\nFT                   /db_xref="InterPro:IPR017441"\nFT                   /db_xref="InterPro:IPR020635"\nFT                   /db_xref="InterPro:IPR032675"\nFT                   /db_xref="PDB:1GAG"\nFT                   /db_xref="PDB:1I44"\nFT                   /db_xref="PDB:1IR3"\nFT                   /db_xref="PDB:1IRK"\nFT                   /db_xref="PDB:1P14"\nFT                   /db_xref="PDB:1RQQ"\nFT                   /db_xref="PDB:2AUH"\nFT                   /db_xref="PDB:2B4S"\nFT                   /db_xref="PDB:2HR7"\nFT                   /db_xref="PDB:2MFR"\nFT                   /db_xref="PDB:2Z8C"\nFT                   /db_xref="PDB:3BU3"\nFT                   
/db_xref="PDB:3BU5"\nFT                   /db_xref="PDB:3BU6"\nFT                   /db_xref="PDB:3EKK"\nFT                   /db_xref="PDB:3EKN"\nFT                   /db_xref="PDB:3ETA"\nFT                   /db_xref="PDB:3W11"\nFT                   /db_xref="PDB:3W12"\nFT                   /db_xref="PDB:3W13"\nFT                   /db_xref="PDB:3W14"\nFT                   /db_xref="PDB:4IBM"\nFT                   /db_xref="PDB:4OGA"\nFT                   /db_xref="PDB:4XLV"\nFT                   /db_xref="PDB:4XSS"\nFT                   /db_xref="PDB:4XST"\nFT                   /db_xref="PDB:4ZXB"\nFT                   /db_xref="PDB:5E1S"\nFT                   /db_xref="PDB:5HHW"\nFT                   /db_xref="UniProtKB/Swiss-Prot:P06213"\nFT                   /protein_id="AAA59174.1"\nFT                   /translation="MGTGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLT\nFT                   RLHELENCSVIEGHLQILLMFKTRPEDFRDLSFPKLIMITDYLLLFRVYGLESLKDLFP\nFT                   NLTVIRGSRLFFNYALVIFEMVHLKELGLYNLMNITRGSVRIEKNNELCYLATIDWSRI\nFT                   LDSVEDNHIVLNKDDNEECGDICPGTAKGKTNCPATVINGQFVERCWTHSHCQKVCPTI\nFT                   CKSHGCTAEGLCCHSECLGNCSQPDDPTKCVACRNFYLDGRCVETCPPPYYHFQDWRCV\nFT                   NFSFCQDLHHKCKNSRRQGCHQYVIHNNKCIPECPSGYTMNSSNLLCTPCLGPCPKVCH\nFT                   LLEGEKTIDSVTSAQELRGCTVINGSLIINIRGGNNLAAELEANLGLIEEISGYLKIRR\nFT                   SYALVSLSFFRKLRLIRGETLEIGNYSFYALDNQNLRQLWDWSKHNLTTTQGKLFFHYN\nFT                   PKLCLSEIHKMEEVSGTKGRQERNDIALKTNGDKASCENELLKFSYIRTSFDKILLRWE\nFT                   PYWPPDFRDLLGFMLFYKEAPYQNVTEFDGQDACGSNSWTVVDIDPPLRSNDPKSQNHP\nFT                   GWLMRGLKPWTQYAIFVKTLVTFSDERRTYGAKSDIIYVQTDATNPSVPLDPISVSNSS\nFT                   SQIILKWKPPSDPNGNITHYLVFWERQAEDSELFELDYCLKGLKLPSRTWSPPFESEDS\nFT                   QKHNQSEYEDSAGECCSCPKTDSQILKELEESSFRKTFEDYLHNVVFVPRKTSSGTGAE\nFT                   DPRPSRKRRSLGDVGNVTVAVPTVAAFPNTSSTSVPTSPEEHRPFEKVVNKESLVISGL\nFT                   
RHFTGYRIELQACNQDTPEERCSVAAYVSARTMPEAKADDIVGPVTHEIFENNVVHLMW\nFT                   QEPKEPNGLIVLYEVSYRRYGDEELHLCVSRKHFALERGCRLRGLSPGNYSVRIRATSL\nFT                   AGNGSWTEPTYFYVTDYLDVPSNIAKIIIGPLIFVFLFSVVIGSIYLFLRKRQPDGPLG\nFT                   PLYASSNPEYLSASDVFPCSVYVPDEWEVSREKITLLRELGQGSFGMVYEGNARDIIKG\nFT                   EAETRVAVKTVNESASLRERIEFLNEASVMKGFTCHHVVRLLGVVSKGQPTLVVMELMA\nFT                   HGDLKSYLRSLRPEAENNPGRPPPTLQEMIQMAAEIADGMAYLNAKKFVHRDLAARNCM\nFT                   VAHDFTVKIGDFGMTRDIYETDYYRKGGKGLLPVRWMAPESLKDGVFTTSSDMWSFGVV\nFT                   LWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDNCPERVTDLMRMCWQFNPKMRPTFL\nFT                   EIVNLLKDDLHPSFPEVSFFHSEENKAPESEELEMEFEDMENVPLDRSSHCQREEAGGR\nFT                   DGGSSLGFKRSYEEHIPYTHMNGGKKNGRILTLPRSNPS"\nFT   mat_peptide     220..2424\nFT                   /gene="INSR"\nFT                   /note="insulin receptor alpha subunit"\nFT   mat_peptide     2425..4284\nFT                   /gene="INSR"\nFT                   /note="insulin receptor beta subunit"\nFT   mat_peptide     2425..2469\nFT                   /partial\nFT                   /gene="INSR"\nFT                   /note="insulin receptor beta-prime subunit"\nXX\nSQ   Sequence 4723 BP; 1068 A; 1298 C; 1311 G; 1046 T; 0 other;\n     ggggggctgc gcggccgggt cggtgcgcac acgagaagga cgcgcggccc ccagcgctct        60\n     tgggggccgc ctcggagcat gacccccgcg ggccagcgcc gcgcgcctga tccgaggaga       120\n     ccccgcgctc ccgcagccat gggcaccggg ggccggcggg gggcggcggc cgcgccgctg       180\n     ctggtggcgg tggccgcgct gctactgggc gccgcgggcc acctgtaccc cggagaggtg       240\n     tgtcccggca tggatatccg gaacaacctc actaggttgc atgagctgga gaattgctct       300\n     gtcatcgaag gacacttgca gatactcttg atgttcaaaa cgaggcccga agatttccga       360\n     gacctcagtt tccccaaact catcatgatc actgattact tgctgctctt ccgggtctat       420\n     gggctcgaga gcctgaagga cctgttcccc aacctcacgg tcatccgggg atcacgactg       480\n     ttctttaact acgcgctggt catcttcgag atggttcacc tcaaggaact cggcctctac    
   540\n     aacctgatga acatcacccg gggttctgtc cgcatcgaga agaacaatga gctctgttac       600\n     ttggccacta tcgactggtc ccgtatcctg gattccgtgg aggataatca catcgtgttg       660\n     aacaaagatg acaacgagga gtgtggagac atctgtccgg gtaccgcgaa gggcaagacc       720\n     aactgccccg ccaccgtcat caacgggcag tttgtcgaac gatgttggac tcatagtcac       780\n     tgccagaaag tttgcccgac catctgtaag tcacacggct gcaccgccga aggcctctgt       840\n     tgccacagcg agtgcctggg caactgttct cagcccgacg accccaccaa gtgcgtggcc       900\n     tgccgcaact tctacctgga cggcaggtgt gtggagacct gcccgccccc gtactaccac       960\n     ttccaggact ggcgctgtgt gaacttcagc ttctgccagg acctgcacca caaatgcaag      1020\n     aactcgcgga ggcagggctg ccaccaatac gtcattcaca acaacaagtg catccctgag      1080\n     tgtccctccg ggtacacgat gaattccagc aacttgctgt gcaccccatg cctgggtccc      1140\n     tgtcccaagg tgtgccacct cctagaaggc gagaagacca tcgactcggt gacgtctgcc      1200\n     caggagctcc gaggatgcac cgtcatcaac gggagtctga tcatcaacat tcgaggaggc      1260\n     aacaatctgg cagctgagct agaagccaac ctcggcctca ttgaagaaat ttcagggtat      1320\n     ctaaaaatcc gccgatccta cgctctggtg tcactttcct tcttccggaa gttacgtctg      1380\n     attcgaggag agaccttgga aattgggaac tactccttct atgccttgga caaccagaac      1440\n     ctaaggcagc tctgggactg gagcaaacac aacctcacca ccactcaggg gaaactcttc      1500\n     ttccactata accccaaact ctgcttgtca gaaatccaca agatggaaga agtttcagga      1560\n     accaaggggc gccaggagag aaacgacatt gccctgaaga ccaatgggga caaggcatcc      1620\n     tgtgaaaatg agttacttaa attttcttac attcggacat cttttgacaa gatcttgctg      1680\n     agatgggagc cgtactggcc ccccgacttc cgagacctct tggggttcat gctgttctac      1740\n     aaagaggccc cttatcagaa tgtgacggag ttcgatgggc aggatgcgtg tggttccaac      1800\n     agttggacgg tggtagacat tgacccaccc ctgaggtcca acgaccccaa atcacagaac      1860\n     cacccagggt ggctgatgcg gggtctcaag ccctggaccc agtatgccat ctttgtgaag      1920\n     accctggtca ccttttcgga tgaacgccgg acctatgggg ccaagagtga catcatttat      1980\n     gtccagacag 
atgccaccaa cccctctgtg cccctggatc caatctcagt gtctaactca      2040\n     tcatcccaga ttattctgaa gtggaaacca ccctccgacc ccaatggcaa catcacccac      2100\n     tacctggttt tctgggagag gcaggcggaa gacagtgagc tgttcgagct ggattattgc      2160\n     ctcaaagggc tgaagctgcc ctcgaggacc tggtctccac cattcgagtc tgaagattct      2220\n     cagaagcaca accagagtga gtatgaggat tcggccggcg aatgctgctc ctgtccaaag      2280\n     acagactctc agatcctgaa ggagctggag gagtcctcgt ttaggaagac gtttgaggat      2340\n     tacctgcaca acgtggtttt cgtccccaga aaaacctctt caggcactgg tgccgaggac      2400\n     cctaggccat ctcggaaacg caggtccctt ggcgatgttg ggaatgtgac ggtggccgtg      2460\n     cccacggtgg cagctttccc caacacttcc tcgaccagcg tgcccacgag tccggaggag      2520\n     cacaggcctt ttgagaaggt ggtgaacaag gagtcgctgg tcatctccgg cttgcgacac      2580\n     ttcacgggct atcgcatcga gctgcaggct tgcaaccagg acacccctga ggaacggtgc      2640\n     agtgtggcag cctacgtcag tgcgaggacc atgcctgaag ccaaggctga tgacattgtt      2700\n     ggccctgtga cgcatgaaat ctttgagaac aacgtcgtcc acttgatgtg gcaggagccg      2760\n     aaggagccca atggtctgat cgtgctgtat gaagtgagtt atcggcgata tggtgatgag      2820\n     gagctgcatc tctgcgtctc ccgcaagcac ttcgctctgg aacggggctg caggctgcgt      2880\n     gggctgtcac cggggaacta cagcgtgcga atccgggcca cctcccttgc gggcaacggc      2940\n     tcttggacgg aacccaccta tttctacgtg acagactatt tagacgtccc gtcaaatatt      3000\n     gcaaaaatta tcatcggccc cctcatcttt gtctttctct tcagtgttgt gattggaagt      3060\n     atttatctat tcctgagaaa gaggcagcca gatgggccgc tgggaccgct ttacgcttct      3120\n     tcaaaccctg agtatctcag tgccagtgat gtgtttccat gctctgtgta cgtgccggac      3180\n     gagtgggagg tgtctcgaga gaagatcacc ctccttcgag agctggggca gggctccttc      3240\n     ggcatggtgt atgagggcaa tgccagggac atcatcaagg gtgaggcaga gacccgcgtg      3300\n     gcggtgaaga cggtcaacga gtcagccagt ctccgagagc ggattgagtt cctcaatgag      3360\n     gcctcggtca tgaagggctt cacctgccat cacgtggtgc gcctcctggg agtggtgtcc      3420\n     aagggccagc ccacgctggt ggtgatggag 
ctgatggctc acggagacct gaagagctac      3480\n     ctccgttctc tgcggccaga ggctgagaat aatcctggcc gccctccccc tacccttcaa      3540\n     gagatgattc agatggcggc agagattgct gacgggatgg cctacctgaa cgccaagaag      3600\n     tttgtgcatc gggacctggc agcgagaaac tgcatggtcg cccatgattt tactgtcaaa      3660\n     attggagact ttggaatgac cagagacatc tatgaaacgg attactaccg gaaagggggc      3720\n     aagggtctgc tccctgtacg gtggatggca ccggagtccc tgaaggatgg ggtcttcacc      3780\n     acttcttctg acatgtggtc ctttggcgtg gtcctttggg aaatcaccag cttggcagaa      3840\n     cagccttacc aaggcctgtc taatgaacag gtgttgaaat ttgtcatgga tggagggtat      3900\n     ctggatcaac ccgacaactg tccagagaga gtcactgacc tcatgcgcat gtgctggcaa      3960\n     ttcaacccca agatgaggcc aaccttcctg gagattgtca acctgctcaa ggacgacctg      4020\n     caccccagct ttccagaggt gtcgttcttc cacagcgagg agaacaaggc tcccgagagt      4080\n     gaggagctgg agatggagtt tgaggacatg gagaatgtgc ccctggaccg ttcctcgcac      4140\n     tgtcagaggg aggaggcggg gggccgggat ggagggtcct cgctgggttt caagcggagc      4200\n     tacgaggaac acatccctta cacacacatg aacggaggca agaaaaacgg gcggattctg      4260\n     accttgcctc ggtccaatcc ttcctaacag tgcctaccgt ggcgggggcg ggcaggggtt      4320\n     cccattttcg ctttcctctg gtttgaaagc ctctggaaaa ctcaggattc tcacgactct      4380\n     accatgtcca gtggagttca gagatcgttc ctatacattt ctgttcatct taaggtggac      4440\n     tcgtttggtt accaatttaa ctagtcctgc agaggattta actgtgaacc tggagggcaa      4500\n     ggggtttcca cagttgctgc tcctttgggg caacgacggt ttcaaaccag gattttgtgt      4560\n     tttttcgttc cccccacccg cccccagcag atggaaagaa agcacctgtt tttacaaatt      4620\n     cttttttttt tttttttttt tttttttttg ctggtgtctg agcttcagta taaaagacaa      4680\n     aacttcctgt ttgtggaaca aaatttcgaa agaaaaaacc aaa                        4723\n//\n'

Extracting the unique identifier and the organism of the EMBL entry

Extract the unique identifier and the organism from the ID record into the variables identifier and organism.

ID   M10051; SV 1; linear; mRNA; STD; HUM; 4723 BP.

Extracting the organism.


In [6]:
s = re.search(r'([\w\s]+;){5}\s+(\w+);', file_str, re.M)
organism = s.group(2)

In [7]:
organism


Out[7]:
'HUM'

Extracting the identifier.


In [8]:
s = re.search(r'^ID\s+(\w+);', file_str, re.M)
identifier = s.group(1)

In [9]:
identifier


Out[9]:
'M10051'
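
As an alternative sketch, not the approach used in this solution, both values could be taken from the ID record with a single anchored pattern and named groups. The names id_record and id_pattern are hypothetical, and id_record holds only the first lines of the entry.

```python
import re

# First lines of the entry, enough to contain the ID record.
id_record = 'ID   M10051; SV 1; linear; mRNA; STD; HUM; 4723 BP.\nXX\nAC   M10051;\n'

id_pattern = re.compile(
    r'^ID\s+(?P<identifier>\w+);'   # identifier right after the ID code
    r'(?:[^;\n]*;){4}'              # skip the next four fields (SV 1, linear, mRNA, STD)
    r'\s*(?P<organism>\w+);',       # sixth field, e.g. HUM
    re.M)

m = id_pattern.search(id_record)
identifier, organism = m.group('identifier'), m.group('organism')
```

Anchoring the pattern on ^ID keeps it from matching any other record that happens to contain several semicolons.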

Producing the nucleotide sequence in FASTA format

Extract the records of the nucleotide sequence into the list seq_row_list, excluding only the final integer (keeping the leading spaces and the spaces before the integer).

tgggggccgc ctcggagcat gacccccgcg ggccagcgcc gcgcgcctga tccgaggaga       120

In [10]:
seq_row_list = re.findall(r'^\W{2}(\D+)\d+', file_str, re.M)

In [11]:
seq_row_list


Out[11]:
['   ggggggctgc gcggccgggt cggtgcgcac acgagaagga cgcgcggccc ccagcgctct        ',
 '   tgggggccgc ctcggagcat gacccccgcg ggccagcgcc gcgcgcctga tccgaggaga       ',
 '   ccccgcgctc ccgcagccat gggcaccggg ggccggcggg gggcggcggc cgcgccgctg       ',
 '   ctggtggcgg tggccgcgct gctactgggc gccgcgggcc acctgtaccc cggagaggtg       ',
 '   tgtcccggca tggatatccg gaacaacctc actaggttgc atgagctgga gaattgctct       ',
 '   gtcatcgaag gacacttgca gatactcttg atgttcaaaa cgaggcccga agatttccga       ',
 '   gacctcagtt tccccaaact catcatgatc actgattact tgctgctctt ccgggtctat       ',
 '   gggctcgaga gcctgaagga cctgttcccc aacctcacgg tcatccgggg atcacgactg       ',
 '   ttctttaact acgcgctggt catcttcgag atggttcacc tcaaggaact cggcctctac       ',
 '   aacctgatga acatcacccg gggttctgtc cgcatcgaga agaacaatga gctctgttac       ',
 '   ttggccacta tcgactggtc ccgtatcctg gattccgtgg aggataatca catcgtgttg       ',
 '   aacaaagatg acaacgagga gtgtggagac atctgtccgg gtaccgcgaa gggcaagacc       ',
 '   aactgccccg ccaccgtcat caacgggcag tttgtcgaac gatgttggac tcatagtcac       ',
 '   tgccagaaag tttgcccgac catctgtaag tcacacggct gcaccgccga aggcctctgt       ',
 '   tgccacagcg agtgcctggg caactgttct cagcccgacg accccaccaa gtgcgtggcc       ',
 '   tgccgcaact tctacctgga cggcaggtgt gtggagacct gcccgccccc gtactaccac       ',
 '   ttccaggact ggcgctgtgt gaacttcagc ttctgccagg acctgcacca caaatgcaag      ',
 '   aactcgcgga ggcagggctg ccaccaatac gtcattcaca acaacaagtg catccctgag      ',
 '   tgtccctccg ggtacacgat gaattccagc aacttgctgt gcaccccatg cctgggtccc      ',
 '   tgtcccaagg tgtgccacct cctagaaggc gagaagacca tcgactcggt gacgtctgcc      ',
 '   caggagctcc gaggatgcac cgtcatcaac gggagtctga tcatcaacat tcgaggaggc      ',
 '   aacaatctgg cagctgagct agaagccaac ctcggcctca ttgaagaaat ttcagggtat      ',
 '   ctaaaaatcc gccgatccta cgctctggtg tcactttcct tcttccggaa gttacgtctg      ',
 '   attcgaggag agaccttgga aattgggaac tactccttct atgccttgga caaccagaac      ',
 '   ctaaggcagc tctgggactg gagcaaacac aacctcacca ccactcaggg gaaactcttc      ',
 '   ttccactata accccaaact ctgcttgtca gaaatccaca agatggaaga agtttcagga      ',
 '   accaaggggc gccaggagag aaacgacatt gccctgaaga ccaatgggga caaggcatcc      ',
 '   tgtgaaaatg agttacttaa attttcttac attcggacat cttttgacaa gatcttgctg      ',
 '   agatgggagc cgtactggcc ccccgacttc cgagacctct tggggttcat gctgttctac      ',
 '   aaagaggccc cttatcagaa tgtgacggag ttcgatgggc aggatgcgtg tggttccaac      ',
 '   agttggacgg tggtagacat tgacccaccc ctgaggtcca acgaccccaa atcacagaac      ',
 '   cacccagggt ggctgatgcg gggtctcaag ccctggaccc agtatgccat ctttgtgaag      ',
 '   accctggtca ccttttcgga tgaacgccgg acctatgggg ccaagagtga catcatttat      ',
 '   gtccagacag atgccaccaa cccctctgtg cccctggatc caatctcagt gtctaactca      ',
 '   tcatcccaga ttattctgaa gtggaaacca ccctccgacc ccaatggcaa catcacccac      ',
 '   tacctggttt tctgggagag gcaggcggaa gacagtgagc tgttcgagct ggattattgc      ',
 '   ctcaaagggc tgaagctgcc ctcgaggacc tggtctccac cattcgagtc tgaagattct      ',
 '   cagaagcaca accagagtga gtatgaggat tcggccggcg aatgctgctc ctgtccaaag      ',
 '   acagactctc agatcctgaa ggagctggag gagtcctcgt ttaggaagac gtttgaggat      ',
 '   tacctgcaca acgtggtttt cgtccccaga aaaacctctt caggcactgg tgccgaggac      ',
 '   cctaggccat ctcggaaacg caggtccctt ggcgatgttg ggaatgtgac ggtggccgtg      ',
 '   cccacggtgg cagctttccc caacacttcc tcgaccagcg tgcccacgag tccggaggag      ',
 '   cacaggcctt ttgagaaggt ggtgaacaag gagtcgctgg tcatctccgg cttgcgacac      ',
 '   ttcacgggct atcgcatcga gctgcaggct tgcaaccagg acacccctga ggaacggtgc      ',
 '   agtgtggcag cctacgtcag tgcgaggacc atgcctgaag ccaaggctga tgacattgtt      ',
 '   ggccctgtga cgcatgaaat ctttgagaac aacgtcgtcc acttgatgtg gcaggagccg      ',
 '   aaggagccca atggtctgat cgtgctgtat gaagtgagtt atcggcgata tggtgatgag      ',
 '   gagctgcatc tctgcgtctc ccgcaagcac ttcgctctgg aacggggctg caggctgcgt      ',
 '   gggctgtcac cggggaacta cagcgtgcga atccgggcca cctcccttgc gggcaacggc      ',
 '   tcttggacgg aacccaccta tttctacgtg acagactatt tagacgtccc gtcaaatatt      ',
 '   gcaaaaatta tcatcggccc cctcatcttt gtctttctct tcagtgttgt gattggaagt      ',
 '   atttatctat tcctgagaaa gaggcagcca gatgggccgc tgggaccgct ttacgcttct      ',
 '   tcaaaccctg agtatctcag tgccagtgat gtgtttccat gctctgtgta cgtgccggac      ',
 '   gagtgggagg tgtctcgaga gaagatcacc ctccttcgag agctggggca gggctccttc      ',
 '   ggcatggtgt atgagggcaa tgccagggac atcatcaagg gtgaggcaga gacccgcgtg      ',
 '   gcggtgaaga cggtcaacga gtcagccagt ctccgagagc ggattgagtt cctcaatgag      ',
 '   gcctcggtca tgaagggctt cacctgccat cacgtggtgc gcctcctggg agtggtgtcc      ',
 '   aagggccagc ccacgctggt ggtgatggag ctgatggctc acggagacct gaagagctac      ',
 '   ctccgttctc tgcggccaga ggctgagaat aatcctggcc gccctccccc tacccttcaa      ',
 '   gagatgattc agatggcggc agagattgct gacgggatgg cctacctgaa cgccaagaag      ',
 '   tttgtgcatc gggacctggc agcgagaaac tgcatggtcg cccatgattt tactgtcaaa      ',
 '   attggagact ttggaatgac cagagacatc tatgaaacgg attactaccg gaaagggggc      ',
 '   aagggtctgc tccctgtacg gtggatggca ccggagtccc tgaaggatgg ggtcttcacc      ',
 '   acttcttctg acatgtggtc ctttggcgtg gtcctttggg aaatcaccag cttggcagaa      ',
 '   cagccttacc aaggcctgtc taatgaacag gtgttgaaat ttgtcatgga tggagggtat      ',
 '   ctggatcaac ccgacaactg tccagagaga gtcactgacc tcatgcgcat gtgctggcaa      ',
 '   ttcaacccca agatgaggcc aaccttcctg gagattgtca acctgctcaa ggacgacctg      ',
 '   caccccagct ttccagaggt gtcgttcttc cacagcgagg agaacaaggc tcccgagagt      ',
 '   gaggagctgg agatggagtt tgaggacatg gagaatgtgc ccctggaccg ttcctcgcac      ',
 '   tgtcagaggg aggaggcggg gggccgggat ggagggtcct cgctgggttt caagcggagc      ',
 '   tacgaggaac acatccctta cacacacatg aacggaggca agaaaaacgg gcggattctg      ',
 '   accttgcctc ggtccaatcc ttcctaacag tgcctaccgt ggcgggggcg ggcaggggtt      ',
 '   cccattttcg ctttcctctg gtttgaaagc ctctggaaaa ctcaggattc tcacgactct      ',
 '   accatgtcca gtggagttca gagatcgttc ctatacattt ctgttcatct taaggtggac      ',
 '   tcgtttggtt accaatttaa ctagtcctgc agaggattta actgtgaacc tggagggcaa      ',
 '   ggggtttcca cagttgctgc tcctttgggg caacgacggt ttcaaaccag gattttgtgt      ',
 '   tttttcgttc cccccacccg cccccagcag atggaaagaa agcacctgtt tttacaaatt      ',
 '   cttttttttt tttttttttt tttttttttg ctggtgtctg agcttcagta taaaagacaa      ',
 '   aacttcctgt ttgtggaaca aaatttcgaa agaaaaaacc aaa                        ']

Extract from seq_row_list the list seq_chunk_list containing the chunks (of length at most 10) of the nucleotide sequence.

NOTE: the element seq_chunk_list[i] is a nested list containing the chunks (six, except possibly for the last record) of the i-th record of seq_row_list.


In [12]:
seq_chunk_list = [re.findall(r'\w+', row) for row in seq_row_list]
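
To make the nested structure concrete, the sketch below applies the same expression to two rows shaped like the elements of seq_row_list (the first and last rows of the entry), then shows one possible way to flatten the result into a single string. The names toy_rows, toy_chunks and toy_sequence are hypothetical.

```python
import re

# Two rows shaped like the elements of seq_row_list (trailing integer already removed).
toy_rows = [
    '   ggggggctgc gcggccgggt cggtgcgcac acgagaagga cgcgcggccc ccagcgctct        ',
    '   aacttcctgt ttgtggaaca aaatttcgaa agaaaaaacc aaa                        ',
]

# One inner list of chunks per row.
toy_chunks = [re.findall(r'\w+', row) for row in toy_rows]

# Flatten the nested list and join the chunks into one string.
toy_sequence = ''.join(chunk for row in toy_chunks for chunk in row)
```

The first row yields six chunks of 10 bases, while the last row of the entry yields only five, the final one being aaa.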

In [13]:
seq_chunk_list


Out[13]:
[['ggggggctgc',
  'gcggccgggt',
  'cggtgcgcac',
  'acgagaagga',
  'cgcgcggccc',
  'ccagcgctct'],
 ['tgggggccgc',
  'ctcggagcat',
  'gacccccgcg',
  'ggccagcgcc',
  'gcgcgcctga',
  'tccgaggaga'],
 ['ccccgcgctc',
  'ccgcagccat',
  'gggcaccggg',
  'ggccggcggg',
  'gggcggcggc',
  'cgcgccgctg'],
 ['ctggtggcgg',
  'tggccgcgct',
  'gctactgggc',
  'gccgcgggcc',
  'acctgtaccc',
  'cggagaggtg'],
 ['tgtcccggca',
  'tggatatccg',
  'gaacaacctc',
  'actaggttgc',
  'atgagctgga',
  'gaattgctct'],
 ['gtcatcgaag',
  'gacacttgca',
  'gatactcttg',
  'atgttcaaaa',
  'cgaggcccga',
  'agatttccga'],
 ['gacctcagtt',
  'tccccaaact',
  'catcatgatc',
  'actgattact',
  'tgctgctctt',
  'ccgggtctat'],
 ['gggctcgaga',
  'gcctgaagga',
  'cctgttcccc',
  'aacctcacgg',
  'tcatccgggg',
  'atcacgactg'],
 ['ttctttaact',
  'acgcgctggt',
  'catcttcgag',
  'atggttcacc',
  'tcaaggaact',
  'cggcctctac'],
 ['aacctgatga',
  'acatcacccg',
  'gggttctgtc',
  'cgcatcgaga',
  'agaacaatga',
  'gctctgttac'],
 ['ttggccacta',
  'tcgactggtc',
  'ccgtatcctg',
  'gattccgtgg',
  'aggataatca',
  'catcgtgttg'],
 ['aacaaagatg',
  'acaacgagga',
  'gtgtggagac',
  'atctgtccgg',
  'gtaccgcgaa',
  'gggcaagacc'],
 ['aactgccccg',
  'ccaccgtcat',
  'caacgggcag',
  'tttgtcgaac',
  'gatgttggac',
  'tcatagtcac'],
 ['tgccagaaag',
  'tttgcccgac',
  'catctgtaag',
  'tcacacggct',
  'gcaccgccga',
  'aggcctctgt'],
 ['tgccacagcg',
  'agtgcctggg',
  'caactgttct',
  'cagcccgacg',
  'accccaccaa',
  'gtgcgtggcc'],
 ['tgccgcaact',
  'tctacctgga',
  'cggcaggtgt',
  'gtggagacct',
  'gcccgccccc',
  'gtactaccac'],
 ['ttccaggact',
  'ggcgctgtgt',
  'gaacttcagc',
  'ttctgccagg',
  'acctgcacca',
  'caaatgcaag'],
 ['aactcgcgga',
  'ggcagggctg',
  'ccaccaatac',
  'gtcattcaca',
  'acaacaagtg',
  'catccctgag'],
 ['tgtccctccg',
  'ggtacacgat',
  'gaattccagc',
  'aacttgctgt',
  'gcaccccatg',
  'cctgggtccc'],
 ['tgtcccaagg',
  'tgtgccacct',
  'cctagaaggc',
  'gagaagacca',
  'tcgactcggt',
  'gacgtctgcc'],
 ['caggagctcc',
  'gaggatgcac',
  'cgtcatcaac',
  'gggagtctga',
  'tcatcaacat',
  'tcgaggaggc'],
 ['aacaatctgg',
  'cagctgagct',
  'agaagccaac',
  'ctcggcctca',
  'ttgaagaaat',
  'ttcagggtat'],
 ['ctaaaaatcc',
  'gccgatccta',
  'cgctctggtg',
  'tcactttcct',
  'tcttccggaa',
  'gttacgtctg'],
 ['attcgaggag',
  'agaccttgga',
  'aattgggaac',
  'tactccttct',
  'atgccttgga',
  'caaccagaac'],
 ['ctaaggcagc',
  'tctgggactg',
  'gagcaaacac',
  'aacctcacca',
  'ccactcaggg',
  'gaaactcttc'],
 ['ttccactata',
  'accccaaact',
  'ctgcttgtca',
  'gaaatccaca',
  'agatggaaga',
  'agtttcagga'],
 ['accaaggggc',
  'gccaggagag',
  'aaacgacatt',
  'gccctgaaga',
  'ccaatgggga',
  'caaggcatcc'],
 ['tgtgaaaatg',
  'agttacttaa',
  'attttcttac',
  'attcggacat',
  'cttttgacaa',
  'gatcttgctg'],
 ['agatgggagc',
  'cgtactggcc',
  'ccccgacttc',
  'cgagacctct',
  'tggggttcat',
  'gctgttctac'],
 ['aaagaggccc',
  'cttatcagaa',
  'tgtgacggag',
  'ttcgatgggc',
  'aggatgcgtg',
  'tggttccaac'],
 ['agttggacgg',
  'tggtagacat',
  'tgacccaccc',
  'ctgaggtcca',
  'acgaccccaa',
  'atcacagaac'],
 ['cacccagggt',
  'ggctgatgcg',
  'gggtctcaag',
  'ccctggaccc',
  'agtatgccat',
  'ctttgtgaag'],
 ['accctggtca',
  'ccttttcgga',
  'tgaacgccgg',
  'acctatgggg',
  'ccaagagtga',
  'catcatttat'],
 ['gtccagacag',
  'atgccaccaa',
  'cccctctgtg',
  'cccctggatc',
  'caatctcagt',
  'gtctaactca'],
 ['tcatcccaga',
  'ttattctgaa',
  'gtggaaacca',
  'ccctccgacc',
  'ccaatggcaa',
  'catcacccac'],
 ['tacctggttt',
  'tctgggagag',
  'gcaggcggaa',
  'gacagtgagc',
  'tgttcgagct',
  'ggattattgc'],
 ['ctcaaagggc',
  'tgaagctgcc',
  'ctcgaggacc',
  'tggtctccac',
  'cattcgagtc',
  'tgaagattct'],
 ['cagaagcaca',
  'accagagtga',
  'gtatgaggat',
  'tcggccggcg',
  'aatgctgctc',
  'ctgtccaaag'],
 ['acagactctc',
  'agatcctgaa',
  'ggagctggag',
  'gagtcctcgt',
  'ttaggaagac',
  'gtttgaggat'],
 ['tacctgcaca',
  'acgtggtttt',
  'cgtccccaga',
  'aaaacctctt',
  'caggcactgg',
  'tgccgaggac'],
 ['cctaggccat',
  'ctcggaaacg',
  'caggtccctt',
  'ggcgatgttg',
  'ggaatgtgac',
  'ggtggccgtg'],
 ['cccacggtgg',
  'cagctttccc',
  'caacacttcc',
  'tcgaccagcg',
  'tgcccacgag',
  'tccggaggag'],
 ['cacaggcctt',
  'ttgagaaggt',
  'ggtgaacaag',
  'gagtcgctgg',
  'tcatctccgg',
  'cttgcgacac'],
 ['ttcacgggct',
  'atcgcatcga',
  'gctgcaggct',
  'tgcaaccagg',
  'acacccctga',
  'ggaacggtgc'],
 ['agtgtggcag',
  'cctacgtcag',
  'tgcgaggacc',
  'atgcctgaag',
  'ccaaggctga',
  'tgacattgtt'],
 ['ggccctgtga',
  'cgcatgaaat',
  'ctttgagaac',
  'aacgtcgtcc',
  'acttgatgtg',
  'gcaggagccg'],
 ['aaggagccca',
  'atggtctgat',
  'cgtgctgtat',
  'gaagtgagtt',
  'atcggcgata',
  'tggtgatgag'],
 ['gagctgcatc',
  'tctgcgtctc',
  'ccgcaagcac',
  'ttcgctctgg',
  'aacggggctg',
  'caggctgcgt'],
 ['gggctgtcac',
  'cggggaacta',
  'cagcgtgcga',
  'atccgggcca',
  'cctcccttgc',
  'gggcaacggc'],
 ['tcttggacgg',
  'aacccaccta',
  'tttctacgtg',
  'acagactatt',
  'tagacgtccc',
  'gtcaaatatt'],
 ['gcaaaaatta',
  'tcatcggccc',
  'cctcatcttt',
  'gtctttctct',
  'tcagtgttgt',
  'gattggaagt'],
 ['atttatctat',
  'tcctgagaaa',
  'gaggcagcca',
  'gatgggccgc',
  'tgggaccgct',
  'ttacgcttct'],
 ['tcaaaccctg',
  'agtatctcag',
  'tgccagtgat',
  'gtgtttccat',
  'gctctgtgta',
  'cgtgccggac'],
 ['gagtgggagg',
  'tgtctcgaga',
  'gaagatcacc',
  'ctccttcgag',
  'agctggggca',
  'gggctccttc'],
 ['ggcatggtgt',
  'atgagggcaa',
  'tgccagggac',
  'atcatcaagg',
  'gtgaggcaga',
  'gacccgcgtg'],
 ['gcggtgaaga',
  'cggtcaacga',
  'gtcagccagt',
  'ctccgagagc',
  'ggattgagtt',
  'cctcaatgag'],
 ['gcctcggtca',
  'tgaagggctt',
  'cacctgccat',
  'cacgtggtgc',
  'gcctcctggg',
  'agtggtgtcc'],
 ['aagggccagc',
  'ccacgctggt',
  'ggtgatggag',
  'ctgatggctc',
  'acggagacct',
  'gaagagctac'],
 ['ctccgttctc',
  'tgcggccaga',
  'ggctgagaat',
  'aatcctggcc',
  'gccctccccc',
  'tacccttcaa'],
 ['gagatgattc',
  'agatggcggc',
  'agagattgct',
  'gacgggatgg',
  'cctacctgaa',
  'cgccaagaag'],
 ['tttgtgcatc',
  'gggacctggc',
  'agcgagaaac',
  'tgcatggtcg',
  'cccatgattt',
  'tactgtcaaa'],
 ['attggagact',
  'ttggaatgac',
  'cagagacatc',
  'tatgaaacgg',
  'attactaccg',
  'gaaagggggc'],
 ['aagggtctgc',
  'tccctgtacg',
  'gtggatggca',
  'ccggagtccc',
  'tgaaggatgg',
  'ggtcttcacc'],
 ['acttcttctg',
  'acatgtggtc',
  'ctttggcgtg',
  'gtcctttggg',
  'aaatcaccag',
  'cttggcagaa'],
 ['cagccttacc',
  'aaggcctgtc',
  'taatgaacag',
  'gtgttgaaat',
  'ttgtcatgga',
  'tggagggtat'],
 ['ctggatcaac',
  'ccgacaactg',
  'tccagagaga',
  'gtcactgacc',
  'tcatgcgcat',
  'gtgctggcaa'],
 ['ttcaacccca',
  'agatgaggcc',
  'aaccttcctg',
  'gagattgtca',
  'acctgctcaa',
  'ggacgacctg'],
 ['caccccagct',
  'ttccagaggt',
  'gtcgttcttc',
  'cacagcgagg',
  'agaacaaggc',
  'tcccgagagt'],
 ['gaggagctgg',
  'agatggagtt',
  'tgaggacatg',
  'gagaatgtgc',
  'ccctggaccg',
  'ttcctcgcac'],
 ['tgtcagaggg',
  'aggaggcggg',
  'gggccgggat',
  'ggagggtcct',
  'cgctgggttt',
  'caagcggagc'],
 ['tacgaggaac',
  'acatccctta',
  'cacacacatg',
  'aacggaggca',
  'agaaaaacgg',
  'gcggattctg'],
 ['accttgcctc',
  'ggtccaatcc',
  'ttcctaacag',
  'tgcctaccgt',
  'ggcgggggcg',
  'ggcaggggtt'],
 ['cccattttcg',
  'ctttcctctg',
  'gtttgaaagc',
  'ctctggaaaa',
  'ctcaggattc',
  'tcacgactct'],
 ['accatgtcca',
  'gtggagttca',
  'gagatcgttc',
  'ctatacattt',
  'ctgttcatct',
  'taaggtggac'],
 ['tcgtttggtt',
  'accaatttaa',
  'ctagtcctgc',
  'agaggattta',
  'actgtgaacc',
  'tggagggcaa'],
 ['ggggtttcca',
  'cagttgctgc',
  'tcctttgggg',
  'caacgacggt',
  'ttcaaaccag',
  'gattttgtgt'],
 ['tttttcgttc',
  'cccccacccg',
  'cccccagcag',
  'atggaaagaa',
  'agcacctgtt',
  'tttacaaatt'],
 ['cttttttttt',
  'tttttttttt',
  'tttttttttg',
  'ctggtgtctg',
  'agcttcagta',
  'taaaagacaa'],
 ['aacttcctgt', 'ttgtggaaca', 'aaatttcgaa', 'agaaaaaacc', 'aaa']]

Concatenate the chunks of the nested list seq_chunk_list (one inner list per sequence row, each holding up to six 10-base chunks) to obtain the nucleotide sequence in the variable nucleotide_sequence.


In [14]:
nucleotide_sequence = ''.join(''.join(list_six_chunks) for list_six_chunks in seq_chunk_list)
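The same flattening can also be written with itertools.chain, which avoids the inner join. The snippet below is only a sketch on a toy list (toy_chunk_list and toy_sequence are illustrative names, not part of the exercise), reusing a few chunks from the last rows shown above.

```python
from itertools import chain

# Toy stand-in for seq_chunk_list: rows of 10-base chunks, last chunk shorter.
toy_chunk_list = [
    ['aacttcctgt', 'ttgtggaaca'],
    ['aaatttcgaa', 'aaa'],
]

# chain.from_iterable yields every chunk of every row, in order,
# so a single join is enough to rebuild the sequence.
toy_sequence = ''.join(chain.from_iterable(toy_chunk_list))
print(toy_sequence)
```

Both forms give the same string; the choice is mostly a matter of readability.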

Replace every t symbol in nucleotide_sequence with a u symbol.


In [15]:
nucleotide_sequence = re.sub('t', 'u', nucleotide_sequence)
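For a literal single-character substitution like this one, str.replace should behave the same as re.sub without involving the regex engine. A minimal sketch, assuming the sequence is entirely lowercase as in this entry (dna_to_rna is a hypothetical helper name):

```python
def dna_to_rna(sequence):
    """Return the sequence with every lowercase t replaced by u."""
    return sequence.replace('t', 'u')

# One 10-base chunk taken from the list shown earlier.
toy_rna = dna_to_rna('tacctggttt')
print(toy_rna)
```

If a file used uppercase bases, the uppercase T would have to be handled separately.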

In [16]:
nucleotide_sequence


Out[16]:
'ggggggcugcgcggccgggucggugcgcacacgagaaggacgcgcggcccccagcgcucuugggggccgccucggagcaugacccccgcgggccagcgccgcgcgccugauccgaggagaccccgcgcucccgcagccaugggcaccgggggccggcggggggcggcggccgcgccgcugcugguggcgguggccgcgcugcuacugggcgccgcgggccaccuguaccccggagaggugugucccggcauggauauccggaacaaccucacuagguugcaugagcuggagaauugcucugucaucgaaggacacuugcagauacucuugauguucaaaacgaggcccgaagauuuccgagaccucaguuuccccaaacucaucaugaucacugauuacuugcugcucuuccgggucuaugggcucgagagccugaaggaccuguuccccaaccucacggucauccggggaucacgacuguucuuuaacuacgcgcuggucaucuucgagaugguucaccucaaggaacucggccucuacaaccugaugaacaucacccgggguucuguccgcaucgagaagaacaaugagcucuguuacuuggccacuaucgacuggucccguauccuggauuccguggaggauaaucacaucguguugaacaaagaugacaacgaggaguguggagacaucuguccggguaccgcgaagggcaagaccaacugccccgccaccgucaucaacgggcaguuugucgaacgauguuggacucauagucacugccagaaaguuugcccgaccaucuguaagucacacggcugcaccgccgaaggccucuguugccacagcgagugccugggcaacuguucucagcccgacgaccccaccaagugcguggccugccgcaacuucuaccuggacggcagguguguggagaccugcccgcccccguacuaccacuuccaggacuggcgcugugugaacuucagcuucugccaggaccugcaccacaaaugcaagaacucgcggaggcagggcugccaccaauacgucauucacaacaacaagugcaucccugagugucccuccggguacacgaugaauuccagcaacuugcugugcaccccaugccugggucccugucccaaggugugccaccuccuagaaggcgagaagaccaucgacucggugacgucugcccaggagcuccgaggaugcaccgucaucaacgggagucugaucaucaacauucgaggaggcaacaaucuggcagcugagcuagaagccaaccucggccucauugaagaaauuucaggguaucuaaaaauccgccgauccuacgcucuggugucacuuuccuucuuccggaaguuacgucugauucgaggagagaccuuggaaauugggaacuacuccuucuaugccuuggacaaccagaaccuaaggcagcucugggacuggagcaaacacaaccucaccaccacucaggggaaacucuucuuccacuauaaccccaaacucugcuugucagaaauccacaagauggaagaaguuucaggaaccaaggggcgccaggagagaaacgacauugcccugaagaccaauggggacaaggcauccugugaaaaugaguuacuuaaauuuucuuacauucggacaucuuuugacaagaucuugcugagaugggagccguacuggccccccgacuuccgagaccucuugggguucaugcuguucuacaaagaggccccuuaucagaaugugacggaguucgaugggcaggaugcgugugguuccaacaguuggacggugguagacauugacccaccccugagguccaacgaccccaaaucacagaaccacccaggguggcugaugcggggucucaagcccuggacccaguaugccaucuuugugaagacccuggucaccuuuucggaugaacgccggaccuauggggccaagagugacaucauuuauguccagacagaugccacca
accccucugugccccuggauccaaucucagugucuaacucaucaucccagauuauucugaaguggaaaccacccuccgaccccaauggcaacaucacccacuaccugguuuucugggagaggcaggcggaagacagugagcuguucgagcuggauuauugccucaaagggcugaagcugcccucgaggaccuggucuccaccauucgagucugaagauucucagaagcacaaccagagugaguaugaggauucggccggcgaaugcugcuccuguccaaagacagacucucagauccugaaggagcuggaggaguccucguuuaggaagacguuugaggauuaccugcacaacgugguuuucguccccagaaaaaccucuucaggcacuggugccgaggacccuaggccaucucggaaacgcaggucccuuggcgauguugggaaugugacgguggccgugcccacgguggcagcuuuccccaacacuuccucgaccagcgugcccacgaguccggaggagcacaggccuuuugagaagguggugaacaaggagucgcuggucaucuccggcuugcgacacuucacgggcuaucgcaucgagcugcaggcuugcaaccaggacaccccugaggaacggugcaguguggcagccuacgucagugcgaggaccaugccugaagccaaggcugaugacauuguuggcccugugacgcaugaaaucuuugagaacaacgucguccacuugauguggcaggagccgaaggagcccaauggucugaucgugcuguaugaagugaguuaucggcgauauggugaugaggagcugcaucucugcgucucccgcaagcacuucgcucuggaacggggcugcaggcugcgugggcugucaccggggaacuacagcgugcgaauccgggccaccucccuugcgggcaacggcucuuggacggaacccaccuauuucuacgugacagacuauuuagacgucccgucaaauauugcaaaaauuaucaucggcccccucaucuuugucuuucucuucaguguugugauuggaaguauuuaucuauuccugagaaagaggcagccagaugggccgcugggaccgcuuuacgcuucuucaaacccugaguaucucagugccagugauguguuuccaugcucuguguacgugccggacgagugggaggugucucgagagaagaucacccuccuucgagagcuggggcagggcuccuucggcaugguguaugagggcaaugccagggacaucaucaagggugaggcagagacccgcguggcggugaagacggucaacgagucagccagucuccgagagcggauugaguuccucaaugaggccucggucaugaagggcuucaccugccaucacguggugcgccuccugggagugguguccaagggccagcccacgcugguggugauggagcugauggcucacggagaccugaagagcuaccuccguucucugcggccagaggcugagaauaauccuggccgcccucccccuacccuucaagagaugauucagauggcggcagagauugcugacgggauggccuaccugaacgccaagaaguuugugcaucgggaccuggcagcgagaaacugcauggucgcccaugauuuuacugucaaaauuggagacuuuggaaugaccagagacaucuaugaaacggauuacuaccggaaagggggcaagggucugcucccuguacgguggauggcaccggagucccugaaggauggggucuucaccacuucuucugacaugugguccuuuggcgugguccuuugggaaaucaccagcuuggcagaacagccuuaccaaggccugucuaaugaacagguguugaaauuugucauggauggaggguaucuggaucaacccgacaacuguccagagagagucacugaccucaugcgcaugugcuggcaauucaaccccaagaugaggccaaccuuccuggagauuguc
aaccugcucaaggacgaccugcaccccagcuuuccagaggugucguucuuccacagcgaggagaacaaggcucccgagagugaggagcuggagauggaguuugaggacauggagaaugugccccuggaccguuccucgcacugucagagggaggaggcggggggccgggauggaggguccucgcuggguuucaagcggagcuacgaggaacacaucccuuacacacacaugaacggaggcaagaaaaacgggcggauucugaccuugccucgguccaauccuuccuaacagugccuaccguggcgggggcgggcagggguucccauuuucgcuuuccucugguuugaaagccucuggaaaacucaggauucucacgacucuaccauguccaguggaguucagagaucguuccuauacauuucuguucaucuuaagguggacucguuugguuaccaauuuaacuaguccugcagaggauuuaacugugaaccuggagggcaagggguuuccacaguugcugcuccuuuggggcaacgacgguuucaaaccaggauuuuguguuuuuucguuccccccacccgcccccagcagauggaaagaaagcaccuguuuuuacaaauucuuuuuuuuuuuuuuuuuuuuuuuuuuuugcuggugucugagcuucaguauaaaagacaaaacuuccuguuuguggaacaaaauuucgaaagaaaaaaccaaa'

Produce in the variable nucleotide_sequence_fasta the nucleotide sequence in FASTA format with the following header:

>M10051-HUM

In [17]:
header = '>' + identifier + '-' + organism
nucleotide_sequence_fasta = format_fasta(header, nucleotide_sequence)

In [18]:
print(nucleotide_sequence_fasta)


>M10051-HUM
ggggggcugcgcggccgggucggugcgcacacgagaaggacgcgcggcccccagcgcucuugggggccgccucggagcau
gacccccgcgggccagcgccgcgcgccugauccgaggagaccccgcgcucccgcagccaugggcaccgggggccggcggg
gggcggcggccgcgccgcugcugguggcgguggccgcgcugcuacugggcgccgcgggccaccuguaccccggagaggug
ugucccggcauggauauccggaacaaccucacuagguugcaugagcuggagaauugcucugucaucgaaggacacuugca
gauacucuugauguucaaaacgaggcccgaagauuuccgagaccucaguuuccccaaacucaucaugaucacugauuacu
ugcugcucuuccgggucuaugggcucgagagccugaaggaccuguuccccaaccucacggucauccggggaucacgacug
uucuuuaacuacgcgcuggucaucuucgagaugguucaccucaaggaacucggccucuacaaccugaugaacaucacccg
ggguucuguccgcaucgagaagaacaaugagcucuguuacuuggccacuaucgacuggucccguauccuggauuccgugg
aggauaaucacaucguguugaacaaagaugacaacgaggaguguggagacaucuguccggguaccgcgaagggcaagacc
aacugccccgccaccgucaucaacgggcaguuugucgaacgauguuggacucauagucacugccagaaaguuugcccgac
caucuguaagucacacggcugcaccgccgaaggccucuguugccacagcgagugccugggcaacuguucucagcccgacg
accccaccaagugcguggccugccgcaacuucuaccuggacggcagguguguggagaccugcccgcccccguacuaccac
uuccaggacuggcgcugugugaacuucagcuucugccaggaccugcaccacaaaugcaagaacucgcggaggcagggcug
ccaccaauacgucauucacaacaacaagugcaucccugagugucccuccggguacacgaugaauuccagcaacuugcugu
gcaccccaugccugggucccugucccaaggugugccaccuccuagaaggcgagaagaccaucgacucggugacgucugcc
caggagcuccgaggaugcaccgucaucaacgggagucugaucaucaacauucgaggaggcaacaaucuggcagcugagcu
agaagccaaccucggccucauugaagaaauuucaggguaucuaaaaauccgccgauccuacgcucuggugucacuuuccu
ucuuccggaaguuacgucugauucgaggagagaccuuggaaauugggaacuacuccuucuaugccuuggacaaccagaac
cuaaggcagcucugggacuggagcaaacacaaccucaccaccacucaggggaaacucuucuuccacuauaaccccaaacu
cugcuugucagaaauccacaagauggaagaaguuucaggaaccaaggggcgccaggagagaaacgacauugcccugaaga
ccaauggggacaaggcauccugugaaaaugaguuacuuaaauuuucuuacauucggacaucuuuugacaagaucuugcug
agaugggagccguacuggccccccgacuuccgagaccucuugggguucaugcuguucuacaaagaggccccuuaucagaa
ugugacggaguucgaugggcaggaugcgugugguuccaacaguuggacggugguagacauugacccaccccugaggucca
acgaccccaaaucacagaaccacccaggguggcugaugcggggucucaagcccuggacccaguaugccaucuuugugaag
acccuggucaccuuuucggaugaacgccggaccuauggggccaagagugacaucauuuauguccagacagaugccaccaa
ccccucugugccccuggauccaaucucagugucuaacucaucaucccagauuauucugaaguggaaaccacccuccgacc
ccaauggcaacaucacccacuaccugguuuucugggagaggcaggcggaagacagugagcuguucgagcuggauuauugc
cucaaagggcugaagcugcccucgaggaccuggucuccaccauucgagucugaagauucucagaagcacaaccagaguga
guaugaggauucggccggcgaaugcugcuccuguccaaagacagacucucagauccugaaggagcuggaggaguccucgu
uuaggaagacguuugaggauuaccugcacaacgugguuuucguccccagaaaaaccucuucaggcacuggugccgaggac
ccuaggccaucucggaaacgcaggucccuuggcgauguugggaaugugacgguggccgugcccacgguggcagcuuuccc
caacacuuccucgaccagcgugcccacgaguccggaggagcacaggccuuuugagaagguggugaacaaggagucgcugg
ucaucuccggcuugcgacacuucacgggcuaucgcaucgagcugcaggcuugcaaccaggacaccccugaggaacggugc
aguguggcagccuacgucagugcgaggaccaugccugaagccaaggcugaugacauuguuggcccugugacgcaugaaau
cuuugagaacaacgucguccacuugauguggcaggagccgaaggagcccaauggucugaucgugcuguaugaagugaguu
aucggcgauauggugaugaggagcugcaucucugcgucucccgcaagcacuucgcucuggaacggggcugcaggcugcgu
gggcugucaccggggaacuacagcgugcgaauccgggccaccucccuugcgggcaacggcucuuggacggaacccaccua
uuucuacgugacagacuauuuagacgucccgucaaauauugcaaaaauuaucaucggcccccucaucuuugucuuucucu
ucaguguugugauuggaaguauuuaucuauuccugagaaagaggcagccagaugggccgcugggaccgcuuuacgcuucu
ucaaacccugaguaucucagugccagugauguguuuccaugcucuguguacgugccggacgagugggaggugucucgaga
gaagaucacccuccuucgagagcuggggcagggcuccuucggcaugguguaugagggcaaugccagggacaucaucaagg
gugaggcagagacccgcguggcggugaagacggucaacgagucagccagucuccgagagcggauugaguuccucaaugag
gccucggucaugaagggcuucaccugccaucacguggugcgccuccugggagugguguccaagggccagcccacgcuggu
ggugauggagcugauggcucacggagaccugaagagcuaccuccguucucugcggccagaggcugagaauaauccuggcc
gcccucccccuacccuucaagagaugauucagauggcggcagagauugcugacgggauggccuaccugaacgccaagaag
uuugugcaucgggaccuggcagcgagaaacugcauggucgcccaugauuuuacugucaaaauuggagacuuuggaaugac
cagagacaucuaugaaacggauuacuaccggaaagggggcaagggucugcucccuguacgguggauggcaccggaguccc
ugaaggauggggucuucaccacuucuucugacaugugguccuuuggcgugguccuuugggaaaucaccagcuuggcagaa
cagccuuaccaaggccugucuaaugaacagguguugaaauuugucauggauggaggguaucuggaucaacccgacaacug
uccagagagagucacugaccucaugcgcaugugcuggcaauucaaccccaagaugaggccaaccuuccuggagauuguca
accugcucaaggacgaccugcaccccagcuuuccagaggugucguucuuccacagcgaggagaacaaggcucccgagagu
gaggagcuggagauggaguuugaggacauggagaaugugccccuggaccguuccucgcacugucagagggaggaggcggg
gggccgggauggaggguccucgcuggguuucaagcggagcuacgaggaacacaucccuuacacacacaugaacggaggca
agaaaaacgggcggauucugaccuugccucgguccaauccuuccuaacagugccuaccguggcgggggcgggcagggguu
cccauuuucgcuuuccucugguuugaaagccucuggaaaacucaggauucucacgacucuaccauguccaguggaguuca
gagaucguuccuauacauuucuguucaucuuaagguggacucguuugguuaccaauuuaacuaguccugcagaggauuua
acugugaaccuggagggcaagggguuuccacaguugcugcuccuuuggggcaacgacgguuucaaaccaggauuuugugu
uuuuucguuccccccacccgcccccagcagauggaaagaaagcaccuguuuuuacaaauucuuuuuuuuuuuuuuuuuuu
uuuuuuuuugcuggugucugagcuucaguauaaaagacaaaacuuccuguuuguggaacaaaauuucgaaagaaaaaacc
aaa
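The output above is wrapped at 80 symbols per row. The format_fasta function used by the notebook is not repeated here; the sketch below is a hypothetical stand-in (wrap_fasta, with an assumed width parameter) that shows one way to obtain this kind of layout by slicing the sequence into fixed-width rows.

```python
def wrap_fasta(header, sequence, width=80):
    """Hypothetical helper: header line followed by fixed-width sequence rows."""
    # Slice the sequence every `width` symbols; the last row may be shorter.
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return '\n'.join([header] + rows)

# Small width so the wrapping is visible on a short toy sequence.
toy_fasta = wrap_fasta('>M10051-HUM', 'acgu' * 5, width=8)
print(toy_fasta)
```

Slicing is used instead of a text-wrapping utility because a sequence contains no whitespace at which to break.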

Producing the protein sequence in FASTA format

Extract into the variable protein_prefix the first part of the protein, found in the record that contains the /translation qualifier:

FT                   /translation="MGTGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLT

In [19]:
s = re.search(r'^FT\s+/translation="(\w+)$', file_str, re.M)
protein_prefix = s.group(1)

In [20]:
protein_prefix


Out[20]:
'MGTGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLT'
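To see why the pattern captures only the first row of the translation, it can be tried on a shortened excerpt. The sketch below uses an invented two-line toy entry (toy_entry is not the real file) laid out like the FT records above.

```python
import re

# Two-line toy excerpt in the same layout as the entry's FT records.
toy_entry = (
    'FT                   /translation="MGTGG\n'
    'FT                   RSNPS"\n'
)

# re.M makes ^ and $ match at each line boundary; the group captures
# only the residues that follow the opening quote on the first line.
match = re.search(r'^FT\s+/translation="(\w+)$', toy_entry, re.M)
toy_prefix = match.group(1) if match else None
print(toy_prefix)
```

Note that the pattern expects the translation to continue on further lines: if the whole protein fit on the /translation line, the closing quote would sit before the end of the line and re.search would return None.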

Extract into the list protein_row_list the remaining records (including the last one) that contain the protein sequence.

FT                   RLHELENCSVIEGHLQILLMFKTRPEDFRDLSFPKLIMITDYLLLFRVYGLESLKDLFP

Pay attention to the last record, which ends with a closing double quote:

FT                   DGGSSLGFKRSYEEHIPYTHMNGGKKNGRILTLPRSNPS"

In [21]:
protein_row_list = re.findall(r'^FT\s+(\w+)"?$', file_str, re.M)

In [22]:
protein_row_list


Out[22]:
['RLHELENCSVIEGHLQILLMFKTRPEDFRDLSFPKLIMITDYLLLFRVYGLESLKDLFP',
 'NLTVIRGSRLFFNYALVIFEMVHLKELGLYNLMNITRGSVRIEKNNELCYLATIDWSRI',
 'LDSVEDNHIVLNKDDNEECGDICPGTAKGKTNCPATVINGQFVERCWTHSHCQKVCPTI',
 'CKSHGCTAEGLCCHSECLGNCSQPDDPTKCVACRNFYLDGRCVETCPPPYYHFQDWRCV',
 'NFSFCQDLHHKCKNSRRQGCHQYVIHNNKCIPECPSGYTMNSSNLLCTPCLGPCPKVCH',
 'LLEGEKTIDSVTSAQELRGCTVINGSLIINIRGGNNLAAELEANLGLIEEISGYLKIRR',
 'SYALVSLSFFRKLRLIRGETLEIGNYSFYALDNQNLRQLWDWSKHNLTTTQGKLFFHYN',
 'PKLCLSEIHKMEEVSGTKGRQERNDIALKTNGDKASCENELLKFSYIRTSFDKILLRWE',
 'PYWPPDFRDLLGFMLFYKEAPYQNVTEFDGQDACGSNSWTVVDIDPPLRSNDPKSQNHP',
 'GWLMRGLKPWTQYAIFVKTLVTFSDERRTYGAKSDIIYVQTDATNPSVPLDPISVSNSS',
 'SQIILKWKPPSDPNGNITHYLVFWERQAEDSELFELDYCLKGLKLPSRTWSPPFESEDS',
 'QKHNQSEYEDSAGECCSCPKTDSQILKELEESSFRKTFEDYLHNVVFVPRKTSSGTGAE',
 'DPRPSRKRRSLGDVGNVTVAVPTVAAFPNTSSTSVPTSPEEHRPFEKVVNKESLVISGL',
 'RHFTGYRIELQACNQDTPEERCSVAAYVSARTMPEAKADDIVGPVTHEIFENNVVHLMW',
 'QEPKEPNGLIVLYEVSYRRYGDEELHLCVSRKHFALERGCRLRGLSPGNYSVRIRATSL',
 'AGNGSWTEPTYFYVTDYLDVPSNIAKIIIGPLIFVFLFSVVIGSIYLFLRKRQPDGPLG',
 'PLYASSNPEYLSASDVFPCSVYVPDEWEVSREKITLLRELGQGSFGMVYEGNARDIIKG',
 'EAETRVAVKTVNESASLRERIEFLNEASVMKGFTCHHVVRLLGVVSKGQPTLVVMELMA',
 'HGDLKSYLRSLRPEAENNPGRPPPTLQEMIQMAAEIADGMAYLNAKKFVHRDLAARNCM',
 'VAHDFTVKIGDFGMTRDIYETDYYRKGGKGLLPVRWMAPESLKDGVFTTSSDMWSFGVV',
 'LWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDNCPERVTDLMRMCWQFNPKMRPTFL',
 'EIVNLLKDDLHPSFPEVSFFHSEENKAPESEELEMEFEDMENVPLDRSSHCQREEAGGR',
 'DGGSSLGFKRSYEEHIPYTHMNGGKKNGRILTLPRSNPS']
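The optional quote in the pattern is what lets the last row match while keeping the quote out of the captured group. A small sketch on an invented toy entry (the CDS line and its coordinates are made up for illustration):

```python
import re

# Toy excerpt: one feature line, the /translation line and two continuation rows.
toy_entry = (
    'FT   CDS             1..30\n'
    'FT                   /translation="MGTGG\n'
    'FT                   RLHEL\n'
    'FT                   RSNPS"\n'
)

# Only lines made of FT, whitespace, word characters and an optional
# closing quote are matched; the quote stays outside the group.
toy_rows = re.findall(r'^FT\s+(\w+)"?$', toy_entry, re.M)
print(toy_rows)
```

The pattern relies on no other FT line consisting of a single run of word characters; a wrapped qualifier value with such a line would also be captured, so the result is worth checking on other entries.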

Insert the prefix found earlier at the head of the list, then concatenate all the blocks of protein_row_list into the variable protein_sequence to obtain the protein sequence.


In [23]:
protein_row_list[:0] = [protein_prefix]
protein_sequence = ''.join(protein_row_list)
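A note on prepending with slice assignment: the right-hand side is treated as an iterable, so a bare string would be spliced in one character at a time. The joined sequence would come out the same, but the list would no longer hold one row per element. A sketch with toy values (all names here are illustrative):

```python
toy_rows = ['RLHEL', 'RSNPS']
toy_prefix = 'MGTGG'

# Slice assignment splices in each item of the right-hand iterable,
# so the prefix is wrapped in a list to add it as a single element.
toy_rows[:0] = [toy_prefix]
toy_protein = ''.join(toy_rows)

print(toy_rows)
print(toy_protein)
```

toy_rows.insert(0, toy_prefix) would be an equivalent way to add the single element.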

In [24]:
protein_sequence


Out[24]:
'MGTGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLTRLHELENCSVIEGHLQILLMFKTRPEDFRDLSFPKLIMITDYLLLFRVYGLESLKDLFPNLTVIRGSRLFFNYALVIFEMVHLKELGLYNLMNITRGSVRIEKNNELCYLATIDWSRILDSVEDNHIVLNKDDNEECGDICPGTAKGKTNCPATVINGQFVERCWTHSHCQKVCPTICKSHGCTAEGLCCHSECLGNCSQPDDPTKCVACRNFYLDGRCVETCPPPYYHFQDWRCVNFSFCQDLHHKCKNSRRQGCHQYVIHNNKCIPECPSGYTMNSSNLLCTPCLGPCPKVCHLLEGEKTIDSVTSAQELRGCTVINGSLIINIRGGNNLAAELEANLGLIEEISGYLKIRRSYALVSLSFFRKLRLIRGETLEIGNYSFYALDNQNLRQLWDWSKHNLTTTQGKLFFHYNPKLCLSEIHKMEEVSGTKGRQERNDIALKTNGDKASCENELLKFSYIRTSFDKILLRWEPYWPPDFRDLLGFMLFYKEAPYQNVTEFDGQDACGSNSWTVVDIDPPLRSNDPKSQNHPGWLMRGLKPWTQYAIFVKTLVTFSDERRTYGAKSDIIYVQTDATNPSVPLDPISVSNSSSQIILKWKPPSDPNGNITHYLVFWERQAEDSELFELDYCLKGLKLPSRTWSPPFESEDSQKHNQSEYEDSAGECCSCPKTDSQILKELEESSFRKTFEDYLHNVVFVPRKTSSGTGAEDPRPSRKRRSLGDVGNVTVAVPTVAAFPNTSSTSVPTSPEEHRPFEKVVNKESLVISGLRHFTGYRIELQACNQDTPEERCSVAAYVSARTMPEAKADDIVGPVTHEIFENNVVHLMWQEPKEPNGLIVLYEVSYRRYGDEELHLCVSRKHFALERGCRLRGLSPGNYSVRIRATSLAGNGSWTEPTYFYVTDYLDVPSNIAKIIIGPLIFVFLFSVVIGSIYLFLRKRQPDGPLGPLYASSNPEYLSASDVFPCSVYVPDEWEVSREKITLLRELGQGSFGMVYEGNARDIIKGEAETRVAVKTVNESASLRERIEFLNEASVMKGFTCHHVVRLLGVVSKGQPTLVVMELMAHGDLKSYLRSLRPEAENNPGRPPPTLQEMIQMAAEIADGMAYLNAKKFVHRDLAARNCMVAHDFTVKIGDFGMTRDIYETDYYRKGGKGLLPVRWMAPESLKDGVFTTSSDMWSFGVVLWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDNCPERVTDLMRMCWQFNPKMRPTFLEIVNLLKDDLHPSFPEVSFFHSEENKAPESEELEMEFEDMENVPLDRSSHCQREEAGGRDGGSSLGFKRSYEEHIPYTHMNGGKKNGRILTLPRSNPS'

Produce in the variable protein_sequence_fasta the protein sequence in FASTA format with the following header:

>M10051-HUM; len = 1382

In [25]:
header = '>' + identifier + '-' + organism + '; len = ' + str(len(protein_sequence))
protein_sequence_fasta = format_fasta(header, protein_sequence)

In [26]:
print(protein_sequence_fasta)


>M10051-HUM; len = 1382
MGTGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLTRLHELENCSVIEGHLQILLMFKTRPEDFRDLSFPK
LIMITDYLLLFRVYGLESLKDLFPNLTVIRGSRLFFNYALVIFEMVHLKELGLYNLMNITRGSVRIEKNNELCYLATIDW
SRILDSVEDNHIVLNKDDNEECGDICPGTAKGKTNCPATVINGQFVERCWTHSHCQKVCPTICKSHGCTAEGLCCHSECL
GNCSQPDDPTKCVACRNFYLDGRCVETCPPPYYHFQDWRCVNFSFCQDLHHKCKNSRRQGCHQYVIHNNKCIPECPSGYT
MNSSNLLCTPCLGPCPKVCHLLEGEKTIDSVTSAQELRGCTVINGSLIINIRGGNNLAAELEANLGLIEEISGYLKIRRS
YALVSLSFFRKLRLIRGETLEIGNYSFYALDNQNLRQLWDWSKHNLTTTQGKLFFHYNPKLCLSEIHKMEEVSGTKGRQE
RNDIALKTNGDKASCENELLKFSYIRTSFDKILLRWEPYWPPDFRDLLGFMLFYKEAPYQNVTEFDGQDACGSNSWTVVD
IDPPLRSNDPKSQNHPGWLMRGLKPWTQYAIFVKTLVTFSDERRTYGAKSDIIYVQTDATNPSVPLDPISVSNSSSQIIL
KWKPPSDPNGNITHYLVFWERQAEDSELFELDYCLKGLKLPSRTWSPPFESEDSQKHNQSEYEDSAGECCSCPKTDSQIL
KELEESSFRKTFEDYLHNVVFVPRKTSSGTGAEDPRPSRKRRSLGDVGNVTVAVPTVAAFPNTSSTSVPTSPEEHRPFEK
VVNKESLVISGLRHFTGYRIELQACNQDTPEERCSVAAYVSARTMPEAKADDIVGPVTHEIFENNVVHLMWQEPKEPNGL
IVLYEVSYRRYGDEELHLCVSRKHFALERGCRLRGLSPGNYSVRIRATSLAGNGSWTEPTYFYVTDYLDVPSNIAKIIIG
PLIFVFLFSVVIGSIYLFLRKRQPDGPLGPLYASSNPEYLSASDVFPCSVYVPDEWEVSREKITLLRELGQGSFGMVYEG
NARDIIKGEAETRVAVKTVNESASLRERIEFLNEASVMKGFTCHHVVRLLGVVSKGQPTLVVMELMAHGDLKSYLRSLRP
EAENNPGRPPPTLQEMIQMAAEIADGMAYLNAKKFVHRDLAARNCMVAHDFTVKIGDFGMTRDIYETDYYRKGGKGLLPV
RWMAPESLKDGVFTTSSDMWSFGVVLWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDNCPERVTDLMRMCWQFNPKMR
PTFLEIVNLLKDDLHPSFPEVSFFHSEENKAPESEELEMEFEDMENVPLDRSSHCQREEAGGRDGGSSLGFKRSYEEHIP
YTHMNGGKKNGRILTLPRSNPS
