In-Class: Having a blast with BLAST & Alignments

Part I: Using BLAST

After running BLAST on the sequence in the pre-lab, you discovered that it is in fact a gene you're interested in, just from the wrong species. However, you don't have time to sequence an uncontaminated sample, so you want to obtain the reference frog sequence instead. Next, you will try to identify the Xenopus laevis (frog) homolog of this gene by doing a “discontiguous megablast” search, which is an appropriate tool for finding more distant homologs.

  • Go to the BLAST home page (http://blast.ncbi.nlm.nih.gov/) and copy and paste the sequence from the pre-lab, as before.
  • Change the “Program Selection” option to “More dissimilar sequences (discontiguous megablast)”.
  • Simplify your search by setting the “Organism” to “Xenopus laevis” (taxid: 8355) under the “Choose Search Set” option.
  • In the section labeled “Choose Search Set”, click the “Database” option button for “Others”, and then set the drop-down list to “Reference RNA sequences (refseq_rna)”.
  • Click the “BLAST” button to search.

Base your answers on the top hit returned by BLAST that isn't a PREDICTED mRNA:

Question 3: How much of the query is covered by the top returned hit? What is the percent identity between the two?

Question 4: What does the “E Value” of this match mean?

One simple way to infer homologs (sequences with a common evolutionary origin) is to perform a reciprocal blast. For example, if 1) blasting human gene A against the chicken genome returns chicken gene B as the best match, and 2) blasting chicken gene B against the human genome returns human gene A as the best match, then human gene A and chicken gene B are putative orthologs. You are hopeful that your research on this gene in frog can be applied to the human gene. This application will be especially clear if the genes are orthologs of each other, so you want to perform a reciprocal BLAST.
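The reciprocal-best-hit logic described above can be sketched in a few lines of Python. This is only an illustration of the decision rule, not part of the BLAST website workflow: the gene names, the bit scores, and the helper functions `best_hit` and `is_reciprocal_best_hit` are all hypothetical, and the hit lists are assumed to have been copied by hand from two BLAST result tables.

```python
def best_hit(hits):
    """Return the subject ID with the highest bit score, or None if there are no hits.

    `hits` is a list of (subject_id, bit_score) tuples (made-up values below).
    """
    if not hits:
        return None
    return max(hits, key=lambda hit: hit[1])[0]


def is_reciprocal_best_hit(gene_a, forward_hits, reverse_hits_by_gene):
    """Check whether gene_a and its best forward hit are each other's best match."""
    gene_b = best_hit(forward_hits)
    if gene_b is None:
        return False
    # Look up the hits obtained when gene_b was searched against species A.
    return best_hit(reverse_hits_by_gene.get(gene_b, [])) == gene_a


# Hypothetical example: human gene A searched against chicken, then back again.
forward = [("chicken_B", 850.0), ("chicken_C", 310.0)]
reverse = {"chicken_B": [("human_A", 845.0), ("human_D", 120.0)]}

print(is_reciprocal_best_hit("human_A", forward, reverse))
```

If the reverse search had returned a different human gene as its top hit, the function would return False and the pair would not be called putative orthologs by this criterion.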

Task: We know that blasting our "sequence of mystery" against the Xenopus genome returns NM_001088537.1. What is the best hit when blasting NM_001088537.1 against the human genome?

  • From the results of your previous BLAST, click on the “NM_001088537.1” link in the 'Accession' column. Then, click on “FASTA” in the upper part of the page and either copy the FASTA part of the webpage and paste it into your Textpad / Wordpad, or download by clicking on “Send:” on the top-right, then “Choose Destination” -> “File” -> Format -> FASTA -> Create File.

  • Return to the Nucleotide BLAST window.

  • Paste in the new sequence from Xenopus laevis.

  • Change the “Organism” to “human” (taxid: 9606) under the “Choose Search Set” option.

  • Click the “BLAST” button to search.

Question 5: Are the gene represented by our "sequence of mystery" from the pre-lab and NM_001088537.1 putative orthologs?

Task: Now try identifying a sequence on your own. Use the following sequence:

>GCB535, Unknown Protein B
MNLRFELQKLLNVCFLFASAYMFWQGLAIATNSASPIVVVLSGSMEPAFQRGDILFLWNRNTFNQVGDVV
VYEVEGKQIPIVHRVLRQHNNHADKQFLLTKGDNNAGNDISLYANKKIYLNKSKEIVGTVKGYFPQLGYI
TIWISENKYAKFALLGMLGLSALLGGE

  • Select an appropriate search tool and database to identify the sequence. Keep in mind what type of sequence it is. You may consider the top hit to be the correct one, although you should also note that most of the other hits are conserved proteins in other species.
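If you are unsure what type of sequence you have been handed, a rough composition check can help. The sketch below is a heuristic only, under the assumption that a sequence made up almost entirely of A, C, G, T, U and N is nucleotide; the 0.9 threshold, the function names, and the two toy sequences are hypothetical choices for illustration and are not taken from any BLAST tool.

```python
def parse_fasta_record(text):
    """Split a single FASTA record into (header, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA record starting with '>'")
    header = lines[0][1:]
    sequence = "".join(lines[1:]).upper()
    return header, sequence


def guess_sequence_type(sequence, threshold=0.9):
    """Guess 'nucleotide' or 'protein' from letter composition (heuristic only)."""
    if not sequence:
        raise ValueError("empty sequence")
    nucleotide_letters = sum(sequence.count(letter) for letter in "ACGTUN")
    fraction = nucleotide_letters / len(sequence)
    return "nucleotide" if fraction >= threshold else "protein"


# Toy records (made up for this example, not from the lab).
toy_dna = ">toy_dna\nATGGCCATTGTAATGGGCCGC\n"
toy_protein = ">toy_protein\nMKTAYIAKQRQISFVKSHFSRQ\n"

for record in (toy_dna, toy_protein):
    header, sequence = parse_fasta_record(record)
    print(header, guess_sequence_type(sequence))
```

A short or unusual protein could in principle fool this check, so treat the result as a prompt to look at the sequence yourself rather than as a definitive answer.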

Question 6: What protein is the best match for the above sequence?

Question 7: What species is the best match protein from?

Part II: EMBOSS Pairwise Sequence Alignment Tool: Global and Local Alignments

In an alignment, nucleotides or amino acids are arranged so that residues with the same evolutionary origin line up in the same column. Aligning known homologs is biologically useful for determining which regions are conserved, and therefore more likely to be functionally important.

Next you will experiment with aligning two known homologs against each other. Below are the GLUT1 protein sequences for mouse and fly:

>gi|22094111|ref|NP_035530.1| solute carrier family 2 (facilitated glucose transporter), member 1 [Mus musculus]
MDPSSKKVTGRLMLAVGGAVLGSLQFGYNTGVINAPQKVIEEFYNQTWNHRIGEPIPSTTLTTLWSLSVA
IFSVGGMIGSFSVGLFVNRFGRRNSMLMMNLLAFVAAVLMGFSKLGKSFEMLILGRFIIGVYCGLTTGFV
PMYVGEVSPTALRGALGTLHQLGIVVGILIAQVFGLDSIMGNADLWPLLLSVVFVPALLQCILLPFCPES
PRFLLINRNEENRAKSVLKKLRGTADVTRDLQEMKEEGRQMMREKKVTILELFRSPAYRQPILIAVVLQL
SQQLSGINAVFYYSTSIFEKAGVQQPVYATIGSGIVNTAFTVVSLFVVERAGRRTLHLIGLAGMAGCAVL
MTIALALLERLPWMSYLSIVAIFGFVAFFEVGPGPIPWFIVAELFSQGPRPARIAVAGFSNWTSNFIVGM
CFQYVEQLCGPYVFIIFTVLLVLFFIFTYFKVPETKGRTFDEIASGFRQGGASQSDKTPEELFHPLGADS
QV
>gi|45551496|ref|NP_728558.2| Glucose transporter 1 CG1086-PA, isoform A [Drosophila melanogaster]
MCTAGQNNDMATIGDLSMISPPTSSISNDQDPFGQLPPLPPPLRSTQVLQPLSVFPVSNLSEDSYDYVFG
GRRKTPPTTTSTQLKLTSPPVRLRPEDAYRGANINNGRFYRHSFSYAPKRQRHSSRDDRDRESRLRCHGE
DEATLRQLLLDLQKQVSVMSMNLSAKLDELQRGDRHLETTVALCEIRTQLQELTKSVESCQSEVSEVKRD
MVAIKHELDTVQQVKEEIEELREYVDRLEEHTHRRKLRLLEQGLTFFLTYSIFSAVLGMLQFGYNTGVIN
APEKNIENFMKDVYKDRYGEDISEEFIQQLYSVAVSIFAIGGMLGGFSGGWMANRFGRKGGLLLNNVLGI
AGACLMGFTKVSHSYEMLFLGRFIIGVNCGLNTSLVPMYISEIAPLNLRGGLGTVNQLAVTVGLLLSQVL
GIEQILGTNEGWPILLGLAICPAILQLILLPVCPESPRYLLITKQWEEEARKALRRLRASGSVEEDIEEM
RAEERAQQSESHISTMELICSPTLRPPLIIGIVMQLSQQFSGINAVFYYSTSLFMSSGLTEESAKFATIG
IGAIMVVMTLVSIPLMDRTGRRTLHLYGLGGMFIFSIFITISFLIKASIVVSRILESCSCSCRVMPANVN
AKMPASLGLHLFVPRPFSDLHMTLKS
  • Navigate to the EMBOSS Pairwise Alignment Tool available at the URL: http://www.ebi.ac.uk/Tools/psa/

  • Click on the “Protein” button under the “Global Alignment” and “Needle” section.

  • Paste or upload the two sequences above to align the mouse and fly homologs of GLUT1.

  • Expand “STEP 2 - Set your pairwise alignment options” by clicking on “More options...”. Configure the alignment to use the Needleman-Wunsch algorithm with a “GAP OPEN” penalty of 10 and a “GAP EXTEND” penalty of 1.0, and make sure you are using the “BLOSUM62” substitution matrix.

  • Click “Submit” to start the search.

Question 8: What is the length of the global alignment? (Hint: % Identity, Similarity, and Gaps are all reported as fractions of the total alignment length)
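To make the hint concrete, the sketch below computes identity and gap percentages from two already-aligned strings, using the total number of alignment columns as the denominator. The toy alignment and the function name `alignment_stats` are hypothetical; similarity is left out because it depends on the substitution matrix.

```python
def alignment_stats(aligned_a, aligned_b):
    """Summarise a pairwise alignment given as two equal-length gapped strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    length = len(aligned_a)
    if length == 0:
        raise ValueError("empty alignment")
    columns = list(zip(aligned_a, aligned_b))
    # A column is identical when both residues match and neither is a gap.
    identical = sum(1 for a, b in columns if a == b and a != "-")
    # A column counts as a gap when either sequence has '-' there.
    gaps = sum(1 for a, b in columns if a == "-" or b == "-")
    return {
        "length": length,
        "identical": identical,
        "gaps": gaps,
        "identity_pct": 100.0 * identical / length,
        "gaps_pct": 100.0 * gaps / length,
    }


# Toy alignment (made up for illustration).
stats = alignment_stats("HEAGAWGHE-E", "--PA-W-HEAE")
print(stats["length"], stats["identical"], stats["gaps"])
print(round(stats["identity_pct"], 1), round(stats["gaps_pct"], 1))
```

Because every percentage shares the same denominator, you can work backwards from any reported "n/total" fraction in the output to the alignment length.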

Question 9: What is the “Score” and how was it calculated? What does it correspond to in the dynamic programming grids from the lecture?

  • Now run the alignment again, this time using the Smith-Waterman algorithm (local alignment): Go back to the front page (http://www.ebi.ac.uk/Tools/psa/), and click on the “Protein” button under the “Local Alignment” and “Water” section.

  • Check the default settings under “STEP 2 - Set your pairwise alignment options” by clicking on “More options...”. Choose a “GAP OPEN” penalty of 10, a “GAP EXTEND” penalty of 1.0, and the “BLOSUM62” substitution matrix.

  • Run the alignment again.

Question 10: What is the length of this new alignment? What is the gap percentage of this alignment?

Question 11: Why are the lengths different using the two different methods?

Now, explore the effects of changing the gap-opening penalty.

Go back to the alignment form by using the back button of your browser. Under “STEP 2 - Set your pairwise alignment options”, change the “GAP OPEN” penalty from 10 to 1. Run the alignment again.

Question 12: What is the gap percentage of this alignment (with a GAP OPEN penalty of 1)? Does the gap percentage increase or decrease compared to the previous alignment?

Question 13: Which of these two alignments is more biologically relevant? Why?

Hint: In terms of mutational process, are many small gaps or one large gap more likely?
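To see how the gap-opening penalty shapes the answer, it can help to compare the total penalty for one long gap against several short ones. The sketch below assumes an affine convention in which a gap of length n costs GAP OPEN + (n - 1) × GAP EXTEND; this convention is an assumption made for illustration, and the exact formula used by a given tool should be checked in its documentation. The function names are hypothetical.

```python
def gap_cost(length, gap_open, gap_extend):
    """Penalty for a single gap of `length` residues under the assumed affine scheme."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + (length - 1) * gap_extend


def total_gap_cost(gap_lengths, gap_open, gap_extend):
    """Sum the penalties for several separate gaps."""
    return sum(gap_cost(n, gap_open, gap_extend) for n in gap_lengths)


one_long_gap = [6]          # a single gap spanning six residues
six_short_gaps = [1] * 6    # six separate single-residue gaps

for gap_open in (10, 1):
    long_cost = total_gap_cost(one_long_gap, gap_open, 1.0)
    short_cost = total_gap_cost(six_short_gaps, gap_open, 1.0)
    print(gap_open, long_cost, short_cost)
```

Under this assumed convention, a GAP OPEN of 10 makes six scattered gaps far more expensive than one long gap, while a GAP OPEN of 1 (equal to GAP EXTEND) removes that difference entirely.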

Question 14: Now, change the substitution matrix to BLOSUM45. What is the new score?

Question 15: Next, change the substitution matrix to BLOSUM80. What is the new score?

Question 16: Does BLOSUM45 or BLOSUM80 result in a higher score? Did this surprise you initially? Does a higher raw score necessarily indicate a better alignment? Why or why not?

Question 17: Which substitution matrix (BLOSUM80 or BLOSUM45) would be preferable for aligning mouse and rat sequences? Why?

Question 18: Rerun this analysis using the PAM30 substitution matrix. What is the % identity, % similarity and total score of the resulting alignment?

Question 19: Rerun this analysis using the PAM70 substitution matrix. What is the % identity, % similarity and total score of the resulting alignment?

Question 20: Why does the PAM70 matrix result in a stronger alignment? In general, what is the difference between PAM matrices and BLOSUM matrices? Hint: look at the biological model used to generate the PAM matrices. See more information here: http://cshprotocols.cshlp.org/content/2008/6/pdb.ip59.long

Homework:

Task: Contained in homework13_mySeq.fasta is a nucleotide sequence from humans. You want to get the protein sequence of the corresponding mouse gene. Hint: You can do this in only one step using one of the programs under the Web BLAST heading.

Question 21: What is the name of this protein?

Question 22: Copy and paste the amino acid sequence from the corresponding mouse gene below.

Question 23: When you search for a protein sequence, BLAST will show you the domains your query sequence contains at the top of the page. A domain is a protein subunit with a defined structure that is found in the context of many different genes. What are the two longest domains found in this gene, according to BLAST?