Imagine that you are a grad student who has just begun work in a mouse lab. Your advisor, Dr. Stoker, works on a novel mouse phenotype, which has been dubbed Vampiric. Mice with this phenotype have the physiological feature of exceptionally sharp teeth and the behavioral feature of biting other mice. Dr. Stoker has developed a genetic mutagenesis screen for this phenotype using sunlight as a negative selector. In one strain of Vampiric mice, the mutation has been narrowed to an approximately 250-kilobase (kb) region of chromosome 12. Your first task is to investigate what is known about candidate genes in this region.
Part I. Explore the basic functionality of the UCSC Genome Browser.
Part II. Investigate the genes in Dr. Stoker’s candidate region.
To answer the following three questions, scroll down to “Genes and Gene Prediction Tracks” and set RefSeq Genes to “pack” and GenScan Genes to “dense”. Then scroll down to “mRNA and EST Tracks” and set Mouse mRNAs to “pack”. You may want to hide other tracks to simplify the view.
Q1. What RefSeq Mouse Genes are in this region?
RefSeq genes are a good place to start your search for candidate genes. However, it is possible that the mutation for Vampiric is in an unknown gene. Note that the Mouse mRNA track has many more annotations than the RefSeq track. RefSeq is a manually curated, non-redundant gene database, while GenBank is a larger database with potentially redundant experimental data. For each gene in RefSeq there are typically one or more corresponding mRNAs in GenBank that align to the same position in the genome. However, some mRNAs in GenBank have not been confirmed as real genes and therefore have no corresponding RefSeq entry. Such mRNAs are further candidates for your mutation.
Q2. Name one GenBank mRNA in this region that does not correspond to a RefSeq gene. (Hint: Zoom in to see which mRNAs overlap with the smaller genes.)
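The overlap check suggested in the hint can also be expressed as a short script. The following is a minimal sketch, not part of the browser itself: the coordinates and the names mRNA_A, mRNA_B, and mRNA_C are hypothetical placeholders, and intervals are assumed to be half-open (start, end) pairs on the same chromosome.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end), half-open, all on one chromosome


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if the two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def unmatched(mrnas: Dict[str, Interval], refseq: List[Interval]) -> List[str]:
    """Names of mRNAs that overlap none of the RefSeq gene intervals."""
    return [
        name
        for name, interval in mrnas.items()
        if not any(overlaps(interval, gene) for gene in refseq)
    ]


# Hypothetical coordinates, for illustration only (not real annotations).
refseq_genes = [(1000, 5000), (9000, 12000)]
mrnas = {
    "mRNA_A": (1200, 4800),    # lies inside the first gene
    "mRNA_B": (6000, 7000),    # falls between the two genes
    "mRNA_C": (11500, 13000),  # overlaps the end of the second gene
}

print(unmatched(mrnas, refseq_genes))  # ['mRNA_B']
```

An mRNA reported by this check is only a candidate; you would still confirm in the browser that it does not belong to a RefSeq gene on the same strand.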
You should also consider the possibility that your mutation is in a gene that has never been characterized experimentally. GenScan is a computational tool for predicting genes, which we will discuss in more detail later in this course. For now, just take a look at the annotation track for genes predicted by GenScan. You should always take these predictions with a grain of salt, but they may be useful if you don’t get any interesting results from known genes or mRNAs.
Q3. How many GenScan predicted genes are in this region?
Part III. Get detailed information about one of the candidate genes, Pax9.
Set the UCSC Genes track to “pack” and click on Pax9 to get more information. Now it’s time to look at what is known about our candidate genes. The RefSeq genes will have the most useful information, as they are often based on multiple experiments and have been validated in some way.
Q4. What is the RefSeq Accession Number for Pax9?
Q5. What is the transcript size (the number of base pairs in the entire transcribed mRNA, including introns and untranslated regions) of Pax9?
Q6. Is Pax9 on the forward or reverse strand of chromosome 12?
Click on the “Genome Browser” link. This will take you back to the track view, but will now be zoomed in on this specific gene. The intron/exon structure of this gene is clearly shown in the RefSeq track. The thickest lines represent exons, the medium lines represent untranslated regions, and the thin lines represent introns, with arrows indicating the direction of transcription.
Q7. How many exons are in Pax9?
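If you export gene annotations as a table instead of reading them off the track view, the exon structure is usually given as two comma-separated coordinate lists. The sketch below assumes that layout (exon starts and exon ends as strings with a trailing comma, 0-based half-open coordinates). The function names and the numbers are hypothetical and are not the real Pax9 coordinates.

```python
from typing import Dict, List


def parse_coords(field: str) -> List[int]:
    """Parse a comma-separated coordinate list such as '100,400,800,'."""
    return [int(value) for value in field.split(",") if value]


def exon_summary(tx_start: int, tx_end: int,
                 exon_starts: str, exon_ends: str) -> Dict[str, int]:
    """Summarize exon count, genomic span, and spliced length of a transcript."""
    starts = parse_coords(exon_starts)
    ends = parse_coords(exon_ends)
    if len(starts) != len(ends):
        raise ValueError("exon start and end lists differ in length")
    exon_lengths = [end - start for start, end in zip(starts, ends)]
    return {
        "exon_count": len(starts),
        "genomic_span": tx_end - tx_start,    # includes introns
        "spliced_length": sum(exon_lengths),  # exons only
    }


# Hypothetical transcript with three exons (illustration only).
summary = exon_summary(100, 1000, "100,400,800,", "250,550,1000,")
print(summary)
# {'exon_count': 3, 'genomic_span': 900, 'spliced_length': 500}
```

The genomic span corresponds to the transcript size as defined in Q5 (introns included), while the spliced length counts exons only.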
Return to the Pax9 information page and click on the link embedded in the gene ID under “Entrez Gene”. This brings you to an NCBI page with more detailed information on this gene. The NCBI Entrez Gene database is linked to GenBank, PubMed, and many other useful databases. Scroll down to the section marked “Genomic regions, transcripts, and products”. You should see an image made up of a green line for genomic information, a purple line for mRNA information, and a red line for protein information. If you only see green lines (genes), click on them and each green line will expand into a purple line and a red line. Reconfirm the transcript length and the number of exons in Pax9 by mousing over the purple line.
Q8. How many nucleotides are there in the Pax9 mRNA? If there are multiple isoforms, choose the one corresponding to the RefSeq transcript we looked at in the UCSC Browser. How many amino acids are there in the Pax9 protein? Do these numbers follow the 3:1 ratio (3 mRNA nucleotides code for 1 amino acid)? If not, do you know which biological process causes the discrepancy?
Right-click on the purple line and go to the “View & Tools” option. You will see many options for displaying the genomic, mRNA, and protein information of Pax9 in different formats. Follow the link to FASTA View; you should obtain the same sequence information as found on the Genome Browser site.
Continue scrolling down the NCBI gene page and investigating Pax9. Note that the GeneRIF, Gene Ontology, Interactions, and Summary sections provide useful information on what is known about the function of this gene. Use the GeneRIF, Gene Ontology, and Interactions sections to answer the following questions about Pax9:
Q9. What evidence (if any) supports a Pax9 mutation as a likely cause of the Vampiric phenotype?
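The ratio check in Q8 is simple arithmetic. As a minimal sketch, the helper below divides an mRNA length by a protein length; the function name and the input numbers are made up for illustration and are not the Pax9 values you are asked to look up.

```python
def nucleotides_per_residue(mrna_nt: int, protein_aa: int) -> float:
    """Return the number of mRNA nucleotides per amino acid in the protein."""
    if protein_aa <= 0:
        raise ValueError("protein length must be positive")
    return mrna_nt / protein_aa


# Hypothetical lengths (illustration only): a 2,000 nt mRNA and a 300 aa protein.
ratio = nucleotides_per_residue(2000, 300)
print(round(ratio, 2))  # 6.67, which is well above 3
```

A ratio above 3 means the mRNA contains more nucleotides than are needed to encode the protein, which is the discrepancy Q8 asks you to explain.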
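Comparing two FASTA downloads by eye is error-prone, because header lines and line wrapping often differ between sites. The sketch below shows one way to compare them, assuming each file holds a single record; the two example records are short made-up sequences, not Pax9.

```python
def read_fasta_sequence(text: str) -> str:
    """Return the sequence of a single-record FASTA string.

    Header lines (starting with '>') are skipped, line breaks are removed,
    and the result is upper-cased so that soft-masked bases compare equal.
    """
    lines = text.strip().splitlines()
    return "".join(
        line.strip() for line in lines if not line.startswith(">")
    ).upper()


# Two hypothetical downloads of the same sequence with different headers,
# different line wrapping, and different letter case.
ncbi_copy = ">record from site A\nATGGAGCC\nAGCCTTCG\n"
ucsc_copy = ">record from site B\natggagccagcc\nttcg\n"

same = read_fasta_sequence(ncbi_copy) == read_fasta_sequence(ucsc_copy)
print(same)  # True
```

If the two sequences differ, check first that both downloads refer to the same transcript and the same strand.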
Q10. What other basic biological function (if any) does Pax9 have?
Q11. What genes or proteins (if any) does Pax9 interact with? (Hint: look under heading Interactions)
Part IV. Get information on Pax9 in other (non-mouse) species.
Imagine that you are able to confirm that Pax9 is in fact responsible for the Vampiric phenotype in Mouse. Now Dr. Stoker wants to try inducing this phenotype in other organisms. He asks you to find out which species have Pax9 genes in GenBank.
Dr. Stoker suggests trying to induce the Vampiric phenotype in human subjects, but you point out that this would violate ethical guidelines. He agrees and suggests instead that you use the species Gallus gallus (Chicken).
Q12. What chromosome is Pax9 on in Chicken?
Q13. If you wanted to go back and view this region of the chicken genome in UCSC Genome Browser, what search string would you use?
List of Genomic Databases
NCBI Entrez - http://www.ncbi.nlm.nih.gov/sites/gquery - huge database that encompasses other databases, including:
ExPASy - http://expasy.org/ - Another large database encompassing other databases:
ENSEMBL - http://useast.ensembl.org/index.html - An alternative to RefSeq and UniProt
Genome Browser - http://genome.ucsc.edu/ - Track-based portal to databases of genomic sequence and annotations
GeneCards - http://www.genecards.org/ - Gene-centered portal to information from many other databases
ENCODE - http://www.genome.gov/10005107 - Encyclopedia of DNA Elements
HapMap - http://hapmap.ncbi.nlm.nih.gov/ - Database of human variation across populations
Gene Ontology (GO) - http://www.geneontology.org/ - Hierarchy of gene annotations
MGED - http://www.mged.org/ - Database of gene expression/microarray results
This list is by no means complete. For more databases, see the most recent Database Summary Paper Alpha List: http://www.oxfordjournals.org/nar/database/a/
Your colleague has just finished an extensive karyotyping study across samples from many different types of human cancers. She specifically looked for regions of the genome that have a statistically significant rate of chromosomal aberrations (including inversions, deletions, and translocations). She has asked you to help her analyze her results, starting with a region she identified on chromosome 6, ranging from base pairs 108,510,000 to 109,500,000 using NCBI build 36 (hg18). Use UCSC Genome Browser and/or other public databases to view information about known genes in this region.
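The region above can be typed into the browser's search box as a position string of the form chromosome:start-end. As a minimal sketch, the helpers below build that string and a direct link; the function names are illustrative, and the link assumes the browser's hgTracks page accepts `db` and `position` query parameters and that the hg18 assembly is still available there.

```python
from urllib.parse import urlencode


def position_string(chrom: str, start: int, end: int) -> str:
    """Format a region as a search string with thousands separators."""
    return f"{chrom}:{start:,}-{end:,}"


def browser_url(db: str, chrom: str, start: int, end: int) -> str:
    """Build a Genome Browser link for the region (URL layout is an assumption)."""
    query = urlencode({"db": db, "position": f"{chrom}:{start}-{end}"})
    return "https://genome.ucsc.edu/cgi-bin/hgTracks?" + query


# Region from the problem statement: chromosome 6, NCBI build 36 (hg18).
region = position_string("chr6", 108_510_000, 109_500_000)
print(region)                          # chr6:108,510,000-109,500,000
print(109_500_000 - 108_510_000)       # 990000 (region size in base pairs)
print(browser_url("hg18", "chr6", 108_510_000, 109_500_000))
```

Always confirm that the assembly shown in the browser matches the build your coordinates were reported in, since the same numbers point to different sequence in other builds.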
Q1. What are the genes in this region? (3 points)
Q2. Hypothesize which gene is the most likely candidate to be related to human cancers, and provide evidence from at least 3 different public databases. Be sure to include the URL of each database entry on which you base your answer. (7 points) [Hint: There is a “Phenotype and Disease Associations” category in the UCSC Genome Browser.]