02 - UniProt

Introduction

`UniProt` is a comprehensive protein sequence and annotation resource, maintained by a consortium of the European Bioinformatics Institute (EMBL-EBI), the Swiss Institute of Bioinformatics (SIB), and the Protein Information Resource (PIR). `UniProt` provides a database that unifies several legacy databases, including Swiss-Prot, TrEMBL, iProClass and the PIR-PSD.

UniProt provides three key databases:

UniProtKB is likely to be the database you use most frequently to find information on gene product/protein molecular function. It is the central hub of functional information on proteins, and collates functional annotations from many other databases, ontologies, and references. It keeps records of how annotations are derived (e.g. experimentally or computationally), and is divided into two sections: one contains manually-annotated records (UniProtKB/Swiss-Prot), and a second contains computational annotations that are waiting for manual curation (UniProtKB/TrEMBL).

UniRef provides clustered sets of sequences from UniProtKB and UniParc. A number of clusterings at different stringencies are provided.

UniParc is a comprehensive, non-redundant database that contains most of the publicly-available protein sequences from a range of sources.

These databases can be queried in a number of ways, including:

  • At the UniProt website http://www.uniprot.org/ in your web browser
  • Sending requests to the UniProt website, using a programming language
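
As a minimal sketch of the second route, the Python below builds a search URL and, optionally, fetches it using only the standard library. The endpoint `https://rest.uniprot.org/uniprotkb/search` and the `query`, `format`, and `size` parameters are assumptions based on UniProt's current REST interface, and may not match the version of the site you are working with. The names `build_search_url` and `fetch` are illustrative only.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint for UniProtKB searches; check the UniProt help pages
# for the address that matches the version of the site you are using.
SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(query, fmt="tsv", size=10):
    """Return a search URL for the given query string (hypothetical helper)."""
    params = {"query": query, "format": fmt, "size": size}
    return f"{SEARCH_URL}?{urlencode(params)}"


def fetch(url, timeout=30):
    """Send the request and return the response body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    url = build_search_url("Kitasatospora")
    print(url)
    # print(fetch(url))  # uncomment to send the request (needs network access)
```

Keeping URL construction separate from the network call makes it easy to inspect the request before sending it, which helps when checking a query against what the website's search bar shows.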

Resources

The UniProt website

The landing page offers options for each of the three main databases: UniProtKB, UniRef, and UniParc. It also offers sets of complete proteomes for a range of organisms, and databases of proteins organised by supporting data, such as literature, taxonomic classification, and subcellular location.

Using UniProtKB

  • Click on the UniProtKB link. This will take you to the UniProtKB front page, with a summary of entries, and a number of links.

QUESTIONS
  1. How many records are in UniProtKB today?
  2. How many of those records have been manually reviewed? What proportion of the total database has been manually reviewed?
  3. Which organisms are most highly represented in the database, today?
  • Enter the word Kitasatospora in the search bar at the top of the page, and click on the Search button.

QUESTIONS
  1. How many entries are returned?
  2. How many of those entries have been manually reviewed? What proportion of the total is this?

Filtering Results

  • On the left of the page there's an option to filter "kitasatospora" as an organism or by taxonomy. Click on the organism filter.

QUESTIONS
  1. How many entries are returned?
  2. How many of those entries have been manually reviewed?
  3. How has the search term in the top bar changed? NOTE: these search term changes will be useful for querying UniProt programmatically.
  • At the top left of the page there's an option to filter only the manually reviewed entries. Click on this filter.
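
The way each filter adds a term to the search bar can be mimicked in code. The sketch below joins fielded clauses with a boolean operator to produce a query string. The field names `organism` and `reviewed`, and the value `yes`, are placeholders: use whatever text actually appears in your search bar after clicking each filter. The helpers `field_clause` and `combine` are hypothetical names introduced for this example.

```python
def field_clause(field, value):
    """Return a 'field:value' clause, quoting values that contain spaces."""
    text = str(value)
    if " " in text:
        text = f'"{text}"'
    return f"{field}:{text}"


def combine(clauses, operator="AND"):
    """Join clauses with a boolean operator, wrapping each in parentheses."""
    return f" {operator} ".join(f"({clause})" for clause in clauses)


# Field names below are placeholders: replace them with the ones shown
# in the search bar after you click each filter.
query = combine([
    field_clause("organism", "kitasatospora"),
    field_clause("reviewed", "yes"),
])
print(query)  # (organism:kitasatospora) AND (reviewed:yes)
```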

Inspecting an Entry

  • Click on the link/accession for the topmost entry.

QUESTIONS
  1. What kind of evidence is there for protein function?
  2. Kinetic information for this enzyme is drawn from which other database(s)?
  3. Are any protein structures available for this enzyme?
  • At the top of the page, there's a button marked History. Click on this button. A small window will open, with a link to Previous versions. Click on this link.

  • Click on the Compare button.

QUESTIONS
  1. When was the last change made to this record?
  2. What was the change?
  • Compare some previous records to the current record.

QUESTIONS
  1. What kinds of changes do you see?

Advanced Searches in UniProtKB

At the top of the `UniProtKB` page you've probably noticed a drop-down button marked "Advanced". This lets you combine several search filters to conduct powerful searches, and home in on the proteins of most interest to you in the UniProtKB database.

In this section, you'll use the advanced searches to identify candidate human proteins that are found in the nucleus, and have been associated with some disease activity or function.

  • Click on the UniProt logo to return to the landing page
  • Click on the "Advanced" drop-down to get the advanced searching interface

  • In the first field, select Organism [OS] with search term "Human". The dropdown will offer you several options as you type, but do not select them (you could also have entered the organism "Homo sapiens" here).
  • In the next field, keep AND on the left, and select Subcellular location with search term "nucleus". The dropdown will offer several options but, again, do not select them. At this point, allow any assertion method for the evidence code.

  • Click on the plus sign (+) to get another search term field.
  • In the new field, keep AND on the left, and select Pathology & Biotech with class "Disease", and no search term. At this point, allow any assertion method for the evidence code.

  • Click on the magnifying glass to run the search.

QUESTIONS
  1. How many results do you see, today?
  2. What are the contents of the search bar? **NOTE: this will be useful for programmatic queries, later.**
  • Click on the "Advanced" drop down. You should see that the current search populates this dialogue box.
  • Change the Evidence option for the Pathology & Biotech part of the search to manual "Experimental" evidence.

QUESTIONS
  1. How many results do you see, now?
  2. What are the contents of the search bar? **NOTE: this will be useful for programmatic queries, later.**
  • Click on the "Advanced" drop down. You should see that the current search populates this dialogue box.
  • Change the Term option for the Pathology & Biotech part of the search to "melanoma".

QUESTIONS
  1. How many results do you see, now?
  2. How are these proteins associated with melanoma?
  3. What amino acid modifications have been found for these proteins?

Downloading UniProtKB Search Results

After the search above, you should be left with a small set of proteins that satisfy the following criteria:

  • They derive from Homo sapiens
  • They are annotated as being found in the nucleus (for which we allow any form of evidence)
  • They are associated with a disease process (melanoma), and there is manually-curated experimental evidence for this association

If we would like to download these records (or those from any other search), we have a number of options, which are obtained by clicking on the download button at the top of the search results.

  • You can download all your search results, or just those selected with checkboxes
  • Results can be downloaded compressed (gzipped) or as raw records
  • Results can be downloaded as:
    • sequence data (FASTA)
    • tabular form (Excel, tab-separated)
    • computer-readable (XML, RDF)

  • Download the search results as tab-separated, text, and FASTA format files
  • Inspect the contents of these files

QUESTIONS
  1. How do the contents of these files differ?
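
One way to inspect the downloaded files is to load them in Python. The sketch below, using only the standard library, reads FASTA records and tab-separated rows, and opens gzipped downloads transparently. The function names are illustrative, and the tiny records in the demo are made up so that the code runs without any downloaded files; the column names `Entry` and `Length` are assumptions, so substitute the headers found in your own table.

```python
import csv
import gzip
import io


def open_text(path):
    """Open a plain or gzipped (.gz) file for reading as text."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def read_fasta(handle):
    """Return a list of (header, sequence) tuples from a FASTA handle."""
    records, header, chunks = [], None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def read_table(handle):
    """Return a list of dicts, one per row, keyed by the column headers."""
    return list(csv.DictReader(handle, delimiter="\t"))


if __name__ == "__main__":
    # Tiny made-up records so the sketch runs without any downloaded files.
    fasta = io.StringIO(">demo1 example protein\nMKT\nAYI\n>demo2 another\nMSS\n")
    table = io.StringIO("Entry\tLength\ndemo1\t6\ndemo2\t3\n")
    for header, sequence in read_fasta(fasta):
        print(header, len(sequence))
    for row in read_table(table):
        print(row["Entry"], row["Length"])
```

With real downloads, pass `open_text("your_file.fasta.gz")` (a placeholder filename) to `read_fasta` in place of the in-memory example. The FASTA file carries only a header line and the sequence for each record, whereas the table carries one row of annotation columns per record.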

Exercise 01 (15min)

Using the UniProtKB search tools, can you find and download sets of proteins that satisfy the following requirements?

Set 1

  • Derives from *Saccharomyces cerevisiae*
  • Is associated with a membrane
  • Is a transcriptional regulator

Set 2

  • Has a function associated with alginate
  • Is annotated with any biotechnological application

Set 3

  • Any manual annotation associated with biofuel as a biotechnological application

Set 4

  • Derives from mouse
  • Is an enzyme
  • Has not been manually reviewed
  • Is mentioned in a *Nature* publication
  • Is between 100 and 300aa in length