Databases and Tools on the Internet

Publicly available databases are
expanding and improving at an incredible rate

  • Many publicly available databases exist, and many more are being added
  • They help answer questions ranging from RNA and/or protein expression to identifying relevant literature
  • They range from easy to use to requiring a strong programming background
  • Let's see which questions these databases can help you answer, and how

What can they do for you?

  • Why: What kind of questions can this resource help you answer?
  • What: What is the resource made up of, and what data does it contain and expose to you in a digested way?
  • Where: Where to access the resource
  • How: How to use the resource. I’ll go through a couple of example questions that each resource can answer

Genotype-Tissue Expression (GTEx)

  • Why:
    • Get an estimate of RNA expression of a gene/isoform in a tissue of interest
    • See whether any non-coding mutations are associated with a change in expression
  • What:
    • Expression across ~53 tissues and 544 donors
    • Latest eQTL analysis for all tissues
    • Intuitive, responsive browser for eQTLs across the genome and for gene-, isoform-, and exon-level expression
  • Where:
  • How:

neXtProt

  • Why:
    • Need to get a better understanding of a protein
    • Where is it expressed?
    • What does it look like?
    • What might it be doing?
  • What:
    • Amalgamation of many data dumps, such as the Human Proteome
    • Interface can be taxing on your computer and may slow down or crash
    • Very good way to get an overview of what’s known about a protein’s function
  • Where:
  • How:

UniProt

  • Why:
    • Need to fetch protein sequences
    • What are its general structural characteristics?
    • Want to find homologous proteins
    • What does the protein's 3D structure look like?
    • What post-translational modifications does it undergo?
  • What:
    • Lots of mass spec data, phylogenetic relationships, isoform and gene level data
    • Many phylogenetic databases; I've had good luck with InParanoid and eggNOG
  • Where:
  • How:
    • Just search for your protein and scroll down!
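
For fetching protein sequences without the browser, a minimal sketch follows. It assumes UniProt's REST endpoint pattern `https://rest.uniprot.org/uniprotkb/<accession>.fasta`; the helper names `parse_fasta` and `fetch_uniprot_sequence` are made up for this example and are not part of any UniProt library.

```python
from urllib.request import urlopen

# Assumed endpoint pattern for a single UniProtKB entry in FASTA format.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def fetch_uniprot_sequence(accession, timeout=30):
    """Download one entry as FASTA and return its (header, sequence)."""
    url = UNIPROT_FASTA_URL.format(accession=accession)
    with urlopen(url, timeout=timeout) as response:
        text = response.read().decode("utf-8")
    records = parse_fasta(text)
    if not records:
        raise ValueError(f"No FASTA record returned for {accession!r}")
    return records[0]
```

With network access, `header, sequence = fetch_uniprot_sequence("P04637")` should return the human TP53 entry. The parser is kept separate from the download so it can be reused on FASTA files saved from the website.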

StringDB

  • Why:
    • What proteins have been shown to associate with my protein?
    • What gets pulled down with it?
    • Has it been investigated in the context of another protein?
  • What:
    • A curated database of protein-protein interactions
    • Includes natural language processing of PubMed to find co-mentioned proteins and/or concepts
    • Includes and uses knowledge across species
  • Where:
  • How:
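
The "what associates with my protein?" question can also be asked programmatically. The sketch below assumes STRING's public API at `https://string-db.org/api` with its `interaction_partners` method and the `identifiers`, `species`, and `limit` parameters; the function names are illustrative, and the column names in a real response should be checked against the API documentation.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

STRING_API_BASE = "https://string-db.org/api"  # assumed base URL


def string_partners_url(identifier, species=9606, limit=10, output_format="tsv"):
    """Build the request URL for a protein's interaction partners.

    species is an NCBI taxonomy ID (9606 = human).
    """
    params = {"identifiers": identifier, "species": species, "limit": limit}
    return f"{STRING_API_BASE}/{output_format}/interaction_partners?{urlencode(params)}"


def parse_partners_tsv(text):
    """Turn a TSV response into a list of dicts keyed by the header row."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    columns = lines[0].split("\t")
    return [dict(zip(columns, line.split("\t"))) for line in lines[1:]]


def fetch_partners(identifier, species=9606, limit=10, timeout=30):
    """Query STRING (needs network access) and return the parsed rows."""
    url = string_partners_url(identifier, species, limit)
    with urlopen(url, timeout=timeout) as response:
        return parse_partners_tsv(response.read().decode("utf-8"))
```

For example, `fetch_partners("TP53", limit=5)` would request the five top-scoring partners for human TP53. All values come back as strings, so convert scores with `float()` before sorting or filtering.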

Sequence Read Archive (SRA)

  • Why:
    • Need to fetch sequencing data from a paper
  • What:
    • Most papers that generate any sequencing data are required to deposit it here. It is stored as SRA files.
    • Petabases of information
  • Where:
  • How:
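
One common route for pulling a paper's sequencing data is NCBI's sra-tools command-line suite, which is an assumption here since the slides do not name a tool. The sketch below only builds and runs the two commands (`prefetch` to download the SRA file, `fasterq-dump` to convert it to FASTQ); it assumes sra-tools is installed and on your PATH, and the function names are made up for this example.

```python
import shlex
import subprocess

# Run accessions issued by NCBI, ENA and DDBJ respectively.
VALID_PREFIXES = ("SRR", "ERR", "DRR")


def sra_download_commands(run_accession, outdir="fastq"):
    """Return the sra-tools commands needed to get FASTQ files for one run."""
    if not run_accession.startswith(VALID_PREFIXES):
        raise ValueError(f"{run_accession!r} does not look like a run accession")
    return [
        # Step 1: download the .sra file for the run.
        ["prefetch", run_accession],
        # Step 2: convert it to FASTQ, one file per read of a pair.
        ["fasterq-dump", run_accession, "--split-files", "--outdir", outdir],
    ]


def download_run(run_accession, outdir="fastq"):
    """Execute the commands in order, stopping at the first failure."""
    for command in sra_download_commands(run_accession, outdir):
        print("Running:", shlex.join(command))
        subprocess.run(command, check=True)
```

Calling `download_run("SRR000001")` would download and convert that run; the accession is only a placeholder, so substitute the run IDs listed in the paper's data availability statement.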

Gene Expression Omnibus (GEO)

Meta

  • Why:
    • Literature search
  • What:
    • AI-organized PubMed
    • Can follow and combine "streams" into customized feeds
    • Can quickly find influential papers
    • A good starting point for new genes
  • Where:
  • How:

PRoteomics IDEntifications (PRIDE)

  • Why:
    • You have found a particular protein or peptide and want the raw data
    • Need to rerun MS/MS analysis to quantify a different set of peptides
  • What:
    • Database dump of MS/MS spectra for a large volume of experiments, similar to SRA for sequencing data
  • Where:
  • How:

Harmonizome

  • Why:
    • Need to get a general idea of a gene
    • Curious about what medical issues have been associated with it
    • What complexes has it been found to be a part of?
    • What drugs have been shown to affect its behaviour?
  • What:
    • "Search for genes or proteins and their functional terms extracted and organized from over a hundred publicly available resources"
    • X = gene, Y = database
    • Includes clinical information and pharmacological information related to the query gene as well
    • Awesome mobile interface!
    • http://amp.pharm.mssm.edu/Harmonizome/about
  • Where:
  • How:
    • Simply search for your gene and click through the available databases for that given gene (LOTS of stuff!)

ENCODE

UCSC Xena Browser

  • Why:
    • Need to look at expression differences between publicly available cohorts, such as cancer patients or geographic populations
    • Don't want to process the data or make the raw graphs yourself

In [7]:
from IPython.display import YouTubeVideo
YouTubeVideo("TSNc-EDjix4", start=0, autoplay=1, theme="light", color="red")


Out[7]:
  • What:
  • Where:
  • How:

In [8]:
YouTubeVideo("go38U6iLjsw", start=0, autoplay=1, theme="light", color="red")


Out[8]:

Ensembl

  • Why:
    • How far back is your gene conserved?
    • Is it lost in some species?
    • What is its synteny in humans and in other species?
    • Is the protein conserved? How far?
    • What different isoforms exist?
  • What:
    • HUGE number of different organisms and genome data
    • Much integrated knowledge surrounding phylogenetics and transcriptomics
    • The amount of accessible data is enormous
    • Built-in genome browser which can be used to compare a huge number of tracks
    • Can compare the genome browser view across many species to check for things such as conserved cis-regulatory elements and/or synteny
  • Where:
  • How:
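
The isoform question above can be answered from a script as well as from the browser. This is a minimal sketch assuming the Ensembl REST service at `https://rest.ensembl.org` and its `lookup/symbol` endpoint with `expand=1`, which is expected to return the gene record with a `Transcript` list; the helper names are illustrative.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

ENSEMBL_REST = "https://rest.ensembl.org"  # assumed public REST server


def lookup_symbol_url(symbol, species="homo_sapiens", expand=True):
    """Build the lookup URL for a gene symbol in a given species."""
    url = f"{ENSEMBL_REST}/lookup/symbol/{quote(species)}/{quote(symbol)}"
    # expand=1 asks the server to include the gene's transcripts.
    return url + ("?expand=1" if expand else "")


def transcript_ids(gene_record):
    """List transcript (isoform) IDs from an expanded gene record."""
    return [transcript["id"] for transcript in gene_record.get("Transcript", [])]


def lookup_symbol(symbol, species="homo_sapiens", timeout=30):
    """Fetch the gene record as a dict (needs network access)."""
    request = Request(
        lookup_symbol_url(symbol, species),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `transcript_ids(lookup_symbol("BRCA2"))` should list the Ensembl transcript IDs annotated for human BRCA2. Changing `species` (for instance to `mus_musculus`) gives a quick check of whether the gene symbol is annotated in another organism.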

MobiDB

  • Why:
    • Does your protein have high mobility domains (does it wiggle lots)?
  • What:
    • A combination of curated data (provided directly by DisProt) and indirect PDB NMR/X-ray sources, amalgamated into a more digestible and encompassing database
  • Where:
  • How:

The Human Protein Atlas

  • Why:
    • Where is your protein expressed?
    • Is it upregulated in specific cancers?
    • What does the histology look like?
  • What:
    • LOTS of staining of tissue slides.
    • Integrates well with protein levels from mass spec proteomics sources
  • Where:
  • How: