Databases and Tools on the Internet

Publicly available databases are expanding and improving at an incredible rate

Many publicly available databases exist, and many more are being added
They help answer questions ranging from RNA and/or protein expression to identifying relevant literature
They range from easy to use to requiring a strong programming background
Let's see which questions these databases can help you answer, and how

What can they do for you?

Why: What kinds of questions can this resource help you answer?
What: What is the resource made up of, and what data does it contain and expose to you in a digested way?
Where: Where to access the resource
How: How to use the resource. I'll go through a couple of example questions that can be answered with each resource

Genotype-Tissue Expression (GTEx)

Why:
- Get an estimate of RNA expression of a gene/isoform in a tissue of interest
- See whether any non-coding mutations are associated with a change in expression
What:
- Expression across ~53 tissues and 544 donors
- Latest eQTL analysis for all tissues
- Intuitive, responsive browser for eQTLs across the genome and for gene-, isoform-, and exon-level expression
Where:
- http://www.gtexportal.org/
How:
- https://www.gtexportal.org/home/documentationPage

Nextprot

Why:
- Need to get a better understanding of a protein
- What and where is it expressed?
- What does it look like?
- What might it be doing?
What:
- Amalgamation of many data dumps, such as the Human Proteome
- Interface can be taxing on your computer and may slow down or crash
- Very good way to get an overview of what’s known about a protein’s function
Where:
- http://www.nextprot.org/
How:
- https://www.nextprot.org/help/simple-search

Uniprot

Why:
- Need to fetch protein sequences
- What are its general structural characteristics?
- Want to find homologous proteins
- What does the protein's 3D structure look like?
- What post-translational modifications does it undergo?
What:
- Lots of mass spec data, phylogenetic relationships, isoform and gene level data
- Links to many phylogenetic databases; I've had good luck with InParanoid and eggNOG
Where:
- http://www.uniprot.org/
How:
- Just search for your protein and scroll down!
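
For fetching protein sequences from a script rather than the website, UniProt also serves entries over a REST interface. The following is a minimal sketch, assuming the `https://rest.uniprot.org/uniprotkb/<accession>.fasta` URL pattern. The helper names are hypothetical, and any accession you pass in is your own choice.

```python
from urllib.request import urlopen

# Assumed base URL of the UniProt REST service
UNIPROT_REST = "https://rest.uniprot.org/uniprotkb"


def fasta_url(accession):
    """Build the FASTA download URL for one UniProt accession."""
    return f"{UNIPROT_REST}/{accession}.fasta"


def parse_fasta(text):
    """Parse FASTA text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; drop the leading ">"
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines are concatenated under the current header
            records[header] += line
    return records


def fetch_sequence(accession):
    """Download and parse one entry (needs network access)."""
    with urlopen(fasta_url(accession)) as response:
        return parse_fasta(response.read().decode("utf-8"))
```

Calling `fetch_sequence` with the accession shown on a protein's UniProt page should return a dictionary with a single header and its sequence.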

StringDB

Why:
- What proteins have been shown to associate with my protein?
- What gets pulled down with it?
- Has it been investigated in the context of another protein?
What:
- A curated database of protein-protein interactions
- Includes natural language processing of PubMed to find co-mentioned proteins and/or concepts
- Includes and uses knowledge across species
Where:
- https://string-db.org/
How:
- https://string-db.org/cgi/help.pl
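
The "what associates with my protein?" question can also be asked programmatically. This is a rough sketch, assuming STRING's `api/tsv/interaction_partners` endpoint with its `identifiers`, `species`, and `limit` parameters. The function names are hypothetical, and 9606 is the NCBI taxonomy ID for human.

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the STRING API
STRING_API = "https://string-db.org/api"


def partners_url(gene, species=9606, limit=10):
    """Build a request URL for the top interaction partners of one gene."""
    query = urlencode({"identifiers": gene, "species": species, "limit": limit})
    return f"{STRING_API}/tsv/interaction_partners?{query}"


def parse_tsv(text):
    """Turn a tab-separated response into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def fetch_partners(gene, species=9606, limit=10):
    """Download and parse the partner table (needs network access)."""
    with urlopen(partners_url(gene, species, limit)) as response:
        return parse_tsv(response.read().decode("utf-8"))
```

Each returned row is keyed by the column names in the response header, so the exact fields available depend on what STRING sends back.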

Sequence Read Archive (SRA)

Why:
- Need to fetch sequencing data from a paper
What:
- Most papers that generate any sequencing data are required to deposit it here. It is stored as SRA files.
- Petabases of information
Where:
- https://trace.ncbi.nlm.nih.gov/Traces/sra/
How:
- this pipeline shows how to pull and process reads from SRA
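
Separately from the pipeline mentioned above, one common way to pull reads is the SRA Toolkit's `prefetch` and `fasterq-dump` commands. The sketch below only assembles and runs those two commands. It assumes the toolkit is installed and on your PATH, and the function names, output directory, and thread count are illustrative choices.

```python
import subprocess


def sra_commands(run_accession, outdir="fastq", threads=4):
    """Return the SRA Toolkit commands to download one run and convert it to FASTQ."""
    return [
        # Download the .sra file for this run
        ["prefetch", run_accession],
        # Convert to FASTQ, writing paired reads to separate files
        ["fasterq-dump", run_accession, "--split-files",
         "--threads", str(threads), "--outdir", outdir],
    ]


def fetch_run(run_accession, outdir="fastq", threads=4):
    """Run both commands in order, stopping on the first failure."""
    for command in sra_commands(run_accession, outdir, threads):
        subprocess.run(command, check=True)
```

Run accessions (the SRR identifiers) are listed on the SRA page for each study.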

Gene Expression Omnibus (GEO)

Why:
- Need to fetch quantified values such as WIG tracks, FPKM tables, BED peaks...
What:
- Repository of processed data, usually a more quantified version of what is in SRA
- Things such as FPKM values and BED peak files
- Also a big repository for microarray data
Where:
- https://www.ncbi.nlm.nih.gov/geo/
How:
- https://www.ncbi.nlm.nih.gov/geo/info/download.html
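
GEO series files sit in a predictable directory layout, which makes bulk downloads easy to script. A small sketch, assuming the layout where the last three digits of the series number are replaced by `nnn` in the parent directory name. The function name is hypothetical.

```python
# Assumed root of the GEO series download area
GEO_SERIES_ROOT = "https://ftp.ncbi.nlm.nih.gov/geo/series"


def geo_series_url(accession, subdir="suppl"):
    """Build the download directory URL for a GEO series (GSE) accession.

    subdir is typically "suppl" (supplementary files) or "matrix"
    (series matrix files).
    """
    digits = accession[3:]
    if not accession.startswith("GSE") or not digits.isdigit():
        raise ValueError(f"not a GSE accession: {accession!r}")
    # Parent directory: drop the last three digits and append "nnn"
    stub = "GSE" + digits[:-3] + "nnn"
    return f"{GEO_SERIES_ROOT}/{stub}/{accession}/{subdir}/"
```

The resulting URL points at a directory listing, so you still need to pick the specific file you want from it.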

PRoteomics IDEntifications (PRIDE)

Why:
- You have found a particular protein or peptide and want the raw data
- Need to rerun MS/MS analysis to quantify a different set of peptides
What:
- Database dump of MS/MS spectra for a large volume of experiments, similar to SRA for sequencing data
Where:
- https://www.ebi.ac.uk/pride/archive/
How:

Harmonizome

Why:
- Need to get a general idea of a gene
- Curious about what medical issues have been associated with it
- What complexes has it been found to be a part of?
- What drugs have been shown to affect its behaviour?
What:
- "Search for genes or proteins and their functional terms extracted and organized from over a hundred publicly available resources"
- X = gene, Y = database
- Includes clinical information and pharmacological information related to the query gene as well
- Awesome mobile interface!
- http://amp.pharm.mssm.edu/Harmonizome/about
Where:
- http://amp.pharm.mssm.edu/Harmonizome/
How:
- Simply search for your gene and click through the available databases for that given gene (LOTS of stuff!)

Encode

Why:
- Need to find a ChIP-seq, RNA-seq, DNase-seq/ATAC-seq, etc. genome track to look at a particular locus in a particular cell type
What:
- Database of mostly ChIP-seq, RNA-seq, and chromatin accessibility processed files
Where:
- https://www.encodeproject.org/
How:
- https://www.encodeproject.org/help/getting-started/
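
The same search you would run in the ENCODE portal can be scripted. The sketch below assumes the portal's `/search/` endpoint returns JSON when given `format=json`, with hits listed under the `@graph` key. The function names and the default limit are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed base URL of the ENCODE portal
ENCODE = "https://www.encodeproject.org"


def search_url(term, limit=5):
    """Build a JSON search URL for experiments matching a free-text term."""
    query = urlencode({
        "type": "Experiment",
        "searchTerm": term,
        "limit": limit,
        "format": "json",
    })
    return f"{ENCODE}/search/?{query}"


def accessions(result):
    """Pull experiment accessions out of a parsed search response."""
    return [hit["accession"] for hit in result.get("@graph", []) if "accession" in hit]


def search(term, limit=5):
    """Run the search and return matching accessions (needs network access)."""
    request = Request(search_url(term, limit), headers={"Accept": "application/json"})
    with urlopen(request) as response:
        return accessions(json.load(response))
```

Each accession can then be opened in the portal to find the processed track files for that experiment.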

UCSC Xena Browser

Why:
- Need to look at expression differences between publicly available cohorts, such as cancer patients or geographic populations
- Don't want to process the data or make the raw graphs yourself



In [7]:

    
# Embed a YouTube video in the notebook output
from IPython.display import YouTubeVideo
YouTubeVideo("TSNc-EDjix4", start=0, autoplay=1, theme="light", color="red")

Where:
- https://xenabrowser.net/
How:



In [8]:

    
# Reuses the YouTubeVideo import from the previous cell
YouTubeVideo("go38U6iLjsw", start=0, autoplay=1, theme="light", color="red")

Ensembl

Why:
- How far back is your gene conserved?
- Is it lost in some species?
- What does its synteny look like in humans and in other species?
- Is the protein conserved? How far?
- What different isoforms exist?
What:
- Genome data for a HUGE number of different organisms
- Much integrated knowledge surrounding phylogenetics and transcriptomics
- The amount of accessible data is enormous
- Built-in genome browser which can be used to compare a huge number of tracks
- Can compare synteny across many species in the genome browser view, which lets you check for things such as conserved cis-regulatory elements
Where:
- http://www.ensembl.org/index.html
How:
- http://www.ensembl.org/info/website/index.html
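
The "what different isoforms exist?" question can also be answered through the Ensembl REST service. A minimal sketch, assuming the `lookup/symbol` endpoint at `https://rest.ensembl.org` and assuming that `expand=1` adds a `Transcript` list to the returned gene record. The helper names are hypothetical.

```python
import json
from urllib.request import Request, urlopen

# Assumed base URL of the Ensembl REST service
ENSEMBL_REST = "https://rest.ensembl.org"


def lookup_url(symbol, species="homo_sapiens"):
    """Build a gene lookup URL that also asks for the gene's transcripts."""
    return f"{ENSEMBL_REST}/lookup/symbol/{species}/{symbol}?expand=1"


def transcript_ids(gene_record):
    """List transcript IDs from a parsed gene record (empty if none are present)."""
    return [transcript["id"] for transcript in gene_record.get("Transcript", [])]


def fetch_transcripts(symbol, species="homo_sapiens"):
    """Look up a gene and return its transcript IDs (needs network access)."""
    request = Request(lookup_url(symbol, species),
                      headers={"Content-Type": "application/json"})
    with urlopen(request) as response:
        return transcript_ids(json.load(response))
```

Passing a different species name to `fetch_transcripts` gives a quick way to compare isoform counts between organisms.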

MobiDB

Why:
- Does your protein have high mobility domains (does it wiggle lots)?
What:
- Combines curated annotations (provided directly by DisProt) with indirect evidence from PDB NMR/X-ray structures into a more digestible and comprehensive database
Where:
- http://mobidb.bio.unipd.it/about
How:
- Example: search for CTCF

The Human Protein Atlas

Why:
- Where is your protein expressed?
- Is it upregulated in specific cancers?
- What does the histology look like?
What:
- LOTS of staining of tissue slides.
- Integrates well with protein levels from mass proteomics sources
Where:
- http://www.proteinatlas.org/
How:
- Example: search for CTCF
