Yay Bioinformatics!

Instructors:

  • Olga Botvinnik, PhD
  • Emily Wheeler
  • Alain Domissy

Bioinformatics homework

  1. How many people started the course?
  2. How many people finished?
  3. What did you think?

Python for Genomic Data Analysis

Command Line Tools for Bioinformatics

Sticky notes

We will be using sticky notes (blue for done, red for still working) to help us instructors scan the room and see how the groups are doing.

Slack channel

Throughout the course, the best way to reach us bioinformatics instructors will be via Slack (cshl-singlecell-2017.slack.com). It'll also be helpful for asking questions during lectures or hands-on sessions. I also highly encourage you to chat with your partner there!

Why single-cell RNA-Seq? Why not proteomics, genomics, etc?

While we are primarily discussing single-cell RNA-Seq analyses, many of the same concepts apply to other single-cell 'omics methods: genomics, epigenomics, proteomics, metabolomics, etc. We're using RNA-Seq as the running example so that you don't have to keep switching units throughout the course, which minimizes cognitive overload.

Why Python 3? Why Jupyter Notebooks?

We will be using Python because it is easy to learn and, largely thanks to that ease of use, has a rich ecosystem of scientific and machine learning libraries that reaches far beyond biology. It's not as bio-specific as the R/Bioconductor ecosystem, but in my opinion, you can always jump to R/Bioconductor when necessary and read the output files in Python.

We will be using Python 3 because you are the future of science, not the past. Python 3 has much better memory management (important for huge datasets of single cells!) and has been around for almost a decade, but many scientists have not yet switched. Python 2 is deeply entrenched in the scientific world: many labs have piles of Python 2 code and won't migrate because updating code takes time.

We will be using Jupyter Notebooks (rather than a studio environment like RStudio/MATLAB/Spyder) for the analyses because they will document your entire thought process throughout the exploratory and learning phase. This way you have files on your computer that you can take home and reference in the future. In Jupyter, you can take notes in plain text and remind yourself of why you were doing this in the first place :)

Your project: Reproduce a paper

You and your partner will pick a paper of interest and, as we learn methods of analysis, reproduce the figures from the paper as best you can. You will present your results at the final student data presentation.

Homework until next time: Pick a paper and download its data (counts/expression matrix - not raw fastqs). Look for the "Accession" number in the paper and google it.
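Once you have a counts/expression matrix on disk, a first sanity check might look like the sketch below. It is only an illustration under assumptions: the tiny table, the gene and cell names, and the `read_counts` helper are all made up, and it assumes a tab-separated file with genes as rows and cells as columns (your paper's file may well be laid out differently). It uses only the Python standard library so it runs anywhere.

```python
import csv
import io

# Hypothetical toy matrix: genes as rows, cells as columns.
# For a real downloaded file you would pass open(path) instead of StringIO.
TOY_TSV = (
    "gene\tcell_1\tcell_2\tcell_3\n"
    "GeneA\t5\t0\t2\n"
    "GeneB\t0\t0\t7\n"
    "GeneC\t1\t3\t0\n"
)


def read_counts(handle):
    """Return (cell_names, {gene: [count per cell]}) from a tab-separated table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    cells = header[1:]  # first column holds the gene names
    counts = {row[0]: [int(v) for v in row[1:]] for row in reader if row}
    return cells, counts


cells, counts = read_counts(io.StringIO(TOY_TSV))

# Total counts per cell (column sums across all genes)
totals = [sum(column) for column in zip(*counts.values())]

# Fraction of entries that are exactly zero
n_values = len(cells) * len(counts)
n_zeros = sum(v == 0 for row in counts.values() for v in row)
zero_fraction = n_zeros / n_values

print(dict(zip(cells, totals)))
print(f"fraction of zeros: {zero_fraction:.2f}")
```

Checking the matrix dimensions, the per-cell totals, and how many entries are zero is a quick way to confirm you downloaded a count matrix rather than something else.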

Schedule of topics

  1. (Emily) Where did this count matrix come from?
    1. Overview of single-cell RNA-Seq analysis pipeline
    2. Hands on with cellranger pipeline (Alain)
  2. “Fishing expedition”: Take a tissue, dissociate it → ?? → subpopulations!
    1. Paper: Macosko - 2015 - “Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets” (aka “the dropseq paper”) https://www.ncbi.nlm.nih.gov/pubmed/26000488
    2. Optional Reviews: Bacher and Kendziorski - 2016 - “Design and computational analysis of single-cell RNA-sequencing experiments” https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0927-y Kolodziejczyk et al - 2015 - “The Technology and Biology of Single-Cell RNA Sequencing” http://www.cell.com/molecular-cell/fulltext/S1097-2765(15)00261-0
      1. Question: How does gene dropout affect your results? How can you use computational methods to deal with dropout?
        1. Relationships between cells
          1. Linear vs nonlinear correlation
          2. Distance metrics
          3. Hierarchical clustering
        2. Interactive exploration of dropout and clustering
        3. Reproduce Figure 5
          1. A hint of compressed sensing/Robust PCA
  3. Case vs Control
    1. Paper: Segerstolpe - 2016 - “Single-Cell Transcriptome Profiling of Human Pancreatic Islets in Health and Type 2 Diabetes” http://www.cell.com/cell-metabolism/fulltext/S1550-4131(16)30436-3
    2. Optional Review: Stegle et al - 2015 - “Computational and analytical challenges in single-cell transcriptomics” https://www.ncbi.nlm.nih.gov/pubmed/25628217
      1. Question: How can you find structure in high dimensional data?
        1. Finding structure in your data: Clustering
          1. Smushing your data via linear vs nonlinear dimensionality reduction (PCA vs t-SNE)
          2. Finding groups of cells via linear vs nonlinear clustering (KMeans vs DBSCAN)
          3. Interactive exploration of dimensionality reduction + clustering
        2. Reproduce Figure 1
  4. Molecular transformations over time (“pseudo-time”) (Advanced)
    1. Paper: Lönnberg et al - 2017 - “Single-cell RNA-seq and computational analysis using temporal mixture modelling resolves Th1/Tfh fate bifurcation in malaria” https://www.ncbi.nlm.nih.gov/pubmed/28345074
    2. Optional Reviews: Cannoodt et al - 2016 - “Computational methods for trajectory inference from single-cell transcriptomics” http://onlinelibrary.wiley.com/doi/10.1002/eji.201646347/epdf Symmons and Raj - 2016 - “What’s Luck Got to Do with It: Single Cells, Multiple Fates, and Biological Nondeterminism” https://www.ncbi.nlm.nih.gov/pubmed/27259209
      1. Question: How can cells be ordered along differentiation, e.g. from young to old, or early to late?
        1. Finding structure in your data: Regression
          1. Regression demo
          2. Discussion of "pseudotime" methods
        2. Reproduce Figure 5
  5. Perturbation (Advanced)
    1. Paper: Adamson and Norman et al - 2016 - “A Multiplexed Single-Cell CRISPR Screening Platform Enables Systematic Dissection of the Unfolded Protein Response” https://www.ncbi.nlm.nih.gov/labs/articles/27984733/
    2. Optional Reviews: Tanay and Regev - 2017 - “Scaling single-cell genomics from phenomenology to mechanism” https://www.ncbi.nlm.nih.gov/pubmed/28102262 Wagner and Klein - 2017 - “Genetic screening enters the single-cell era” http://www.nature.com/doifinder/10.1038/nmeth.4196
    3. Question:
      1. Which genes actually matter?
        1. Finding important genes via linear (SVM) and nonlinear (Decision Trees) classifiers
        2. Compressed sensing: Robust PCA
        3. Gene ontology
      2. Reproduce Figure 3
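
As a small taste of the "Linear vs nonlinear correlation" item in topic 2, here is a minimal sketch comparing Pearson correlation (linear) with Spearman rank correlation (monotonic, not necessarily linear). The data are made up, and the helper functions are hand-written for illustration using only the standard library; they are not the course's own code, and the simple `ranks` helper assumes there are no tied values.

```python
import math


def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def ranks(values):
    """Rank each value from 1..n (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up example: y grows exponentially with x, so the relationship
# is perfectly monotonic but far from a straight line.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [math.exp(v) for v in x]

r_linear = pearson(x, y)
r_rank = spearman(x, y)

print(f"Pearson:  {r_linear:.3f}")
print(f"Spearman: {r_rank:.3f}")
```

The rank-based measure sees a perfect relationship while the linear one does not, which is the kind of difference to keep in mind when comparing cells or genes.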
