Introduction to single-cell bioinformatics

Who am I?

I'm Olga Botvinnik, a fourth-year PhD candidate in Bioinformatics in Prof. Gene Yeo's lab at UC San Diego. I study alternative splicing in single cells.

What this course will cover

Specific concepts

By the end of this course, you will be able to ...

  • Describe differences between supervised and unsupervised machine learning methods
  • Interpret axes of dimensionality reduction plots (PCA, ICA, t-SNE, MDS)
  • Compare different clustering algorithms (hierarchical, $k$-means) and their advantages and disadvantages
  • Compare classification methods (SVM and decision trees) and their advantages and disadvantages
  • Compare batch effect correction methods
  • Evaluate methods for reducing technical noise and identify when they are appropriate
  • Evaluate methods for pseudotime ordering and whether a dataset satisfies the appropriate prerequisites
  • Identify key analysis branch points in single-cell papers

Conceptual understanding

Understanding comes first; tools are secondary.

We will focus on learning what different algorithms are and when you would apply them. This is more of a "theoretical bioinformatics" course than a practical one. This is on purpose: programming languages change, but math is forever! Even diamonds aren't truly forever: their spontaneous transition to graphite has a favorable free energy, albeit with a high activation energy (it takes temperatures above 1000 °C).

Philosophy

Many people say they want to learn bioinformatics but never do. The only thing that works is being confronted with an immediate need, because then they have to learn it. My goal is to light the fire of that immediate need and to give you a conceptual understanding of which algorithms do what, so that you can find tutorials and people to learn programming from. My goal is not to teach programming itself.

What this course will not cover

Sequencing depth

The appropriate sequencing depth depends on your problem, your library preparation, and other details of your experiment. If you're using unique molecular identifiers (UMIs) and sequencing only the 3' end, 100,000 reads/cell is reportedly enough. If you're studying alternative splicing and need full transcript coverage, then in my experience at least 10 million reads/cell is necessary.
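As a rough back-of-the-envelope sketch, the per-cell figures above can be turned into a total read budget for an experiment. The two depths are the ones quoted in the text; the cell count (one 96-well plate) and the helper name `total_reads` are hypothetical, chosen only for illustration.

```python
# Reads-per-cell figures quoted in the text above (rules of thumb, not guarantees).
READS_PER_CELL = {
    "3' end counting with UMIs": 100_000,
    "full-length, alternative splicing": 10_000_000,
}


def total_reads(n_cells, reads_per_cell):
    """Total reads needed to sequence n_cells at the given per-cell depth."""
    return n_cells * reads_per_cell


# Hypothetical experiment size: one 96-well plate, one cell per well.
n_cells = 96

budget = {
    name: total_reads(n_cells, depth) for name, depth in READS_PER_CELL.items()
}

for name, reads in budget.items():
    print(f"{name}: {reads:,} reads for {n_cells} cells")
```

The point is only the 100-fold gap between the two protocols: the same plate of cells needs a very different amount of sequencing depending on the question you are asking.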

Choosing a mapper/aligner or gene expression quantifier

There are many resources comparing different mapping and gene expression algorithms, so I will not be covering this.

Collapsing genes/cells on barcodes (unique molecular identifiers, UMIs)

There are several techniques for dealing with barcodes, and the exact strategy depends heavily on your protocol and library prep method, so we won't be getting into this.

Version control

Specifically, git. You did an exercise on this, and you should use version control every time you write code! The only exception is if you are a perfect programmer who always writes code exactly right the first time. In that case, you should be inventing programming languages (but still using version control).

Language wars

Python vs R vs MATLAB vs Fortran vs ...

This course is taught in Python because it's what I know best, but it could be taught in any language.

What I expect from you

  • Being present
  • Patience with the course material
    • This is difficult stuff and it takes time to wrap your head around these complex concepts
  • Participating in discussion
  • Criticism and feedback on the course structure

What you can expect (demand!) from me

  • Patience with your questions
  • Some (but not all) of the answers

Introduction to single-cell analysis

Why would you do single-cell analyses?

Single-cell analyses are primarily done to deconstruct a population (e.g. a tissue or a 10 cm plate) into its constituent parts (cells!).

Kolodziejczyk et al., Mol Cell (2015)
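To make the idea of deconstructing a population concrete, here is a minimal sketch in Python (standard library only). It simulates one hypothetical gene in a made-up population where roughly half the cells express it highly and the rest barely express it. All names and numbers (`simulate_cells`, `fraction_on`, the expression levels) are assumptions for illustration, not data from the paper cited above.

```python
import random
import statistics


def simulate_cells(n_cells=1000, fraction_on=0.5, seed=0):
    """Simulate expression of one hypothetical gene in single cells.

    Each cell is either "on" (expression around 100) or "off"
    (expression around 1). Values are clipped at zero.
    """
    rng = random.Random(seed)
    cells = []
    for _ in range(n_cells):
        if rng.random() < fraction_on:
            cells.append(max(0.0, rng.gauss(100.0, 10.0)))
        else:
            cells.append(max(0.0, rng.gauss(1.0, 0.5)))
    return cells


cells = simulate_cells()

# A bulk measurement only sees the population average.
bulk_average = statistics.mean(cells)

# Single-cell measurements reveal the two subpopulations.
n_high = sum(1 for x in cells if x > 50)
n_low = len(cells) - n_high

# How many individual cells actually look like the bulk average?
n_near_average = sum(1 for x in cells if abs(x - bulk_average) < 10)

print(f"bulk average expression: {bulk_average:.1f}")
print(f"cells with high expression: {n_high}")
print(f"cells with low expression: {n_low}")
print(f"cells within 10 units of the bulk average: {n_near_average}")
```

In this toy population the bulk average lands between the two groups, at a level that almost no individual cell actually has. That is the kind of structure a single-cell measurement can recover and a bulk measurement cannot.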

Timeline of single-cell analysis

Kolodziejczyk et al., Mol Cell (2015)

Quality control

Kolodziejczyk et al., Mol Cell (2015)

Finding subpopulations

Characterizing subpopulations

We'll be covering what's outlined in the blue boxes.