Author: Davide Cittaro
First compiled: February 9, 2018.
This is a Scanpy demo that shows how to regress cell cycle effect, following the approach showed in Seurat's vignette. As for the R example, toy dataset consists of murine hematopoietic progenitors from Nestorowa et al., Blood 2016. The files of the Seurat tutorial - used here for reasons of benchmarking - can be downloaded here. A more recent version of the dataset can be downloaded here.
In [1]:
import scanpy.api as sc
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.settings.set_figure_params(dpi=80) # low dpi (dots per inch) yields small inline figures
sc.logging.print_version_and_date()
sc.logging.print_versions_dependencies_numerics()
Load data
In [2]:
adata = sc.read_csv('./data/nestorawa_forcellcycle_expressionMatrix.txt', delimiter='\t').T
Load cell cycle genes defined in Tirosh et al, 2015. It is a list of 97 genes, represented by their gene symbol. The list here is for humans, in case of alternate organism, a list of ortologues should be compiled. There are major differences in the way Scanpy and Seurat manage data, in particular we need to filter out cell cycle genes that are not present in our dataset to avoid errors.
In [3]:
cell_cycle_genes = [x.strip() for x in open('./data/regev_lab_cell_cycle_genes.txt')]
Here we define two lists, genes associated to the S phase and genes associated to the G2M phase
In [4]:
s_genes = cell_cycle_genes[:43]
g2m_genes = cell_cycle_genes[43:]
cell_cycle_genes = [x for x in cell_cycle_genes if x in adata.var_names]
Standard filters applied. Note that we do not extract variable genes and work on the whole dataset, instead. This is because, for this demo, almost 70 cell cycle genes would not be scored as variable. Cell cycle scoring on ~20 genes is ineffective.
In [5]:
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4)
Log-transformation of data and scaling should always be performed before scoring
In [6]:
sc.pp.log1p(adata)
sc.pp.scale(adata)
We here perform cell cycle scoring. The function is actually a wrapper to sc.tl.score_gene_list
, which is launched twice, to score separately S and G2M phases. Both sc.tl.score_gene_list
and sc.tl.score_cell_cycle_genes
are a port from Seurat and are supposed to work in a very similar way.
To score a gene list, the algorithm calculates the difference of mean expression of the given list and the mean expression of reference genes. To build the reference, the function randomly chooses a bunch of genes matching the distribution of the expression of the given list.
Cell cycle scoring adds three slots in data, a score for S phase, a score for G2M phase and the predicted cell cycle phase.
In [7]:
sc.tl.score_genes_cell_cycle(adata, s_genes=s_genes, g2m_genes=g2m_genes)
Here comes another difference from Seurat. The R package stores raw data, scaled data and variable genes information in separate slots, Scanpy instead keeps only one snapshot of the data. This implies that PCA is always calculated on the entire dataset. In order to calculate PCA reduction using only a subset of genes (like cell_cycle_genes
), a trick should be used.
Basically we create a dummy object to store information of PCA projection, which is then reincorporated into original dataset.
In [8]:
adata_cc_genes = adata[:, cell_cycle_genes]
sc.tl.pca(adata_cc_genes)
sc.pl.pca_scatter(adata_cc_genes, color='phase')
As in the original vignette, cells can be easily separated by their cell cycle status when cell cycle genes are used. Now we can regress out both S score and G2M score.
In [9]:
sc.pp.regress_out(adata, ['S_score', 'G2M_score'])
sc.pp.scale(adata)
Finally, we reproject dataset using cell cycle genes again. Since we regressed the scores, no effect of cell cycle is now evident.
In [10]:
adata_cc_genes = adata[:, cell_cycle_genes]
sc.tl.pca(adata_cc_genes)
sc.pl.pca_scatter(adata_cc_genes, color='phase')