First compiled: May 3, 2017. Updated Mar 28, 2018.
See the same notebook for Scanpy 0.2.7.
This is the Scanpy benchmark of the Cell Ranger R kit analysis of Zheng et al., Nat. Commun. (2017), available from here. Compare this Scanpy version of the notebook with the Cell Ranger version.
The data is freely available from the 10x homepage and from GitHub.
A more pedagogic tutorial can be found here.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
import scanpy.api as sc
sc.settings.verbosity = 1 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.settings.set_figure_params(dpi=80) # low dpi (dots per inch) yields small inline figures
sc.logging.print_versions()
results_file = './write/zheng17.h5ad'
Initial memory usage.
In [2]:
sc.logging.print_memory_usage()
Only use the first n cells; set this to None to use all cells.
In [3]:
use_first_n_observations = None
Load the data. This takes a long time only when first reading the raw data from the .mtx text file. Afterwards, reading from the ./cache/ directory is very fast (change the cache location via sc.settings.cachedir).
In [4]:
%%time
path = './data/zheng17_filtered_matrices_mex/hg19/'
adata = sc.read(path + 'matrix.mtx', cache=True).T # transpose the data
adata.var_names = pd.read_csv(path + 'genes.tsv', header=None, sep='\t')[1]
adata.obs_names = pd.read_csv(path + 'barcodes.tsv', header=None)[0]
In [5]:
adata.var_names_make_unique()
Let us use the cell type labels generated by Zheng et al. by correlating gene expression with purified bulk data. You can download the bulk labels here and place them at ./data/zheng17_bulk_lables.txt.
In [6]:
adata.obs['bulk_labels'] = pd.read_csv('./data/zheng17_bulk_lables.txt', header=None)[0].values
Optionally reduce the number of observations to study how runtimes and memory scale.
In [7]:
adata = adata[:use_first_n_observations]
Save the logarithmized raw data for differential expression testing and plotting.
In [8]:
sc.pp.log1p(adata, copy=True).write('./write/zheng17_raw.h5ad')
In [10]:
adata
Instead of calling all steps of this preprocessing section separately, you can call sc.pp.recipe_zheng17(adata).
Per-cell normalize the data matrix $X$ and identify highly-variable genes.
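To make the normalization step concrete, here is a minimal NumPy sketch on a toy counts matrix. It scales each cell so its total count equals the median of the per-cell totals, which we assume mirrors the default behavior of sc.pp.normalize_per_cell; the matrix values are made up for illustration.

```python
import numpy as np

# toy counts matrix: 3 cells (rows) x 4 genes (columns)
X = np.array([[2., 0., 4., 2.],
              [1., 1., 0., 2.],
              [0., 3., 3., 6.]])

# per-cell normalization: rescale each cell (row) so that its total
# count equals the median of the per-cell totals
counts_per_cell = X.sum(axis=1)        # totals: [8, 4, 12]
target = np.median(counts_per_cell)    # median total: 8
X_norm = X / counts_per_cell[:, None] * target

print(X_norm.sum(axis=1))  # every cell now sums to the median total, 8
```

After this step, differences between cells reflect relative gene expression rather than sequencing depth.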
In [9]:
%%time
sc.pp.filter_genes(adata, min_counts=1) # only consider genes with at least 1 count
sc.pp.normalize_per_cell(adata) # normalize with total UMI count per cell
filter_result = sc.pp.filter_genes_dispersion(
adata.X, flavor='cell_ranger', n_top_genes=1000, log=False)
sc.logging.print_memory_usage()
Plot the dispersion relation. Use a logarithmic scale as the data is not yet logarithmized.
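For intuition on what the dispersion plot shows: a gene's dispersion is essentially its variance-to-mean ratio (the cell_ranger flavor additionally normalizes this within bins of similar mean expression). A bare-bones sketch on simulated counts, with all data made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.poisson(lam=2.0, size=(200, 30)).astype(float)  # toy cells x genes

# per-gene mean and variance across cells
mean = X.mean(axis=0)
var = X.var(axis=0)
dispersion = var / mean  # variance-to-mean ratio per gene

# highly-variable genes are those with the largest dispersion
top = np.argsort(dispersion)[::-1][:10]
print(top.shape)  # indices of the 10 most variable genes
```

Genes with high dispersion relative to genes of similar mean expression are kept as "highly variable" and carry most of the biological signal.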
In [10]:
sc.pl.filter_genes_dispersion(filter_result, log=True)
Normalize the data again, logarithmize and scale the data. Then compute the PCA.
In [11]:
%%time
adata = adata[:, filter_result.gene_subset] # filter genes
sc.pp.normalize_per_cell(adata) # need to redo normalization after filtering
sc.pp.log1p(adata) # log transform: X = log(X + 1)
sc.pp.scale(adata)
# the PCA is *not* contained in the recipe sc.pp.recipe_zheng17(adata)
sc.tl.pca(adata, n_comps=50)
sc.logging.print_memory_usage()
Plot the PCA results.
In [14]:
sc.pl.pca_loadings(adata)
This constructs the single-cell neighborhood graph, usually a kNN graph, that relates each cell to its nearest neighbors.
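As a rough illustration of the kNN lookup underlying this step, here is a brute-force sketch over toy PCA coordinates (the actual sc.pp.neighbors implementation is much faster and also computes connectivity weights; the data here is random):

```python
import numpy as np

rng = np.random.default_rng(0)
pcs = rng.normal(size=(100, 50))  # toy "PCA coordinates": 100 cells, 50 components

# brute-force kNN: for each cell, find the k nearest other cells
# by Euclidean distance
k = 15
d2 = ((pcs[:, None, :] - pcs[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
np.fill_diagonal(d2, np.inf)                             # a cell is not its own neighbor
knn = np.argsort(d2, axis=1)[:, :k]                      # indices of the k nearest cells

print(knn.shape)  # one row of k neighbor indices per cell
```

Downstream steps such as UMAP and Louvain clustering operate on this graph rather than on the raw expression matrix.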
In [15]:
%%time
sc.pp.neighbors(adata)
sc.logging.print_memory_usage()
In [16]:
%%time
sc.tl.umap(adata)
sc.logging.print_memory_usage()
In [23]:
sc.pl.umap(adata, color='bulk_labels')
Instead of providing a simple k-means clustering as in Cell Ranger, we use the Louvain algorithm of Blondel et al. (2008), as suggested by Levine et al. (2015) for single-cell analysis.
In [21]:
%%time
sc.tl.louvain(adata, resolution=0.3)
sc.logging.print_memory_usage()
Plot the clusters.
In [22]:
sc.pl.umap(adata, color='louvain')
Use the logarithmized raw data to rank genes according to differential expression.
In [24]:
adata.raw = sc.read('./write/zheng17_raw.h5ad')
In [25]:
%%time
sc.tl.rank_genes_groups(adata, 'louvain', groups=['1', '6', '9', '7'])
sc.logging.print_memory_usage()
In [26]:
sc.pl.rank_genes_groups(adata, n_genes=30)
We do not want to save the raw data in the same file; to avoid this, set adata.raw to None again.
In [6]:
adata.raw = None
Compute tSNE. Even on a single core, this can be sped up significantly by installing https://github.com/DmitryUlyanov/Multicore-TSNE, which Scanpy detects automatically.
In [21]:
%%time
sc.tl.tsne(adata)
sc.logging.print_memory_usage()
In [22]:
sc.pl.tsne(adata, color='bulk_labels', legend_loc='on data')
In [23]:
adata.write(results_file)