Clustering the subsampled 1.3 M cells

The data consist of 20K neurons, downsampled from the "1.3 Million Brain Cells from E18 Mice" dataset, and are freely available from 10x Genomics (here).


In [1]:
import numpy as np
import pandas as pd
import scanpy.api as sc

sc.settings.verbosity = 3  # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.settings.set_figure_params(dpi=70)  # dots (pixels) per inch determine size of inline figures
sc.logging.print_versions()


scanpy==1.2.2+73.g1812406 anndata==0.6.5+8.g1c05290 numpy==1.13.1 scipy==1.1.0 pandas==0.22.0 scikit-learn==0.19.1 statsmodels==0.9.0 python-igraph==0.7.1 louvain==0.6.1 

In [2]:
adata = sc.read_10x_h5('./data/1M_neurons_neuron20k.h5')


reading ./data/1M_neurons_neuron20k.h5
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
(0:00:03.45)

In [3]:
adata.var_names_make_unique()
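
As far as the AnnData documentation describes it, `var_names_make_unique` resolves the warning above by appending a numeric suffix to repeated names while leaving the first occurrence unchanged. The pure-Python sketch below mimics that behaviour on a made-up list. The helper `make_unique` and the gene symbols are hypothetical and serve only as an illustration; this is not the library's implementation, and it does not handle the corner case where a generated name collides with an existing one.

```python
def make_unique(names, join="-"):
    """Append a running number to repeated names; first occurrences stay as they are."""
    counts = {}   # how often each name has been seen so far
    unique = []
    for name in names:
        n = counts.get(name, 0)
        unique.append(name if n == 0 else f"{name}{join}{n}")
        counts[name] = n + 1
    return unique


genes = ["Gm1", "Pcdh", "Gm1", "Gm1"]  # hypothetical gene symbols with duplicates
unique_genes = make_unique(genes)
print(unique_genes)  # ['Gm1', 'Pcdh', 'Gm1-1', 'Gm1-2']
```

Unique variable names matter because later steps index genes by name, and a duplicated name would make such a lookup ambiguous.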

In [4]:
adata


Out[4]:
AnnData object with n_obs × n_vars = 20000 × 27998 
    var: 'gene_ids'

Run the standard preprocessing steps bundled in the zheng17 recipe; see here.


In [5]:
sc.pp.recipe_zheng17(adata)


running recipe zheng17
    finished (0:00:04.44)
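
As far as its documentation describes it, `recipe_zheng17` combines gene filtering, per-cell count normalization, selection of highly variable genes, a log transform, and scaling to zero mean and unit variance. The NumPy sketch below walks through a simplified version of the normalization, log, and scaling steps on a toy count matrix. It skips gene filtering and selection entirely, the matrix values are invented, and the choice of `ddof=1` is an assumption, so treat it as an illustration of the idea rather than the recipe's actual code.

```python
import numpy as np

# Toy count matrix: 4 cells (rows) x 4 genes (columns); values are made up.
counts = np.array([
    [1, 0, 3, 6],
    [2, 2, 0, 4],
    [0, 5, 5, 0],
    [4, 1, 2, 3],
], dtype=float)

# 1. Per-cell normalization: rescale every cell to the median total count.
totals = counts.sum(axis=1, keepdims=True)
target = np.median(totals)
normalized = counts / totals * target

# 2. Log transform, log(1 + x), to compress the dynamic range.
logged = np.log1p(normalized)

# 3. Scale each gene to zero mean and unit variance (ddof=1 is an assumption).
scaled = (logged - logged.mean(axis=0)) / logged.std(axis=0, ddof=1)

print(scaled.round(2))
```

After these steps every cell contributes the same total count before the log transform, and every gene has comparable weight in the subsequent PCA.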

In [6]:
sc.tl.pca(adata)

In [7]:
sc.pp.neighbors(adata)


computing neighbors
    using 'X_pca' with n_pcs = 50
    finished (0:00:09.81) --> added to `.uns['neighbors']`
    'distances', weighted adjacency matrix
    'connectivities', weighted adjacency matrix

In [8]:
sc.tl.umap(adata)


computing UMAP
    finished (0:00:17.18) --> added
    'X_umap', UMAP coordinates (adata.obsm)

In [9]:
sc.tl.louvain(adata)


running Louvain clustering
    using the "louvain" package of Traag (2017)
    finished (0:00:03.69) --> found 22 clusters and added
    'louvain', the cluster labels (adata.obs, categorical)

In [10]:
sc.tl.paga(adata)


running partition-based graph abstraction (PAGA)
    finished (0:00:01.27) --> added
    'paga/connectivities', connectivities adjacency (adata.uns)
    'paga/connectivities_tree', connectivities subtree (adata.uns)

In [19]:
sc.pl.paga_compare(adata, edges=True, threshold=0.05)


--> added 'pos', the PAGA positions (adata.uns['paga'])

Now compare this with the reference clustering from the PAGA preprint (Suppl. Fig. 12), available from here.


In [12]:
anno = pd.read_csv('/Users/alexwolf/Dropbox/1M/louvain.csv.gz', compression='gzip', header=None, index_col=0)

In [13]:
anno.columns = ['louvain_ref']

In [14]:
adata.obs['louvain_ref'] = anno.loc[adata.obs.index, 'louvain_ref'].astype(str)

In [15]:
sc.pl.umap(adata, color=['louvain_ref'], legend_loc='on data')


... storing 'louvain_ref' as categorical
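
Beyond comparing the plots by eye, one possible way to quantify the agreement between the two clusterings is a contingency table. The sketch below uses `pandas.crosstab` on a small, made-up set of labels; in this notebook the two columns would presumably be `adata.obs['louvain']` and `adata.obs['louvain_ref']`, but the values shown here are hypothetical and not results from the data.

```python
import pandas as pd

# Hypothetical cluster labels for eight cells (not taken from the dataset).
labels = pd.DataFrame({
    "louvain":     ["0", "0", "1", "1", "1", "2", "2", "2"],
    "louvain_ref": ["a", "a", "b", "b", "a", "c", "c", "c"],
})

# Rows: new clusters, columns: reference clusters, entries: number of cells.
table = pd.crosstab(labels["louvain"], labels["louvain_ref"])

# Fraction of each new cluster that falls into each reference cluster.
fractions = table.div(table.sum(axis=1), axis=0)

print(table)
print(fractions.round(2))
```

A new cluster whose row in `fractions` is dominated by a single column maps cleanly onto one reference cluster, whereas a row spread over several columns indicates that the two clusterings split those cells differently.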

In [16]:
adata.write('./write/subsampled.h5ad')