This notebook details the usage of SCAnalysis for single cell RNA-seq data.
To view directly: https://nbviewer.jupyter.org/github/helenjin/scanalysis/blob/master/notebooks/SCAnalysis.ipynb (ignore if already here)
SCAnalysis is a package for analyzing single cell data. It includes the Wishbone, MAGIC, and Palantir packages:
Wishbone is an algorithm to identify bifurcating developmental trajectories from single cell data. Wishbone can applied to single cell RNA-seq (not currently supported for mass cytometry datasets)
MAGIC (Markov-Affinity Based Graph Imputation of Cells) is an interactive tool to impute missing values in single-cell data and restore the structure of the data. It also provides data preprocessing functionality such as dimensionality reduction and gene expression visualization.
Palantir
First, import the package.
In [1]:
import scanalysis
Then, you can load the data using the load function in the loadsave file of the io folder. Here, we will be using the sample_scseq_data.csv data provided in the data folder as an example.
In [2]:
df = scanalysis.io.loadsave.load("~/scanalysis/data/sample_scseq_data.csv")
Also, import plotting and miscellaneous.
In [16]:
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
%matplotlib inline
In [4]:
fig, ax = scanalysis.plots.plot.plot_molecules_per_cell_and_gene(df)
From these histograms, choose the appropriate cutoffs to filter the data. In this case, the data has already been filtered.
In [5]:
# Minimum molecules/cell value
CELL_MIN = 0
# Maximum molecules/cell values
CELL_MAX = 1000000
# Minimum number of nonzero cells/gene
# (None if no filtering desired)
GENE_NONZERO = None
# Minimum number of molecules/gene
# (None if no filtering desired)
GENE_MOLECULES = None
In [6]:
df = scanalysis.io.preprocess.filter_scseq_data(df, filter_cell_min=CELL_MIN, filter_cell_max=CELL_MAX,
filter_gene_nonzero=GENE_NONZERO, filter_gene_mols=GENE_MOLECULES)
In [7]:
data = scanalysis.io.preprocess.normalize_scseq_data(df)
The first step in data processing for Wishbone is to determine metagenes using principal component analysis. This representation is necessary to overcome the extensive dropouts that are pervasive in single cell RNA-seq data.
For a visual representation of PCA results, see PCA visualization. However, note that the PCA visualization functions already run PCA within themselves, so there is no need to run PCA separately beforehand.
In [8]:
r1, r2 = scanalysis.utils.pca.run_pca(data)
temp is the data after PCA is run on it.
In [9]:
import wishbone
import os
scdata = wishbone.wb.SCData.from_csv(os.path.expanduser('~/.wishbone/data/sample_scseq_data.csv'), data_type='sc-seq', normalize=True)
scdata.run_pca()
In [10]:
from copy import deepcopy
import numpy as np
import pandas as pd
n_pca_components = 5
temp = deepcopy(scdata.data)
temp -= np.min(np.ravel(temp))
temp /= np.max(np.ravel(temp))
temp = pd.DataFrame(np.dot(temp, scdata.pca['loadings'].iloc[:, 0:n_pca_components]),
index=scdata.data.index)
Diffusion maps is a non-linear dimensionality reduction technique to denoise the data and capture the major axes of variation. Diffusion maps can be determined by using the run_diffusion_map function and the diffusion components visualized on tSNE maps using plot_diffusion_components. See Diffusion map visualization
Note: PCA must be run separately on data before diffusion maps (i.e. PCA is not included in diffusion maps function)
In [11]:
tempEigvec, tempEigval = scanalysis.utils.diffusionmap.run_diffusion_map(temp)
Note: PCA must be run separately on data before tSNE (i.e. PCA is not included in tSNE function)
For a visual representation of tSNE results, see tSNE visualization
Here, temp data has already been run through PCA, so we can simply apply tSNE:
In [12]:
t = scanalysis.utils.tsne.TSNE()
d = t.fit_transform(temp)
In [13]:
t1 = scanalysis.utils.tsne.TSNE()
d1 = t1.fit_transform(r1)
Data can be saved to a pickle file and loaded using the save and load functions. For example, data can be saved as "mouse_marrow_scdata.p" (which will appear in the notebooks folder of scanalysis)
In [14]:
scanalysis.io.loadsave.save(data, 'mouse_marrow_scdata.p')
Note: Run the plot_pca_variance_explained function WITHOUT running PCA on the data beforehand, since PCA will be run automatically.
Results shown below for plot_pca_variance_explained_v1, which is Wishbone's version of the function.
In [15]:
fig, ax = scanalysis.plots.plot.plot_pca_variance_explained_v1(data, n_components=40, random=True)
Results shown below for plot_pca_variance_explained_v2, which is MAGIC's version of the function.
In [16]:
fig, ax = scanalysis.plots.plot.plot_pca_variance_explained_v2(data, n_components=40, random=True)
Wishbone uses tSNE for visualization and tSNE can be run using the run_tsne function which takes the number of principal components as the parameter. From the above plot, 5 seems an appropriate number of components to use.
tSNE results can be visualized by the plot_tsne and plot_tsne_by_cell_sizes functions. The plot_tsne_by_cell_sizes function colors the cells by their molecule counts before normalization.
In [17]:
fig, ax = scanalysis.plots.plot.plot_tsne(d1)
In [18]:
fig = plt.figure(figsize=[5, 4])
scanalysis.plots.plot.plot_tsne_by_cell_sizes(df, d1, fig = fig)
Out[18]:
In [19]:
fig, ax = scanalysis.plots.plot.plot_gene_expression(data, d1, genes = ['CD34', 'GATA2', 'GATA1', 'MPO'])
Note: Please run diffusion maps and tSNE before plotting diffusion components (via plot_diffusion_components function).
In [20]:
fig, ax = scanalysis.plots.plot.plot_diffusion_components(d, tempEigvec, tempEigval)
The run_diffusion_map_correlations function is designed to work for single cell RNA-seq (not mass-cyt). Please run diffusion maps using run_diffusion_map before determining correlations.
Note: the component 0 is the trivial component and does not encode any information of the data.
In [21]:
dmap_corr = scanalysis.plots.plot.run_diffusion_map_correlations(data, tempEigvec)
After determining the diffusion map correlations, we can plot the gene component correlations (via plot_gene_component_correlations function).
In [22]:
scanalysis.plots.plot.plot_gene_component_correlations(dmap_corr)
Out[22]:
The enrichments can be determined using the run_gsea function. This function needs the prefix for generating GSEA reports and a gmt file representing the different gene sets. The following invocation of the function shows the supported set of gmt files.
Note: Please make sure to run run_diffusion_map_correlations() before running GSEA to annotate those components.
Note: The gmt files package with Wishbone/SCAnalysis assume all the gene names to be upper case. This can be ensured using the following code to convert them to upper case.
In [ ]:
data.columns = data.columns.str.upper()
In [ ]:
scanalysis.tools.wb.gsea.run_gsea(dmap_corr, output_stem= os.path.expanduser('~/.scanalysis/tools/gsea/mouse_marrow'))
Since this is data from mouse, gmt_file parameter can be set to (mouse, gofat.bp.v1.0.gmt.txt)
In [ ]:
reports = scanalysis.tools.wb.gsea.run_gsea(dmap_corr, output_stem= os.path.expanduser('~/.scanalysis/gsea/mouse_marrow'),
gmt_file=('mouse', 'gofat.bp.v1.0.gmt.txt'))
The detailed reports can be found at ~/.wishbone/gsea/
In [ ]:
!open ~/.scanalysis/gsea/
run_gsea function also returns the top enrichment gene sets along each component. GSEA determines enrichments that are either positively or negatively correlated with the gene component correlations. In this dataset, components 1 and 2 show relevant enrichments and are used for running Wishbone/SCAnalysis. Please see Selection of diffusion components for single cell RNA-seq section of the Supplementary Methods for more details.
In [ ]:
# Component 1 enrichments
reports[1]['neg']
In [ ]:
# Component 2 enrichments
reports[2]['pos']
Wishbone can be run by specifying the start cell and number of waypoints to be used. The start cell for this dataset was chosen based on high expression of CD34. (for each dataset, there is a corresponding start cell particular to that dataset)
Note: Keep in mind that Wishbone requires data that has been run through normalization, PCA, and diffusion maps.
Here, we will consider only 2 components.(?)
In [23]:
wb = scanalysis.tools.wb.wishbone.run_wishbone(tempEigvec.iloc[:,[1,2]], 'W30258', k=15, l=15, num_waypoints =250, branch=True)
Use the WBResults class object returned from the run_wishbone function to plot various graphs.
Wishbone trajectory and branch results can be visualized on tSNE maps using the plot_wishbone_on_tsne function.
Note: Please make sure to run Wishbone before attempting to plot Wishbone results.
In [24]:
wb.plot_wishbone_on_tsne(d)
Out[24]:
Gene expression trends along the Wishbone trajectory can be visualized using the plot_marker_trajectory function. This function also returns the smoothed trends along with the matplotlib fig, ax handler objects.
Note: Variance calculation is currently not supported for single-cell RNA-seq (sc-seq)
In [25]:
vals, fig, ax = wb.plot_marker_trajectory(data, ['CD34', 'GATA1', 'GATA2', 'MPO']);
In [26]:
wb.plot_marker_heatmap(vals)
Out[26]:
In [27]:
wb.plot_marker_heatmap(vals, trajectory_range=[0.1, 0.6])
Out[27]:
In [28]:
wb.plot_derivatives(vals)
Out[28]:
In [29]:
wb.plot_derivatives(vals, trajectory_range=[0.3, 0.6])
Out[29]:
MAGIC can be run with the run_magic function.
Note: Data should be filtered and normalized before running MAGIC. Running PCA is not necessary, since the run_magic function automatically performs PCA.
In [30]:
new_data = scanalysis.tools.magic.run_magic(data)
In [31]:
m_data = scanalysis.io.loadsave.load("~/sdata_nn_TGFb_day_8_10.csv")
We have to filter the data, but we won't do it here because it's not necessary.
In [ ]:
# Minimum molecules/cell value
CELL_MIN = 0
# Maximum molecules/cell values
CELL_MAX = 1000000
# Minimum number of nonzero cells/gene
# (None if no filtering desired)
GENE_NONZERO = None
# Minimum number of molecules/gene
# (None if no filtering desired)
GENE_MOLECULES = None
m_data = scanalysis.io.preprocess.filter_scseq_data(m_data, filter_cell_min=CELL_MIN, filter_cell_max=CELL_MAX,
filter_gene_nonzero=GENE_NONZERO, filter_gene_mols=GENE_MOLECULES)
## ^but this takes forever...
### also Pooja doesn't actually filter the data in the example notebook
## (there are just dummy parameters as an example of how to call the filtering function), so you should still be able to tes
Let's normalize the data though.
In [33]:
m_data = scanalysis.io.preprocess.normalize_scseq_data(m_data)
Then, let's apply the run_magic function on m_data.
In [34]:
new_m_data = scanalysis.tools.magic.run_magic(m_data)
Note: Please make sure to run MAGIC on normalized data before attempting to plot various MAGIC results.
In [35]:
fig, ax = scanalysis.plots.plot.scatter_gene_expression(data, ['SRRM1', 'TAB2'], color = 'GPX4')
ax.set_xlabel('SRRM1')
ax.set_ylabel('TAB2')
Out[35]:
The second plot below is for the data set used in the original MAGIC notebook.
In [36]:
fig, ax = scanalysis.plots.plot.scatter_gene_expression(m_data, ['VIM', 'CDH1'], color='ZEB1')
ax.set_xlabel('Vimentin (VIM)')
ax.set_ylabel('E-cadherin (CDH1)')
Out[36]:
In [37]:
fig, ax = scanalysis.plots.plot.scatter_gene_expression(new_data, ['MAGIC SRRM1', 'MAGIC TAB2'], color = 'MAGIC GPX4')
ax.set_xlabel('MAGIC SRRM1')
ax.set_ylabel('MAGIC TAB2')
Out[37]:
The second plot below is for the data set used in the original MAGIC notebook.
In [38]:
fig, ax = scanalysis.plots.plot.scatter_gene_expression(new_m_data, ['MAGIC VIM', 'MAGIC CDH1'], color ='MAGIC ZEB1')
ax.set_xlabel('MAGIC Vimentin (VIM)')
ax.set_ylabel('MAGIC E-cadherin (CDH1)')
Out[38]:
In [39]:
fig, ax = scanalysis.plots.plot.scatter_gene_expression(data, ['SRRM1', 'TAB2', 'CBR1'], color='GPX4')
ax.set_xlabel('SRRM1')
ax.set_ylabel('TAB2')
ax.set_zlabel('CBR1')
Out[39]:
The second plot below is for the data set used in the original MAGIC notebook.
In [40]:
fig, ax = scanalysis.plots.plot.scatter_gene_expression(m_data, ['VIM', 'CDH1', 'FN1'], color='ZEB1')
ax.set_xlabel('Vimentin (VIM)')
ax.set_ylabel('E-cadherin (CDH1)')
ax.set_zlabel('Fibronectin (FN1)')
Out[40]:
In [41]:
fig, ax = scanalysis.plots.plot.scatter_gene_expression(new_data, ['MAGIC SRRM1', 'MAGIC TAB2', 'MAGIC CBR1'], color='MAGIC GPX4')
ax.set_xlabel('MAGIC SRRM1')
ax.set_ylabel('MAGIC TAB2')
ax.set_zlabel('MAGIC CBR1')
Out[41]:
The second plot below is for the data set used in the original MAGIC notebook.
In [42]:
fig, ax = scanalysis.plots.plot.scatter_gene_expression(new_m_data, ['MAGIC VIM', 'MAGIC CDH1', 'MAGIC FN1'], color='MAGIC ZEB1')
ax.set_xlabel('MAGIC Vimentin (VIM)')
ax.set_ylabel('MAGIC E-cadherin (CDH1)')
ax.set_zlabel('MAGIC Fibronectin (FN1)')
ax.set_zlim(35, 150)
Out[42]:
First, however, we should set up our data. Add to temp the column names of the various Principal Components.
In [43]:
temp.columns = ['PC1','PC2','PC3','PC4','PC5']
Then, combine data and temp into one combined dataset, which is assigned to variable x below.
In [44]:
x = pd.concat([data, temp], axis=1)
Finally, let's plot.
In [45]:
gs = gridspec.GridSpec(2,2)
fig = plt.figure(figsize=[15, 12])
genes = ['SRRM1', 'TAB2', 'CBR1', 'GPX4']
for i in range(len(genes)):
ax = fig.add_subplot(gs[i//2, i%2])
scanalysis.plots.plot.scatter_gene_expression(x, genes=['PC2', 'PC3'], color=genes[i], fig=fig, ax=ax)
The second plot below is for the data set used in the original MAGIC notebook.
In [46]:
pca1, pca2 = scanalysis.utils.pca.run_pca(m_data, 5)
pca1.columns = ['PC1','PC2','PC3','PC4','PC5']
x1 = pd.concat([m_data, pca1], axis=1)
In [47]:
x1
Out[47]:
In [48]:
gs = gridspec.GridSpec(2,2)
fig = plt.figure(figsize=[15, 12])
genes = ['CDH1', 'VIM', 'EZH2', 'ZEB1']
for i in range(len(genes)):
ax = fig.add_subplot(gs[i//2, i%2])
scanalysis.plots.plot.scatter_gene_expression(x1, genes=['PC2', 'PC3'], color=genes[i], fig=fig, ax=ax)
In [49]:
x2 = pd.concat([new_data, temp], axis=1)
x2
Out[49]:
In [50]:
gs = gridspec.GridSpec(2,2)
fig = plt.figure(figsize=[15, 12])
genes = ['MAGIC SRRM1', 'MAGIC TAB2', 'MAGIC CBR1', 'MAGIC GPX4']
for i in range(len(genes)):
ax = fig.add_subplot(gs[i//2, i%2])
scanalysis.plots.plot.scatter_gene_expression(x2, genes=['PC2', 'PC3'], color=genes[i], fig=fig, ax=ax)
The second plot below is for the data set used in the original MAGIC notebook.
In [51]:
x3 = pd.concat([new_m_data, pca1], axis=1)
In [52]:
gs = gridspec.GridSpec(2,2)
fig = plt.figure(figsize=[15, 12])
genes = ['MAGIC CDH1', 'MAGIC VIM', 'MAGIC EZH2', 'MAGIC ZEB1']
for i in range(len(genes)):
ax = fig.add_subplot(gs[i//2, i%2])
scanalysis.plots.plot.scatter_gene_expression(x3, genes=['PC2', 'PC3'], color=genes[i], fig=fig, ax=ax)
In [53]:
fig, ax = scanalysis.plots.plot.plot_gene_expression(data=data, tsne=d, genes=['SRRM1', 'TAB2', 'CBR1', 'GPX4'])
The second plot below is for the data set used in the original MAGIC notebook.
In [54]:
t1 = scanalysis.utils.tsne.TSNE()
d1 = t1.fit_transform(pca1)
In [55]:
fig, ax = scanalysis.plots.plot.plot_gene_expression(data=m_data, tsne=d1, genes=['CDH1', 'VIM', 'EZH2', 'ZEB1'])
In [56]:
fig, ax = scanalysis.plots.plot.plot_gene_expression(data=new_data, tsne=d, genes=['MAGIC SRRM1', 'MAGIC TAB2', 'MAGIC CBR1', 'MAGIC GPX4'])
The second plot below is for the data set used in the original MAGIC notebook.
In [57]:
fig, ax = scanalysis.plots.plot.plot_gene_expression(data=new_m_data, tsne=d1, genes=['MAGIC CDH1', 'MAGIC VIM', 'MAGIC EZH2', 'MAGIC ZEB1'])
In [58]:
scanalysis.plots.plot.savefig(fig, 'h')
First, load the pickle file with the data. The data is normalized and log transformed.
In [2]:
import scanalysis
In [3]:
mb_data = scanalysis.io.loadsave.load("~/mb_data.p")
In [4]:
import pandas as pd
import numpy as np
We need the data to be in cells x genes format (index x columns), so in this case we will switch the rows and columns to achieve this.
In [5]:
mb_data = pd.DataFrame.transpose(mb_data)
In [5]:
mb_data
Out[5]:
In [6]:
DMEigVals = scanalysis.io.loadsave.load("~/palantir/dm_eig_vals.csv")
In [7]:
DMEigs = scanalysis.io.loadsave.load("~/palantir/dm_eigs.csv")
Note: the run_multibranch function takes ~10 minutes.
Use the following parameters when running run_multibranch for this particular data set:
In [8]:
der = scanalysis.tools.pr.palantir.run_multibranch(data_ = mb_data, DMEigs = DMEigs, DMEigVals = DMEigVals, dm_eigs = [1,2,3], start_cell="Run5_126835192163230", num_waypoints = 300, flock = 0)
In [9]:
der.entropy
Out[9]:
In [18]:
der.branch_prob
Out[18]:
In [10]:
atrajectory = scanalysis.io.loadsave.load("~/palantir/trajectory.csv")
In [16]:
import matplotlib.pyplot as plt
%matplotlib inline
In [11]:
plt.scatter(der.trajectory, atrajectory)
Out[11]:
Here, we will try the first five genes in the data matrix. You shouldn't plot all the markers because that will kill your computer.
In [12]:
mb_data_f5 = mb_data.iloc[:,0:5]
In [17]:
mb_data_f5
Out[17]:
Run the plot_markers function to visualize results.
when compute_std parameter is set to True, results in ValueError, what seems to be happening is that the the program is trying to reshape the original array of 5 (which holds the trends) into array of dimensions 5x500, but it can't because there are only 5 elements in the original array. 500 is derived from the traj_bins instance variable, which was initialized according to the no_bins parameter. In reality, the trends array can only be reshaped into an array of 5x1 dimensions... so I was wondering, is the length of trends correct?
In [72]:
der.compute_marker_trends(mb_data_f5, der.branch_prob.columns, True, n_jobs=1)
It works when compute_std parameter is set to False, shown below:
In [47]:
der.compute_marker_trends(mb_data_f5, der.branch_prob.columns, False, n_jobs=1)
Out[47]:
But let's plot the markers onto a graph- the function plot_markers will automatically call compute_marker_trends by itself, so there is no need to explicitly call the function compute_markers_trends like we did above.
In [48]:
der.plot_markers(mb_data_f5, branches = der.branch_prob.columns)
Let's see the tSNE maps of this particular dataset.
In [6]:
m, n = scanalysis.utils.pca.run_pca(mb_data)
In [9]:
Eigvec, Eigval = scanalysis.utils.diffusionmap.run_diffusion_map(m)
In [10]:
Eigvec
Out[10]:
In [11]:
Eigval
Out[11]:
In [12]:
t2 = scanalysis.utils.tsne.TSNE()
d2 = t2.fit_transform(Eigvec)
In [13]:
d2
Out[13]:
In [17]:
fig, ax = scanalysis.plots.plot.plot_tsne(d2)
In [25]:
fig = plt.figure(figsize=[5, 4])
scanalysis.plots.plot.plot_tsne_by_cell_sizes(mb_data, d2, fig = fig)
Out[25]:
Use the tSNE data provided by Manu.
In [15]:
tsne = scanalysis.io.loadsave.load('~/tsne.csv')
Filter out and only include the tSNE data that corresponds to the original mb_data dataset. In other words, filter out to take only the cell names in the index of mb_data from tSNE data.
In [16]:
tsne1 = tsne.loc[mb_data.index]
In [17]:
tsne1
Out[17]:
In [18]:
der.plot_palantir_on_tsne(tsne1)
Out[18]:
In [11]:
data_part = mb_data.loc[:,['HBB', 'CD34', 'MPO', 'LYZ', 'CST3']]
In [32]:
data_part
Out[32]:
In [33]:
der.compute_marker_trends(data_part, None, False, n_jobs=1)
Out[33]:
In [54]:
der.plot_markers(data_part)
In [18]:
trends = scanalysis.io.loadsave.load("~/trends.p")
In [19]:
trends
Out[19]:
In [9]:
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
In [10]:
magic_mb_data = scanalysis.tools.magic.run_magic(mb_data)
In [14]:
magic_mb_data
Out[14]:
In [13]:
der.plot_clusters_of_gene_trends(marker_data= magic_mb_data, genes = ['MAGIC HBB', 'MAGIC CD34', 'MAGIC MPO', 'MAGIC LYZ', 'MAGIC CST3'], branch= "Run4_235626713532342")
Out[13]:
In [10]:
der.plot_clusters_of_gene_trends(marker_data =mb_data, genes = ['HBB', 'CD34', 'MPO', 'LYZ', 'CST3'], branch= 'Run5_161340154039083'
Out[10]:
In [ ]:
trends
In [11]:
der.plot_clusters_of_gene_trends(marker_data= mb_data, genes = ['HBB', 'CD34', 'MPO', 'LYZ', 'CST3'], branch= 'Run5_239477254471070')
Out[11]:
Setty M, Tadmor MD, Reich-Zeliger S, Angel O, Salame TM, Kathail P, Choi K, Bendall S, Friedman N, Pe’er D. "Wishbone identifies bifurcating developmental trajectories from single-cell data." Nat. Biotech. 2016 April 12. http://dx.doi.org/10.1038/nbt.3569
van Dijk, David, et al. "MAGIC: A diffusion-based imputation method reveals gene-gene interactions in single-cell RNA-sequencing data." BioRxiv (2017): 111591. http://www.biorxiv.org/content/early/2017/02/25/111591
Go back to the top.