Trajectory inference methods
During my internship this past summer, I spent time surveying the many existing trajectory inference (TI) methods. TI methods leverage data from thousands of single cells to reconstruct lineage paths and infer the order of cells along developmental trajectories.
Selecting a method is not trivial. With the many published algorithms available and the rather nonexistent comprehensive assessment of each method, my first task was to identify a handful of promising methods to evaluate. Thanks to this paper where authors compared 45 TI methods on both real and synthetic data sets to assess performance, scalability, robustness, and useabiliity, I was able to narrow down a list of candidate methods.
Ultimately I moved forward with the following: PAGA (Python), Palantir (Python), and Slingshot (R).
Simply –
1. PAGA learns topology by partitioning a kNN graph and ordering cells using random-walk-based distances in the PAGA graph.
2. Palantir uses diffusion-map components to represent the underlying trajectory. It then assigns pseudotime by modeling cell fates as a probabilistic process using a Markov chain
3. Slingshot infers lineage by taking clustered cells, connecting these clusters using a minimum spanning tree, and then fitting principle curves for each detected branch. A separate pseudotime is calculated for each lineage.
Here, I will walk through the steps I used to identify trajectories and calculate pseudotime for a data set. For simplicity, I will use the Paul15 data set (Development of Myeloid Progenitors) provided within Scanpy
. As a heads up – since I will be using tools from both R and Python, I will be switching between the two languages in this notebook.
Set up
First loading necessary python packages.
import matplotlib.pyplot as plt
import pandas as pd
import scanpy as sc
import scanpy.external as sce
Next I load the paul15 data set and perform some scripted preprocessing steps, some of which I’ve discussed in a previous post.
adata = sc.datasets.paul15()
## WARNING: In Scanpy 0.*, this returned logarithmized data. Now it returns non-logarithmized data.
## /Users/joy/miniconda3/lib/python3.9/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
## c.reorder_categories(natsorted(c.categories), inplace=True)
## ... storing 'paul15_clusters' as categorical
## Trying to set attribute `.uns` of view, copying.
# preprocessing QC
sc.pp.recipe_zheng17(adata)
# calculate pca
sc.tl.pca(adata)
# compute the neighbor graph
sc.pp.neighbors(adata, n_neighbors = 20)
# calculate umap embedding
sc.tl.umap(adata)
Now that the preprocessing is done we can explore the AnnData object. This object contains information for 999 genes across 2730 cells. Any operations performed on the data are also documented. Here we performed log-transformation (log1p), calculated pca, nearest neighbors, and umap. Notice also that within obs
there is a column names ‘paul15_clusters’. This specifies the cluster assignments given to the data.
adata
## AnnData object with n_obs × n_vars = 2730 × 999
## obs: 'paul15_clusters', 'n_counts_all'
## var: 'n_counts', 'mean', 'std'
## uns: 'iroot', 'log1p', 'pca', 'neighbors', 'umap'
## obsm: 'X_pca', 'X_umap'
## varm: 'PCs'
## obsp: 'distances', 'connectivities'
We can visualize the data in umap space and color the cells based on which cluster they fall into.
sc.pl.umap(adata, color='paul15_clusters', legend_loc='on data', title = '')
Identify a start cell
Many trajectory inference methods require the user to specify a cell or cell group as the ‘starting cell’. Or in other words, a cell/cell group that is at the least differentiated state. For this analysis, I will use a cell that is within the 7MEP (myeloid erythroid progenitor) cluster.
# using cell within 7MEP cluster
adata.obs[adata.obs['paul15_clusters'] == '7MEP'][0:6]
## paul15_clusters n_counts_all
## 0 7MEP 353.0
## 15 7MEP 285.0
## 48 7MEP 545.0
## 64 7MEP 365.0
## 65 7MEP 2714.0
## 66 7MEP 424.0
Here I have listed 6 cells that belong to this cluster. I will arbitrarily choose one to act as the starting cell.
# visualize location of cell on umap
cell = ['48']
umap = pd.DataFrame(adata.obsm['X_umap'][:,0:2], index=adata.obs_names, columns=['x', 'y'])
plt.scatter(umap["x"], umap["y"], s=5, color="lightgrey")
plt.scatter(umap.loc[cell, "x"], umap.loc[cell, "y"], s=30)
Run trajectory inference Methods: PAGA & Palantir
Now I am ready to run the Python TI methods. Below I have defined a function that will run PAGA and Palantir. Notice that only Palantir requires a start cell be specified.
def run_trajectory(data, clusterby, startcell):
print('Running PAGA')
sc.tl.paga(data, groups = clusterby)
print('Calculating Palantir Diffusion maps')
# using adaptive anisotropic adjacency matrix, instead of PCA projections (default) to compute diffusion components
sce.tl.palantir(data, n_components = 25, knn = 50, impute_data = False, use_adjacency_matrix = True)
print('Run Palantir')
start = startcell
pr_res = sce.tl.palantir_results(data, knn = 50, early_cell = start)
# add pseudotime and differentiation potential information to anndata object
data.obs['palantir_pseudotime'] = pr_res.pseudotime
data.obs['palantir_entropy'] = pr_res.entropy
run_trajectory.pr_res = pr_res.branch_probs
Palantir generates the following results:
1. Pseudotime: Pseudotime ordering of each cell
2. Terminal state probabilities: Matrix of cells X terminal states. Each entry represents the probability of the corresponding cell reaching the respective terminal state
3. Entropy: A quantitative measure of the differentiation potential of each cell computed as the entropy of the multinomial terminal state probabilities
# run methods
run_trajectory(adata, clusterby = 'paul15_clusters', startcell = '48')
Visualize PAGA and Palantir results
On the left is the standard umap with cells colored by cluster identity. On the right is the PAGA graph. Thicker lines represent stronger connections between cell groups. As you would expect, the progenitor cell groups (MEP, GMP) are at the center of the graph. From here, branching correlates with erythroid and myeloid differentiation as we would expect. This matches nicely with the umap embedding as well.
Palantir, on the other hand, identifies terminal cells. Below I’ve labeled the locations of these terminal cells.
I can also use the AnnData object to identify which cluster group these terminal cells belong to:
adata.obs[adata.obs.index.isin(['2566','2034','801'])]['paul15_clusters']
## 801 11DC
## 2034 3Ery
## 2566 15Mo
## Name: paul15_clusters, dtype: category
## Categories (19, object): ['1Ery', '2Ery', '3Ery', '4Ery', ..., '16Neu', '17Neu', '18Eos', '19Lymph']
Given these terminal cells, Palantir also calculates terminal state probabilities (see above). These values can be visualized on the umap:
Alternatively, the pseudotime across all cells can be plotted:
Run Slingshot
While PAGA and Palantir can be run in Python, in order to run the slingshot algorithm, we must move to R.
Slingshot has 2 steps:
1. Identify lineage structure with a cluster-based minimum spanning tree.
2. Construct smooth representations of each lineage using simultaneous principal curves
library(slingshot, quietly = T)
library(dplyr)
Since I have been handling the single-cell data in AnnData format, I will call the object using the reticulate
package notation instead of converting the data to a Seurat object. I will use data
to specify data after dimensional reduction by pca and clusters
to indicate to which group each cell belongs.
data <- py$adata$obsm['X_pca']
clusters <- py$adata$obs['paul15_clusters'] %>% dplyr::pull()
Similar to Palantir, I also specify a starting point. In this case, because Slingshot draws connections between clusters I assign a starting cluster.
results <- slingshot(data, clusterLabels = clusters, start.clus = '7MEP')
## Using diagonal covariance matrix
The output of a slingshot run contains information about each lineage
path identified by the algorithm. Below I pull this information and convert it to a python object. I then add it to the .uns
element of the AnnData object. Essentially, this describes the path from the user defined starting cluster (MEP) to all terminal states determined by Slingshot.
results@lineages
## $Lineage1
## [1] "7MEP" "12Baso" "9GMP" "10GMP" "13Baso" "14Mo" "15Mo" "16Neu"
##
## $Lineage2
## [1] "7MEP" "6Ery" "5Ery" "4Ery" "3Ery" "2Ery" "1Ery"
##
## $Lineage3
## [1] "7MEP" "12Baso" "9GMP" "10GMP" "11DC"
##
## $Lineage4
## [1] "7MEP" "12Baso" "17Neu"
##
## $Lineage5
## [1] "7MEP" "12Baso" "19Lymph"
##
## $Lineage6
## [1] "7MEP" "12Baso" "18Eos"
##
## $Lineage7
## [1] "7MEP" "8Mk"
lineages <- r_to_py(results@lineages)
adata.uns['lineages'] = r.lineages
Visualize slingshot results
For visualization, I can then overlay these paths on the umap. Similar to both PAGA and Palantir, we observe a branching pattern that corresponds with erythroid and myeloid differentiation.
Wrap up
Over the course of my internship, I used these algorithms to identify trajectories describing potential cell differentiation patterns. While the algorithms themselves are rather straightforward to implement, there are many nuanced difficulties when performing such analyses. A sticking point I found was how to best specify a starting cell. Oftentimes the data I worked with contained differentiated cell types almost exclusively. As such, selecting a cell/cell group to designate as a start point was difficult. Additionally, it is challenging to consider how to integrate study metadata (treatment group, patient id etc.) into this analysis. If samples from 30 patients were collected, how do we estimate cell trajectories? Is it appropriate to group all of the data and find a common trajectory? What about trajectories unique to certain patients? How do we quantify the amount of uncertainty in an estimated lineage path? So many things to consider!! But overall I had a blast learning some new computational techniques and diving into the world of single-cell genomics.