Trajectory inference methods

Oct 16, 2021 Python, R, genomics

During my internship this past summer, I spent time surveying the many existing trajectory inference (TI) methods. TI methods leverage data from thousands of single cells to reconstruct lineage paths and infer the order of cells along developmental trajectories.

Selecting a method is not trivial. With the many published algorithms available and the rather nonexistent comprehensive assessment of each method, my first task was to identify a handful of promising methods to evaluate. Thanks to this paper where authors compared 45 TI methods on both real and synthetic data sets to assess performance, scalability, robustness, and useabiliity, I was able to narrow down a list of candidate methods.

Ultimately I moved forward with the following: PAGA (Python), Palantir (Python), and Slingshot (R).

Simply –
1. PAGA learns topology by partitioning a kNN graph and ordering cells using random-walk-based distances in the PAGA graph.
2. Palantir uses diffusion-map components to represent the underlying trajectory. It then assigns pseudotime by modeling cell fates as a probabilistic process using a Markov chain
3. Slingshot infers lineage by taking clustered cells, connecting these clusters using a minimum spanning tree, and then fitting principle curves for each detected branch. A separate pseudotime is calculated for each lineage.

Here, I will walk through the steps I used to identify trajectories and calculate pseudotime for a data set. For simplicity, I will use the Paul15 data set (Development of Myeloid Progenitors) provided within Scanpy. As a heads up – since I will be using tools from both R and Python, I will be switching between the two languages in this notebook.

Set up

First loading necessary python packages.

import matplotlib.pyplot as plt
import pandas as pd
import scanpy as sc
import scanpy.external as sce

Next I load the paul15 data set and perform some scripted preprocessing steps, some of which I’ve discussed in a previous post.

adata = sc.datasets.paul15()

## WARNING: In Scanpy 0.*, this returned logarithmized data. Now it returns non-logarithmized data.
## /Users/joy/miniconda3/lib/python3.9/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
##   c.reorder_categories(natsorted(c.categories), inplace=True)
## ... storing 'paul15_clusters' as categorical
## Trying to set attribute `.uns` of view, copying.

# preprocessing QC
sc.pp.recipe_zheng17(adata)
# calculate pca
sc.tl.pca(adata)
# compute the neighbor graph
sc.pp.neighbors(adata, n_neighbors = 20)
# calculate umap embedding
sc.tl.umap(adata)

Now that the preprocessing is done we can explore the AnnData object. This object contains information for 999 genes across 2730 cells. Any operations performed on the data are also documented. Here we performed log-transformation (log1p), calculated pca, nearest neighbors, and umap. Notice also that within obs there is a column names ‘paul15_clusters’. This specifies the cluster assignments given to the data.

adata

## AnnData object with n_obs × n_vars = 2730 × 999
##     obs: 'paul15_clusters', 'n_counts_all'
##     var: 'n_counts', 'mean', 'std'
##     uns: 'iroot', 'log1p', 'pca', 'neighbors', 'umap'
##     obsm: 'X_pca', 'X_umap'
##     varm: 'PCs'
##     obsp: 'distances', 'connectivities'

We can visualize the data in umap space and color the cells based on which cluster they fall into.

sc.pl.umap(adata, color='paul15_clusters', legend_loc='on data', title = '')

Identify a start cell

Many trajectory inference methods require the user to specify a cell or cell group as the ‘starting cell’. Or in other words, a cell/cell group that is at the least differentiated state. For this analysis, I will use a cell that is within the 7MEP (myeloid erythroid progenitor) cluster.

# using cell within 7MEP cluster 
adata.obs[adata.obs['paul15_clusters']  == '7MEP'][0:6]

##    paul15_clusters  n_counts_all
## 0             7MEP         353.0
## 15            7MEP         285.0
## 48            7MEP         545.0
## 64            7MEP         365.0
## 65            7MEP        2714.0
## 66            7MEP         424.0

Here I have listed 6 cells that belong to this cluster. I will arbitrarily choose one to act as the starting cell.

# visualize location of cell on umap
cell = ['48']
umap = pd.DataFrame(adata.obsm['X_umap'][:,0:2], index=adata.obs_names, columns=['x', 'y'])

plt.scatter(umap["x"], umap["y"], s=5, color="lightgrey")
plt.scatter(umap.loc[cell, "x"], umap.loc[cell, "y"], s=30)

Run trajectory inference Methods: PAGA & Palantir

Now I am ready to run the Python TI methods. Below I have defined a function that will run PAGA and Palantir. Notice that only Palantir requires a start cell be specified.

def run_trajectory(data, clusterby, startcell):
    print('Running PAGA')
    sc.tl.paga(data, groups = clusterby)
    
    print('Calculating Palantir Diffusion maps')
    # using adaptive anisotropic adjacency matrix, instead of PCA projections (default) to compute diffusion components
    sce.tl.palantir(data, n_components = 25, knn = 50, impute_data = False, use_adjacency_matrix = True)
    
    print('Run Palantir')
    start = startcell
    pr_res = sce.tl.palantir_results(data, knn = 50, early_cell = start)
    # add pseudotime and differentiation potential information to anndata object
    data.obs['palantir_pseudotime'] = pr_res.pseudotime
    data.obs['palantir_entropy'] = pr_res.entropy
    
    run_trajectory.pr_res = pr_res.branch_probs

Palantir generates the following results:
1. Pseudotime: Pseudotime ordering of each cell
2. Terminal state probabilities: Matrix of cells X terminal states. Each entry represents the probability of the corresponding cell reaching the respective terminal state
3. Entropy: A quantitative measure of the differentiation potential of each cell computed as the entropy of the multinomial terminal state probabilities

# run methods
run_trajectory(adata, clusterby = 'paul15_clusters', startcell = '48')

Visualize PAGA and Palantir results

On the left is the standard umap with cells colored by cluster identity. On the right is the PAGA graph. Thicker lines represent stronger connections between cell groups. As you would expect, the progenitor cell groups (MEP, GMP) are at the center of the graph. From here, branching correlates with erythroid and myeloid differentiation as we would expect. This matches nicely with the umap embedding as well.

Palantir, on the other hand, identifies terminal cells. Below I’ve labeled the locations of these terminal cells.

I can also use the AnnData object to identify which cluster group these terminal cells belong to:

adata.obs[adata.obs.index.isin(['2566','2034','801'])]['paul15_clusters']

## 801     11DC
## 2034    3Ery
## 2566    15Mo
## Name: paul15_clusters, dtype: category
## Categories (19, object): ['1Ery', '2Ery', '3Ery', '4Ery', ..., '16Neu', '17Neu', '18Eos', '19Lymph']

Given these terminal cells, Palantir also calculates terminal state probabilities (see above). These values can be visualized on the umap:

Alternatively, the pseudotime across all cells can be plotted:

Run Slingshot

While PAGA and Palantir can be run in Python, in order to run the slingshot algorithm, we must move to R.

Slingshot has 2 steps:
1. Identify lineage structure with a cluster-based minimum spanning tree.
2. Construct smooth representations of each lineage using simultaneous principal curves

library(slingshot, quietly = T)
library(dplyr)

Since I have been handling the single-cell data in AnnData format, I will call the object using the reticulate package notation instead of converting the data to a Seurat object. I will use data to specify data after dimensional reduction by pca and clusters to indicate to which group each cell belongs.

data  <- py$adata$obsm['X_pca']
clusters <- py$adata$obs['paul15_clusters'] %>% dplyr::pull()

Similar to Palantir, I also specify a starting point. In this case, because Slingshot draws connections between clusters I assign a starting cluster.

results <- slingshot(data, clusterLabels = clusters, start.clus = '7MEP')

## Using diagonal covariance matrix

The output of a slingshot run contains information about each lineage path identified by the algorithm. Below I pull this information and convert it to a python object. I then add it to the .uns element of the AnnData object. Essentially, this describes the path from the user defined starting cluster (MEP) to all terminal states determined by Slingshot.

results@lineages

## $Lineage1
## [1] "7MEP"   "12Baso" "9GMP"   "10GMP"  "13Baso" "14Mo"   "15Mo"   "16Neu" 
## 
## $Lineage2
## [1] "7MEP" "6Ery" "5Ery" "4Ery" "3Ery" "2Ery" "1Ery"
## 
## $Lineage3
## [1] "7MEP"   "12Baso" "9GMP"   "10GMP"  "11DC"  
## 
## $Lineage4
## [1] "7MEP"   "12Baso" "17Neu" 
## 
## $Lineage5
## [1] "7MEP"    "12Baso"  "19Lymph"
## 
## $Lineage6
## [1] "7MEP"   "12Baso" "18Eos" 
## 
## $Lineage7
## [1] "7MEP" "8Mk"

lineages <- r_to_py(results@lineages)

adata.uns['lineages'] = r.lineages

Visualize slingshot results

For visualization, I can then overlay these paths on the umap. Similar to both PAGA and Palantir, we observe a branching pattern that corresponds with erythroid and myeloid differentiation.

Wrap up

Over the course of my internship, I used these algorithms to identify trajectories describing potential cell differentiation patterns. While the algorithms themselves are rather straightforward to implement, there are many nuanced difficulties when performing such analyses. A sticking point I found was how to best specify a starting cell. Oftentimes the data I worked with contained differentiated cell types almost exclusively. As such, selecting a cell/cell group to designate as a start point was difficult. Additionally, it is challenging to consider how to integrate study metadata (treatment group, patient id etc.) into this analysis. If samples from 30 patients were collected, how do we estimate cell trajectories? Is it appropriate to group all of the data and find a common trajectory? What about trajectories unique to certain patients? How do we quantify the amount of uncertainty in an estimated lineage path? So many things to consider!! But overall I had a blast learning some new computational techniques and diving into the world of single-cell genomics.