Analysis of single-cell RNAseq data with CellrangerRkit

Now that you’ve run cellranger count and maybe even cellranger aggr on your single-cell RNAseq samples, you’re ready to start exploring.

As discussed previously, you have results to explore without firing up your RStudio.  This page describes many of the output files.  The cellranger output includes the following useful files:

  • sample/outs/web_summary.html – Open with your web browser to see basic QC on the library, any warnings about low-quality results, and simple summary statistics.  If you used aggr, it lists counts by library. Click on “Analysis” and you’ll see some basic tSNE plots (can be colored by Kmeans clustering or library id) along with sortable lists of genes distinguishing Kmeans clusters.
  • sample/outs/cloupe.cloupe – As mentioned previously, this file can be opened with 10X Genomics Loupe Cell Browser.
  • sample/outs/analysis/diffexp/kmeans_n_clusters (n being 2 through 10) – Each folder contains a csv file listing differential expression of genes by cluster, along with p-values.
  • sample/outs/filtered_gene_bc_matrices_mex/mm10 (obviously I used mm10 genome) – this contains the files needed for loading into Seurat.  More on that in the next post.

This is only a basic sampling.  Explore the folders to find more.

Setting up R for cellrangerRkit

Start by downloading R from a CRAN mirror, and the free desktop version of RStudio.  Follow directions for your operating system for downloading and installing cellrangerRkit from 10X Genomics.

10X Genomics provides a nice vignette using public PBMC data.  My outline will follow this vignette with some explanations and variations.

The full R script file containing the commands shown here can be downloaded from this link.

Load your data

Fire up RStudio and load the library:

library(cellrangerRkit)

Use the Files tab in RStudio to change directories to where you have stored your sample results.  Choose “Set as working directory” from the More menu.  The system will enter this command for you (with the example location from my system):

setwd("C:/scRNAseq/agg")

The “agg” folder is the one I used to collect output from my cellranger aggr job from the previous post.  It contains the “outs” folder along with some other stuff.

Now you can use this same directory address to specify your “pipestance_path” (cellranger’s term for it):

cellranger_pipestance_path="C:/scRNAseq/agg"
gbm=load_cellranger_matrix(cellranger_pipestance_path)
analysis_results=load_cellranger_analysis_results(cellranger_pipestance_path)

At this point you’ve got a giant object for the GeneBCMatrix (gbm, 21.3 Mb in my experiment) and a list of analysis results (68.1 Mb).

Check expression levels

Let’s extract the tSNE plot data and visualize the RNA read depth for each cell.

tsne_proj=analysis_results$tsne
visualize_umi_counts(gbm,tsne_proj[c("TSNE.1","TSNE.2")],limits=c(3,4),marker_size = 0.05)
numUMIs
tSNE plot with each dot representing a single cell, colored by number of detected sequencing reads.

This will draw a standard tSNE plot with the total number of UMIs (unique molecular identifiers – the tag specific for each cell) for each cell.  Each dot is one cell.  The scale is in log10 of the UMI number per cell.

Looks good so far, let’s move on to choosing only expressed genes.

Nonzero genes and normalization

use_genes=get_nonzero_genes(gbm)
gbm_bcnorm=normalize_barcode_sums_to_median(gbm[use_genes,])
gbm_log=log_gene_bc_matrix(gbm_bcnorm,base=10)

This gives you an array of indices for the genes observed in any cell, and uses only these genes to normalize and log10 convert expression levels.

print(dim(gbm_log))
[1] 17627 57575

This gives you the number of non-zero genes (17,627) and cells (57,575) in your project.

Check expression of selected genes

If you already know a few genes that likely distinguish various sub-groups of cells in your experiment, you can plot them now.  Manually create an array of gene symbols first.

genes=c("Ddx21","Ldha","Sox11","Fabp7")
visualize_gene_markers(gbm_log,genes,tsne_proj[c("TSNE.1","TSNE.2")],limits=c(0,1.5))
gene tSNE plot
tSNE plots depicting detectable levels of indicated genes per cell.

This produces a faceted plot with each showing expression level by cell for that gene.

If you’re lucky, key cell types can be identified rapidly using this method.  If not, you’ll need to continue on to clustering to identify genes distinguishing clusters.

Plot libraries

Another simple example is to color the dots in the tSNE plot by library identifier.  Here we’re going to use a trick from this post at the 10X Genomics help site, where we strip the library id number from the grouped cell barcode identifier.  When cellranger aggr runs to combine multiple libraries, each one contains the same set of UMI barcodes.  So to distinguish them, it adds a “_1”, “_2” and so on for each library.  This code retrieves and extracts that to create a “gem_group” vector.

gem_group=sapply(strsplit(as.character(pData(gbm)$barcode),"-"), '[[',2)
visualize_clusters(gem_group,tsne_proj[c("TSNE.1","TSNE.2")])
tSNE by library
tSNE plot colored by sample number (library ID value).

The result is the same tSNE plot, now colored by library id value.  This should match up the order of samples in your agg_samples.csv file when you ran cellranger aggr.

Clearly in my case samples 3 and 4 are distinct from samples 1 and 2, which overlap nicely.

Clusters

Now let’s load the pre-calculated cluster data and explore a little.  Cellranger automatically creates Kmeans clusters with k ranging from 2 to 10.

n_clu=2:10
km_res=analysis_results$clustering
clu_res=sapply(n_clu,function(x) km_res[[paste("kmeans",x,"clusters",sep="_")]]$Cluster)
colnames(clu_res)=sapply(n_clu,function(x) paste("kmeans",x,sep="."))
visualize_clusters(clu_res,tsne_proj[c("TSNE.1","TSNE.2")])

This will plot panels for each k value, showing how the tSNE plot is divided by that number of clusters.

kmeans tSNE
tSNE plots split by k value (number of clusters) and colored by k-means cluster number.

I’ll choose k=5 to examine more closely.

example_K=5
example_col=rev(brewer.pal(example_K,"Set2"))
cluster_result=analysis_results$clustering[[paste("kmeans",example_K,"clusters",sep="_")]]
visualize_clusters(cluster_result$Cluster,tsne_proj[c("TSNE.1","TSNE.2")],colour=example_col)
5k cluster tSNE
tSNE plot colored by cluster number with k=5 clusters.

 

We’ll next extract genes that best distinguish the 5 clusters and draw a heatmap of the top 3 genes to compare among clusters.

cells_to_plot=order_cell_by_clusters(gbm,cluster_result$Cluster)
prioritized_genes=prioritize_top_genes(gbm,cluster_result$Cluster,"sseq",min_mean=0.5)gbm_pheatmap(log_gene_bc_matrix(gbm),prioritized_genes,cells_to_plot,n_genes=3,colour=example_col,limits=c(-1,2))
heatmap k5 clusters
Heatmap of top 3 genes per cluster, with vertical slices representing individual cells.

The numbers of cells per cluster and the proportion of cells in each cluster is displayed with this function:

cell_composition(cluster_result$Cluster)
Cell composition: 
                    1          2          3          4          5
num_cells  1.6848e+04 1.2031e+04 1.1501e+04 9348.00000 7847.00000
proportion 2.9263e-01 2.0896e-01 1.9976e-01    0.16236    0.13629

Finally, you can output gene lists by cluster.  First be sure to create a “gene_sets” folder within your working directory, then:

output_folder="I:/scRNAseq/agg/gene_sets"
write_cluster_specific_genes(prioritized_genes,output_folder,n_genes=20)

Next step…

We’ll compare with SeuratChapter 1: Single sample analysis.

Advertisement

One thought on “Analysis of single-cell RNAseq data with CellrangerRkit”

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: