Now that you’ve run cellranger count and maybe even cellranger aggr on your single-cell RNAseq samples, you’re ready to start exploring.
- sample/outs/web_summary.html – Open with your web browser to see basic QC on the library, any warnings about low-quality results, and simple summary statistics. If you used aggr, it lists counts by library. Click on “Analysis” and you’ll see some basic tSNE plots (can be colored by Kmeans clustering or library id) along with sortable lists of genes distinguishing Kmeans clusters.
- sample/outs/cloupe.cloupe – As mentioned previously, this file can be opened with 10X Genomics Loupe Cell Browser.
- sample/outs/analysis/diffexp/kmeans_n_clusters (n being 2 through 10) – Each folder contains a csv file listing differential expression of genes by cluster, along with p-values.
- sample/outs/filtered_gene_bc_matrices_mex/mm10 (obviously I used mm10 genome) – this contains the files needed for loading into Seurat. More on that in the next post.
This is only a basic sampling. Explore the folders to find more.
Setting up R for cellrangerRkit
10X Genomics provides a nice vignette using public PBMC data. My outline will follow this vignette with some explanations and variations.
The full R script file containing the commands shown here can be downloaded from this link.
Load your data
Fire up RStudio and load the library:
Use the Files tab in RStudio to change directories to where you have stored your sample results. Choose “Set as working directory” from the More menu. The system will enter this command for you (with the example location from my system):
The “agg” folder is the one I used to collect output from my cellranger aggr job from the previous post. It contains the “outs” folder along with some other stuff.
Now you can use this same directory address to specify your “pipestance_path” (cellranger’s term for it):
cellranger_pipestance_path="C:/scRNAseq/agg" gbm=load_cellranger_matrix(cellranger_pipestance_path) analysis_results=load_cellranger_analysis_results(cellranger_pipestance_path)
At this point you’ve got a giant object for the GeneBCMatrix (gbm, 21.3 Mb in my experiment) and a list of analysis results (68.1 Mb).
Check expression levels
Let’s extract the tSNE plot data and visualize the RNA read depth for each cell.
tsne_proj=analysis_results$tsne visualize_umi_counts(gbm,tsne_proj[c("TSNE.1","TSNE.2")],limits=c(3,4),marker_size = 0.05)
This will draw a standard tSNE plot with the total number of UMIs (unique molecular identifiers – the tag specific for each cell) for each cell. Each dot is one cell. The scale is in log10 of the UMI number per cell.
Looks good so far, let’s move on to choosing only expressed genes.
Nonzero genes and normalization
use_genes=get_nonzero_genes(gbm) gbm_bcnorm=normalize_barcode_sums_to_median(gbm[use_genes,]) gbm_log=log_gene_bc_matrix(gbm_bcnorm,base=10)
This gives you an array of indices for the genes observed in any cell, and uses only these genes to normalize and log10 convert expression levels.
print(dim(gbm_log))  17627 57575
This gives you the number of non-zero genes (17,627) and cells (57,575) in your project.
Check expression of selected genes
If you already know a few genes that likely distinguish various sub-groups of cells in your experiment, you can plot them now. Manually create an array of gene symbols first.
This produces a faceted plot with each showing expression level by cell for that gene.
If you’re lucky, key cell types can be identified rapidly using this method. If not, you’ll need to continue on to clustering to identify genes distinguishing clusters.
Another simple example is to color the dots in the tSNE plot by library identifier. Here we’re going to use a trick from this post at the 10X Genomics help site, where we strip the library id number from the grouped cell barcode identifier. When cellranger aggr runs to combine multiple libraries, each one contains the same set of UMI barcodes. So to distinguish them, it adds a “_1”, “_2” and so on for each library. This code retrieves and extracts that to create a “gem_group” vector.
gem_group=sapply(strsplit(as.character(pData(gbm)$barcode),"-"), '[[',2) visualize_clusters(gem_group,tsne_proj[c("TSNE.1","TSNE.2")])
The result is the same tSNE plot, now colored by library id value. This should match up the order of samples in your agg_samples.csv file when you ran cellranger aggr.
Clearly in my case samples 3 and 4 are distinct from samples 1 and 2, which overlap nicely.
Now let’s load the pre-calculated cluster data and explore a little. Cellranger automatically creates Kmeans clusters with k ranging from 2 to 10.
n_clu=2:10 km_res=analysis_results$clustering clu_res=sapply(n_clu,function(x) km_res[[paste("kmeans",x,"clusters",sep="_")]]$Cluster) colnames(clu_res)=sapply(n_clu,function(x) paste("kmeans",x,sep=".")) visualize_clusters(clu_res,tsne_proj[c("TSNE.1","TSNE.2")])
This will plot panels for each k value, showing how the tSNE plot is divided by that number of clusters.
I’ll choose k=5 to examine more closely.
example_K=5 example_col=rev(brewer.pal(example_K,"Set2")) cluster_result=analysis_results$clustering[[paste("kmeans",example_K,"clusters",sep="_")]] visualize_clusters(cluster_result$Cluster,tsne_proj[c("TSNE.1","TSNE.2")],colour=example_col)
We’ll next extract genes that best distinguish the 5 clusters and draw a heatmap of the top 3 genes to compare among clusters.
The numbers of cells per cluster and the proportion of cells in each cluster is displayed with this function:
cell_composition(cluster_result$Cluster) Cell composition: 1 2 3 4 5 num_cells 1.6848e+04 1.2031e+04 1.1501e+04 9348.00000 7847.00000 proportion 2.9263e-01 2.0896e-01 1.9976e-01 0.16236 0.13629
Finally, you can output gene lists by cluster. First be sure to create a “gene_sets” folder within your working directory, then: