We’ve already seen how to load data into a Seurat object and explore sub-populations of cells within a sample, but often we’ll want to compare two samples, such as drug-treated vs. control. In this example we’ll use one sample made from a proliferating neuronal precursor cells (“Prolif”) and one that’s been differentiated into post-mitotic neurons (“Neurons”). A key aspect of doing this with single-cell RNAseq is that we won’t assume that either population is uniform.
As with previous posts, much of this follows vignettes posted on the Seurat website very closely.
Computing Resources
Before beginning, we should consider that these analyses require substantial computing resources. I use a Windows 10 desktop with 10 physical processors (20 logical processors; although most R functions are not multi-threaded by default) and 64 Gb of RAM. Some of the steps here took more than 10 minutes to run on this system. Saving the final data environment to an .RData file can take a while. So think about where to run this kind of work before you start on an old Mac laptop!
Load Data and Create Objects
As with the single-sample example, the steps are to load the 10X Genomics cellranger output into a data object, create a Seurat object, add metadata, filter, normalize, and scale. But before combining two objects, we need to add a sample-specific identifier to each UMI.
First, load the libraries and the data:
library(Seurat) library(dplyr) #create samples objects for samples 2 and 4 s2.data=Read10X(data.dir="../../sample2/outs/filtered_gene_bc_matrices/mm10/") s4.data=Read10X(data.dir="../../sample4/outs/filtered_gene_bc_matrices/mm10/")
Seurat provides a function “RenameCells” but I could never get that to work as expected. So I found a simple trick to use standard R functions (paste) to add a sample-specific string to each UMI string.
#prepend cell IDs (UMIs) with sample identifier to distinguish samples after merging colnames(x = s2.data) <- paste('s2', colnames(x = s2.data), sep = '_') colnames(x = s4.data) <- paste('s4', colnames(x = s4.data), sep = '_')
Now each UMI (for example, “AAACCTGAGCGAGAAA,” which is likely found in every sample) is replaced by one that distinguishes cells from each sample (such as “s2_AAACCTGAGCGAGAAA“).
Now we’re ready to create and pre-process our two Seurat objects:
#for each, create object, add metadata, filter, normalize, and scale s2=CreateSeuratObject(raw.data=s2.data,project="iMOP",min.cells=5) s2@meta.data$group="Neurons" s2=FilterCells(s2,subset.names="nGene",low.thresholds = 500,high.thresholds = Inf) s2=NormalizeData(s2) s2=ScaleData(s2,display.progress = T) s4=CreateSeuratObject(raw.data=s4.data,project="iMOP",min.cells=5) s4@meta.data$group="Prolif" s4=FilterCells(s4,subset.names="nGene",low.thresholds = 500,high.thresholds = Inf) s4=NormalizeData(s4) s4=ScaleData(s4,display.progress = T)
Combine Samples with CCA
Before proceeding to a canonical correlation analysis (CCA, which also combines the two samples), let’s find the 1000 genes from each sample with the highest dispersion. Then we’ll combine the two lists and confirm that they are found in both samples.
#select variable genes common to both samples s2=FindVariableGenes(s2,do.plot=T) s4=FindVariableGenes(s4,do.plot=T) g.2=head(rownames(s2@hvg.info),1000) g.4=head(rownames(s4@hvg.info),1000) genes.use=unique(c(g.2,g.4)) genes.use=intersect(genes.use,rownames(s2@scale.data)) genes.use=intersect(genes.use,rownames(s4@scale.data))
We can now use RunCCA to combine the two samples and also to identify common sources of variation between the two datasets.
#do CCA agg=RunCCA(s2,s4,genes.use=genes.use,num.cc=30)
To visualize the overlap of the two samples in CCA space and also to check distribution of expression signals, we can create two diagnostic plots:
p1=DimPlot(object=agg,reduction.use="cca",group.by="group",pt.size=0.5,do.return=T) p2=VlnPlot(object=agg,features.plot="CC1",group.by="group",do.return=T) plot_grid(p1,p2)

The PrintDim function outputs the top distinguishing genes in each CCA dimension.
PrintDim(object=agg,reduction.type="cca",dims.print=1:2,genes.print=10)
Finally, this plot shows the smoothed shared correlation strength versus CCA dimension number, to evaluate how many dimensions are useful.
MetageneBicorPlot(agg,grouping.var="group",dims.eval=1:30,display.progress=T)
A heatmap is plotted to associate the most variable genes with each cluster. For this example, I only plotted the first 9 CCs.
DimHeatmap(object=agg,reduction.type="cca",cells.use=500,dim.use=1:9,do.balanced=T)
With this you can now align the data to the CCA subspace–choose the number of CC dimensions that make sense for your sample. Note that each of these dimension reduction steps produces a new set of data under the @dr slot, so you can refer to this for clustering. After this point, you will have both “cca” and “cca.aligned” under this slot.
agg=AlignSubspace(agg,reduction.type="cca",grouping.var="group",dims.align=1:16)
Let’s plot distributions, as violin plots, for each of the first two CC dimensions.
p1=VlnPlot(object=agg,features.plot="ACC1",group.by="group",do.return=T) p2=VlnPlot(object=agg,features.plot="ACC2",group.by="group",do.return=T) plot_grid(p1,p2)
tSNE
One visualization method is to project the data into tSNE space. We can also use the cca.aligned data to find clusters. Note that you can specify how many CC dimensions to use for clustering and also specify the “resolution.” A resolution greater than one favors more clusters, less than one favors fewer clusters. Plot the tSNE space showing the sample identifier (“group”) and the clusters.
agg=RunTSNE(agg,reduction.use="cca.aligned",dims.use=1:16,do.fast=T) agg=FindClusters(agg,reduction.type="cca.aligned",resolution=0.6,dims.use=1:16) p1=TSNEPlot(agg,do.return=T,pt.size=0.5,group.by="group") p2=TSNEPlot(agg,do.label=T,do.return=T,pt.size=0.5) plot_grid(p1,p2)
Interestingly, with this dataset, tSNE did not turn out to separate the proliferating cells well from the neurons. There’s also a new @dr dataset named “tsne”. There’s 8 clusters and some clear overlap with samples, but it’s kind of a mess.
Principal Components Analysis
So next I tried principal components. For this I used only the @var.genes slot of the combined object, which has fewer genes than the genes.use list created above. I ask for a list of 5 distinguishing genes for each of the first 5 principal components.
agg=RunPCA(agg,pc.genes=agg@var.genes,do.print=T,pcs.print = 1:5,genes.print = 5)
Here’s the output:
[1] "PC1"
[1] "Malat1" "Cst3" "mt-Co1" "mt-Co3" "Itm2b"
[1] ""
[1] "Eif5a" "Lyar" "H2afz" "Ncl" "Ran"
[1] ""
[1] ""
[1] "PC2"
[1] "Rpl32" "Rps5" "Rps4x" "Eef1a1" "Rps7"
[1] ""
[1] "Dbi" "Lgals1" "Adh7" "Igfbp2" "mt-Atp6"
[1] ""
[1] ""
[1] "PC3"
[1] "Phgdh" "Aldoc" "Eef1d" "Ptn" "Eif1"
[1] ""
[1] "mt-Nd1" "Sox11" "Elavl3" "mt-Nd2" "mt-Atp6"
[1] ""
[1] ""
[1] "PC4"
[1] "Rps27a" "Rpl37" "Rps23" "Rpl32" "Rpl26"
[1] ""
[1] "Tubb3" "Rtn1" "Elavl3" "Stmn1" "Tuba1a"
[1] ""
[1] ""
[1] "PC5"
[1] "Tubb3" "Tuba1a" "Stmn2" "Elavl3" "Calm2"
[1] ""
[1] "Ybx3" "Mtdh" "mt-Nd2" "Kcnq1ot1" "Rplp1"
[1] ""
[1] ""
This looks promising (based on the genes). Note that it also adds another dataset in the @dr slot named “pca”. Try more visualizations.
#various ways to show output from PCA VizPCA(agg,pcs.use = 1:2) #plots component of each of top genes per PC
PCAPlot(agg,dim.1=1,dim.2=2) #all cells plotted on first two PCs
PCAPlot(agg,dim.1=1,dim.2=2,group.by="group") #show source samples
This shows a very clear distinction between the starting proliferating cells and the resulting neurons. That’s the kind of display I was looking for. Let’s see which genes distinguish a few PCs.
PCHeatmap(agg,pc.use=1,cells.use=500,do.balanced = T,label.columns = F) #first PC only
#try running more PCs to visualize how many explain variance PCHeatmap(agg,pc.use=1:6,cells.use = 500,do.balanced = T,label.columns = F,use.full=F)
Looks good. Next, project all the data onto PC space for differential expression analysis.
agg=ProjectPCA(agg,do.print = F)Differential Expression by Sample
Differential Expression by Sample
Before re-clustering in PCA space, let’s get lists of genes that are differentially expressed by input sample. To do this, we’ll overwrite the @ident slot (which contains the cluster identities from the first clustering attempt) with sample group names (from the metadata).
agg=SetAllIdent(agg,id="group") cell.markers=FindMarkers(agg,ident.1="Neurons",ident.2="Prolif",test.use="wilcox")
With this list of DE genes, we can also visualize results as though we had standard RNAseq samples, by averaging the cells within a group and plotting a scatterplot. The authors of Seurat posted a few nice functions for adding labels to a few gene dots on this plot, which you can download from this page. I stored the R code for the functions in a separate file, named immune_alignment_functions.R, which I source to load the functions.
prolif.mrkrs=rownames(head(cell.markers[order(-cell.markers$pct.2 + cell.markers$pct.1),],5)) neu.mrkrs=rownames(head(cell.markers[order(cell.markers$pct.2 - cell.markers$pct.1),],5)) #create averaged data model avg.cells=log1p(AverageExpression(agg,show.progress = F)) avg.cells$gene=rownames(avg.cells) #load scripts for labeling scatterplot source('immune_alignment_functions.R') #plot averaged data highlighting greatest differences p1=ggplot(avg.cells,aes(Neurons,Prolif))+geom_point(size=.75) p1=LabelUR(p1,genes=neu.mrkrs,avg.cells,adj.u.t=.3,adj.u.s=.23) p1=LabelUL(p1,genes=prolif.mrkrs,avg.cells,adj.u.t=.5,adj.u.s=.4,adj.l.t=.25,adj.l.s=.25) plot(p1)
You can also view the top DE genes as an array of PCA space plots:
FeaturePlot(agg,features.plot = c(neu.mrkrs,prolif.mrkrs),cols.use=c("grey","blue"),reduction.use="pca")
This shows a nice partition of neuron markers extending to the left and proliferative cell markers to the right.
JackStraw PCA
Seurat includes a more robust function for finding statistically significant PCs through the jackStraw algorithm. Try this an plot output.
agg=JackStraw(agg,num.replicate=100,display.progress = T) JackStrawPlot(agg,PCs=1:18) #to find how many are significant PCElbowPlot(agg) #another, simpler way to visualize
For me, the elbow plot is the most useful. It seems to say that the first 9 or 10 PCs really capture the majority of the variance.
Clustering and DE in PCA Space
So using only PCs 1:9, let’s try clustering in PCA space. I set the k to 2 (intending to find clusters focused on my two samples) and very low resolution of 0.1 to push towards fewer clusters at this point.
agg=FindClusters(agg,genes.use=agg@var.genes,reduction.type="pca",dims.use=1:9,k.param=4,save.SNN=T,force.recalc=T,plot.SNN=F,resolution=.1) p1=PCAPlot(agg,dim.1=1,dim.2=2,group.by="group") p2=PCAPlot(agg,dim.1=1,dim.2=2,group.by="ident") plot_grid(p1,p2)
Looks good–there’s a clear distinction between neurons and proliferating cells but a separation within each group. Let’s find which genes distinguish all four clusters.
all.markers=FindAllMarkers(agg,only.pos=T,min.pct=0.25,thresh.use=0.25)
all.markers %>% group_by(cluster) %>% top_n(5,avg_logFC)
This is the output showing the top 5 genes per cluster:
# A tibble: 20 x 7 # Groups: cluster [4] p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr> 1 0 2.47 0.986 0.514 0 0 Cst3 2 0 2.04 0.864 0.246 0 0 Itm2b 3 0 2.02 1 0.911 0 0 Malat1 4 0 1.79 0.787 0.249 0 0 Ramp1 5 0 1.73 0.857 0.484 0 0 Cd81 6 0 1.30 0.998 0.617 0 1 Eif5a 7 0 1.28 0.779 0.223 0 1 Vps8 8 0 1.26 0.925 0.209 0 1 Ddx21 9 0 1.21 0.977 0.416 0 1 Tomm5 10 0 1.20 0.936 0.264 0 1 Srm 11 0 1.52 0.608 0.088 0 2 Tubb3 12 0 1.18 0.878 0.474 0 2 Sox11 13 0 1.13 0.755 0.325 0 2 Stmn1 14 0 1.10 0.398 0.02 0 2 Elavl3 15 0 1.04 0.664 0.345 0 2 Mllt11 16 0 1.16 0.961 0.602 0 3 Eif2s2 17 0 1.13 0.991 0.773 0 3 Eif4a1 18 0 1.12 0.979 0.595 0 3 Ldha 19 0 1.11 1 0.922 0 3 Npm1 20 0 1.05 0.731 0.346 0 3 Ppa1
Make a table of the top 2 genes per cluster and plot dots showing which genes best characterize which cluster, split by sample group.
top2=all.markers %>% group_by(cluster) %>% top_n(2,avg_logFC) SplitDotPlotGG(agg,genes.plot=as.character(top2$gene),cols.use = c("blue","red"),x.lab.rot = T,plot.legend = T,dot.scale = 8,do.return = T,grouping.var = "group")
FeaturePlot(agg,features.plot = as.character(top2$gene),cols.use=c("grey","blue"),reduction.use="pca")
I think that clusters 1 and 3 best represents the proliferating cells, so let’s re-draw the scatterplot, labeling these top cluster genes appropriately (left/right labeling).
p1=ggplot(avg.cells,aes(Neurons,Prolif))+geom_point(size=.75) p1=LabelUR(p1,genes=subset(top2,cluster==0 | cluster==2 )$gene,avg.cells,adj.u.t=.3,adj.u.s=.23) p1=LabelUL(p1,genes=subset(top2,cluster==1 | cluster==3 )$gene,avg.cells,adj.u.t=.5,adj.u.s=.4,adj.l.t=.25,adj.l.s=.25) plot(p1)
There’s more to do–you can dump the all.markers object to a file. We can consider more clusters by increasing the resolution value in the FindAllMarkers step. Let the biology guide you!
One thought on “Seurat Chapter 2: Two Samples”