As I’ve learned more about the power of Seurat, I think it’ll be clearest if I split posts into three examples:
- Analyzing a single sample
- Combining and analyzing two samples
- Analyzing multiple (>2) samples
Each has a slightly novel way of dealing with the data and each builds on the previous example.
Based on my earlier post to run raw 10X Genomics sequencing output (fastq files) on a cluster to count transcripts and interpret barcodes, this post will start with the standard directory and file structure output by the cellranger count command.
You should already have installed R and RStudio. Install Seurat using the RStudio Packages pane. Click “Install” and start typing “Seurat.” The Seurat version available in CRAN should be v.2.3.3 and should load automatically along with any other required packages.
In RStudio, use the Files pane to find a convenient location for your working files and output. Choose the “More/Set as working directory” command. For all of the following example commands, you can also download my R script file.
Load Seurat and dplyr packages into the workspace:
For my example, I’m going to rely heavily on a vignette from the Seurat authors. That vignette does a nice job explaining the algorithms behind each step but I’m going to focus only on the procedure and outcomes.
To load your sample, determine the location of the directory named “filtered_gene_bc_matrices.” Under that should be a folder named with your reference genome–in my case it’s “mm10”. Using this location (relative to the current working directory–my working directory is adjacent to the sample directory), read the 10X Genomics output into an object.
Next, create a Seurat data model from this raw data. Seurat wants a project name (I used “iMOP”) and a filter to include only genes expressed in a minimum number of cells, here I chose 5 cells. There are many more options you can add at this stage but for now we’ll take our analysis stepwise through normalization and scaling to see how this works.
Count mitochondrial genes expressed
Both as a QC step and for scaling (below), let’s count the number of mitochondrial genes we saw per cell. First we’ll get a list of gene symbols (for mouse, all start with “mt-“).
I found 13 of the 37 mitochondrial genes in my sample, so this produces a vector of those 13 gene symbols. Use the summed counts of these genes per barcode, divided by the total numbers of counts for all genes, to get percent mitochondrial for each cell.
To get a sense of the distribution of values, use the summary command:
summary(percent.mito) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.005147 0.044402 0.054328 0.057863 0.066207 0.733269
My sample has a reasonably low overall rate (mean & median ~5%) but a few cells with a high rate (max 73%).
Seurat has a convenient slot named metadata that you can use to store things like this. This slot will come up again later when we add more samples in future posts.
s1=AddMetaData(s1,metadata = percent.mito,col.name="percent.mito")
Visualize cells and genes as distributions
You can now visualize the overall distributions of detected genes (“nGene”), numbers of unique molecules (as UMI ; “nUMI”), and the percent mitochondrial genes per cell, in the form of jittered dots overlaid on a violin plot:
VlnPlot(s1,features.plot = c("nGene","nUMI","percent.mito"),nCol=3)
Note that the scales for each plot are different.
Here’s a plot (with two panels) to show the relationships between the number of molecules (nUMI) and the (left) percent mitochondrial genes (from the percent.mito slot we added to the metadata) and (right) the total number of genes (nGene).
par(mfrow=c(1,2)) GenePlot(s1,gene1="nUMI",gene2="percent.mito") GenePlot(s1,gene1="nUMI",gene2="nGene")
Filter, normalize, and scale
Now let’s clean up the data with filtering and normalization. Remove cells with low gene counts (here, 500). You can also choose to exclude cells with unusually high counts, or, as I’ve done here, set the threshold to infinity.
s1=FilterCells(s1,subset.names="nGene",low.thresholds = 500,high.thresholds = Inf)
Normalize (using default parameters):
s1=NormalizeData(s1,normalization.method = "LogNormalize", scale.factor = 10000)
Before scaling, let’s find the genes with the greatest variance by cell:
Then scale, using the percent.mito as part of the regression:
s1=ScaleData(s1,vars.to.regress = c("nUMI","percent.mito"))
Principal Components Analysis
Using the most variable genes (from the FindVariableGenes function, stored in the var.genes slot), calculate principal components:
s1=RunPCA(s1,email@example.com,do.print=T,pcs.print = 1:5,genes.print = 5)
Seurat includes a number of visualization tools. Here’s example commands for each of them:
#lists top 5 genes per PC PrintPCA(s1,pcs.print = 1:5,genes.print = 5,use.full=F) #plots component of each of top genes per PC VizPCA(s1,pcs.use = 1:2) #all cells plotted on first two PCs PCAPlot(s1,dim.1=1,dim.2=2) #heatmap showing top genes for first PC only PCHeatmap(s1,pc.use=1,cells.use=500,do.balanced = T,label.columns = F)
The heatmap command can also be useful to visualize how many PCs explain variance–try 6, 12, or 18.
PCHeatmap(s1,pc.use=1:6,cells.use = 500,do.balanced = T,label.columns = F,use.full=F)
Seurat also includes a method to evaluate statistically significant PCs using jackStraw:
s1=JackStraw(s1,num.replicate=100,display.progress = T) #to find how many are significant JackStrawPlot(s1,PCs=1:18) #another, simpler way to visualize PCElbowPlot(s1)
The graph-based clustering method in Seurat relies on the PCA space for data reduction and uses methods similar to KNN and SLM–see the Seurat website for details. Choose how many PC dimensions you want to include based on the elbow plot.
s1=FindClusters(s1,reduction.type = "pca",dims.use=1:10,resolution=0.6,print.output = 0,save.SNN = T) #same PCA plot as above, now colored by cluster PCAPlot(s1,dim.1=1,dim.2=2)
A better way to visualize clusters is to use the tSNE plot:
s1=RunTSNE(s1,dims.use=1:10,do.fast=T) TSNEPlot(s1) # plot in tSNE space
Genes distinguishing clusters
You can then search for genes that distinguish clusters. In this example, I chose cluster 1 because I knew it expressed a number of characteristic neuronal markers. In this example I asked for genes distinguishing cluster 1 from all other clusters but you can also do binary comparisons if you like, using “ident.2” as another argument.
p_val avg_logFC pct.1 pct.2 p_val_adj Aldoc 0 0.9771343 0.935 0.511 0 Hes5 0 0.8220392 0.494 0.179 0 Fabp7 0 0.7561873 0.821 0.405 0 Sparc 0 0.7385394 0.780 0.372 0 Gstm1 0 0.7359628 0.726 0.386 0
Extend this to search for markers for all clusters:
s1.markers=FindAllMarkers(s1,only.pos=T,min.pct=0.25,thresh.use=0.25) #Display a short table showing top 2 genes per cluster s1.markers %>% group_by(cluster) %>% top_n(2,avg_logFC)
# A tibble: 16 x 7 # Groups: cluster  p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr> 1 7.62e- 38 0.352 0.306 0.212 1.07e- 33 0 Ptgds 2 4.24e- 30 0.325 0.413 0.331 5.93e- 26 0 Selenbp1 3 0. 0.977 0.935 0.511 0. 1 Aldoc 4 6.58e-298 0.829 0.499 0.189 9.21e-294 1 Apoe 5 0. 1.36 0.757 0.298 0. 2 Tubb3 6 0. 1.16 0.729 0.337 0. 2 Mllt11 7 0. 1.37 0.686 0.145 0. 3 Top2a 8 0. 1.23 0.851 0.326 0. 3 Hmgb2 9 0. 1.51 0.978 0.647 0. 4 Dbi 10 0. 1.37 0.603 0.123 0. 4 Ntrk2 11 3.49e- 74 0.512 0.707 0.38 4.88e- 70 5 Hmgb2 12 7.08e- 67 0.489 0.473 0.204 9.91e- 63 5 Top2a 13 0. 1.29 0.61 0.103 0. 6 Lhx1 14 1.15e-222 1.12 0.712 0.242 1.61e-218 6 Cks1b 15 0. 1.67 0.766 0.137 0. 7 Notch2 16 1.72e-174 0.944 0.317 0.052 2.40e-170 7 Slc39a1
This is only one of the possible differential expression tests available–see this vignette to get a list of all of them. For one more example, we’ll use the “roc” method to identify differential genes:
cluster1.roc=FindMarkers(s1,ident.1 = 1,thresh.use=0.25,test.use="roc",only.pos=T) print(head(cluster1.roc,n=5))
myAUC avg_diff power avg_logFC pct.1 pct.2 p_val_adj Aldoc 0.807 0.9771343 0.614 0.9771343 0.935 0.511 NA Cst3 0.793 0.7022215 0.586 0.7022215 0.997 0.848 NA Ptn 0.746 0.6726631 0.492 0.6726631 0.918 0.576 NA Fabp7 0.736 0.7561873 0.472 0.7561873 0.821 0.405 NA Itm2b 0.736 0.6530243 0.472 0.6530243 0.905 0.588 NA
You can now plot distributions of expression for each cluster for specific genes in a violin plot:
VlnPlot(s1,features.plot = c("Aldoc","Hmgb2"))
Or you can plot a longer list overlaid on tSNE plots:
FeaturePlot(s1,features.plot = c("Ptgds","Aldoc","Tubb3","Top2a","Dbi","Hmgb2","Lhx1","Notch2"),cols.use=c("grey","blue"),reduction.use="tsne")
Finally, you can visualize the top gene markers per cluster and plot a heatmap across all cells:
top10=s1.markers %>% group_by(cluster) %>% top_n(10,avg_logFC) DoHeatmap(s1,genes.use=top10$gene,slim.col.label=T,remove.key = T)
Next up, two samples…