Seurat Chapter 1: Analyzing Single Samples

As I’ve learned more about the power of Seurat, I think it’ll be clearest if I split posts into three examples:

  1. Analyzing a single sample
  2. Combining and analyzing two samples
  3. Analyzing multiple (>2) samples

Each has a slightly novel way of dealing with the data and each builds on the previous example.

Single Sample

Based on my earlier post to run raw 10X Genomics sequencing output (fastq files) on a cluster to count transcripts and interpret barcodes, this post will start with the standard directory and file structure output by the cellranger count command.

You should already have installed R and RStudio.  Install Seurat using the RStudio Packages pane.  Click “Install” and start typing “Seurat.”  The Seurat version available in CRAN should be v.2.3.3 and should load automatically along with any other required packages.

In RStudio, use the Files pane to find a convenient location for your working files and output.  Choose the “More/Set as working directory” command.  For all of the following example commands, you can also download my R script file.

Load Seurat and dplyr packages into the workspace:

library(Seurat)
library(dplyr)

For my example, I’m going to rely heavily on a vignette from the Seurat authors.  That vignette does a nice job explaining the algorithms behind each step but I’m going to focus only on the procedure and outcomes.

Load Data

To load your sample, determine the location of the directory named “filtered_gene_bc_matrices.”  Under that should be a folder named with your reference genome–in my case it’s “mm10”.  Using this location (relative to the current working directory–my working directory is adjacent to the sample directory), read the 10X Genomics output into an object.

s1.data=Read10X(data.dir="../sample1/outs/filtered_gene_bc_matrices/mm10/")

Next, create a Seurat data model from this raw data.  Seurat wants a project name (I used “iMOP”) and a filter to include only genes expressed in a minimum number of cells, here I chose 5 cells.  There are many more options you can add at this stage but for now we’ll take our analysis stepwise through normalization and scaling to see how this works.

s1=CreateSeuratObject(raw.data=s1.data,project="iMOP",min.cells=5)

Count mitochondrial genes expressed

Both as a QC step and for scaling (below), let’s count the number of mitochondrial genes we saw per cell.  First we’ll get a list of gene symbols (for mouse, all start with “mt-“).

mito.genes=grep("^mt-",rownames(x=s1.data),value=T)

I found 13 of the 37 mitochondrial genes in my sample, so this produces a vector of those 13 gene symbols.  Use the summed counts of these genes per barcode, divided by the total numbers of counts for all genes, to get percent mitochondrial for each cell.

percent.mito=Matrix::colSums(s1@raw.data[mito.genes,])/Matrix::colSums(s1@raw.data)

To get a sense of the distribution of values, use the summary command:

summary(percent.mito)
Min.     1st Qu.  Median   Mean     3rd Qu.  Max.
0.005147 0.044402 0.054328 0.057863 0.066207 0.733269

My sample has a reasonably low overall rate (mean & median ~5%) but a few cells with a high rate (max 73%).

Seurat has a convenient slot named metadata that you can use to store things like this.  This slot will come up again later when we add more samples in future posts.

s1=AddMetaData(s1,metadata = percent.mito,col.name="percent.mito")

Visualize cells and genes as distributions

You can now visualize the overall distributions of detected genes (“nGene”), numbers of unique molecules (as UMI ; “nUMI”), and the percent mitochondrial genes per cell, in the form of jittered dots overlaid on a violin plot:

VlnPlot(s1,features.plot = c("nGene","nUMI","percent.mito"),nCol=3)
Violin plot showing distributions of genes, cells, and percent mitochondrial genes per cell.

Note that the scales for each plot are different.

Relationship between the numbers of cells and (left) the percent mitochondrial genes per cell and (right) the number of detected genes per cell.

 

Here’s a plot (with two panels) to show the relationships between the number of molecules (nUMI) and the (left) percent mitochondrial genes (from the percent.mito slot we added to the metadata) and (right) the total number of genes (nGene).

par(mfrow=c(1,2))
GenePlot(s1,gene1="nUMI",gene2="percent.mito")
GenePlot(s1,gene1="nUMI",gene2="nGene")

Filter, normalize, and scale

Now let’s clean up the data with filtering and normalization. Remove cells with low gene counts (here, 500).  You can also choose to exclude cells with unusually high counts, or, as I’ve done here, set the threshold to infinity.

s1=FilterCells(s1,subset.names="nGene",low.thresholds = 500,high.thresholds = Inf)

Normalize (using default parameters):

s1=NormalizeData(s1,normalization.method = "LogNormalize", scale.factor = 10000)

Before scaling, let’s find the genes with the greatest variance by cell:

s1=FindVariableGenes(s1,do.plot=T)
Variable genes dispersion plot

Then scale, using the percent.mito as part of the regression:

s1=ScaleData(s1,vars.to.regress = c("nUMI","percent.mito"))

Principal Components Analysis

Using the most variable genes (from the FindVariableGenes function, stored in the var.genes slot), calculate principal components:

s1=RunPCA(s1,pc.genes=s1@var.genes,do.print=T,pcs.print = 1:5,genes.print = 5)

Seurat includes a number of visualization tools.  Here’s example commands for each of them:

#lists top 5 genes per PC
PrintPCA(s1,pcs.print = 1:5,genes.print = 5,use.full=F) 

#plots component of each of top genes per PC
VizPCA(s1,pcs.use = 1:2) 

#all cells plotted on first two PCs
PCAPlot(s1,dim.1=1,dim.2=2) 

#heatmap showing top genes for first PC only
PCHeatmap(s1,pc.use=1,cells.use=500,do.balanced = T,label.columns = F)
VizPlot output for first two PCs

 

 

 

 

The heatmap command can also be useful to visualize how many PCs explain variance–try 6, 12, or 18.

PCHeatmap(s1,pc.use=1:6,cells.use = 500,do.balanced = T,label.columns = F,use.full=F)
Heatmap of first 6 PCs. Note how the heatmaps become less distinct with higher-numbered PCs.

 

 

 

 

Seurat also includes a method to evaluate statistically significant PCs using jackStraw:

s1=JackStraw(s1,num.replicate=100,display.progress = T)

#to find how many are significant
JackStrawPlot(s1,PCs=1:18)
#another, simpler way to visualize 
PCElbowPlot(s1)
PCA Elbow Plot.

Clustering

The graph-based clustering method in Seurat relies on the PCA space for data reduction and uses methods similar to KNN and SLM–see the Seurat website for details.  Choose how many PC dimensions you want to include based on the elbow plot.

s1=FindClusters(s1,reduction.type = "pca",dims.use=1:10,resolution=0.6,print.output = 0,save.SNN = T)

#same PCA plot as above, now colored by cluster
PCAPlot(s1,dim.1=1,dim.2=2)

A better way to visualize clusters is to use the tSNE plot:

s1=RunTSNE(s1,dims.use=1:10,do.fast=T)
TSNEPlot(s1) # plot in tSNE space
tSNE Plot Colored by Cluster Number

Genes distinguishing clusters

You can then search for genes that distinguish clusters.  In this example, I chose cluster 1 because I knew it expressed a number of characteristic neuronal markers.  In this example I asked for genes distinguishing cluster 1 from all other clusters but you can also do binary comparisons if you like, using “ident.2” as another argument.

cluster1.markers=FindMarkers(s1,ident.1=1,min.pct=0.25)
print(head(cluster1.markers,n=5))
      p_val avg_logFC pct.1 pct.2 p_val_adj
Aldoc     0 0.9771343 0.935 0.511         0
Hes5      0 0.8220392 0.494 0.179         0
Fabp7     0 0.7561873 0.821 0.405         0
Sparc     0 0.7385394 0.780 0.372         0
Gstm1     0 0.7359628 0.726 0.386         0

Extend this to search for markers for all clusters:

s1.markers=FindAllMarkers(s1,only.pos=T,min.pct=0.25,thresh.use=0.25)
#Display a short table showing top 2 genes per cluster
s1.markers %>% group_by(cluster) %>% top_n(2,avg_logFC)
# A tibble: 16 x 7
# Groups:   cluster [8]
       p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene    
       <dbl>     <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>   
 1 7.62e- 38     0.352 0.306 0.212 1.07e- 33 0       Ptgds   
 2 4.24e- 30     0.325 0.413 0.331 5.93e- 26 0       Selenbp1
 3 0.            0.977 0.935 0.511 0.        1       Aldoc   
 4 6.58e-298     0.829 0.499 0.189 9.21e-294 1       Apoe    
 5 0.            1.36  0.757 0.298 0.        2       Tubb3   
 6 0.            1.16  0.729 0.337 0.        2       Mllt11  
 7 0.            1.37  0.686 0.145 0.        3       Top2a   
 8 0.            1.23  0.851 0.326 0.        3       Hmgb2   
 9 0.            1.51  0.978 0.647 0.        4       Dbi     
10 0.            1.37  0.603 0.123 0.        4       Ntrk2   
11 3.49e- 74     0.512 0.707 0.38  4.88e- 70 5       Hmgb2   
12 7.08e- 67     0.489 0.473 0.204 9.91e- 63 5       Top2a   
13 0.            1.29  0.61  0.103 0.        6       Lhx1    
14 1.15e-222     1.12  0.712 0.242 1.61e-218 6       Cks1b   
15 0.            1.67  0.766 0.137 0.        7       Notch2  
16 1.72e-174     0.944 0.317 0.052 2.40e-170 7       Slc39a1 

This is only one of the possible differential expression tests available–see this vignette to get a list of all of them.  For one more example, we’ll use the “roc” method to identify differential genes:

cluster1.roc=FindMarkers(s1,ident.1 = 1,thresh.use=0.25,test.use="roc",only.pos=T)
print(head(cluster1.roc,n=5))
      myAUC  avg_diff power avg_logFC pct.1 pct.2 p_val_adj
Aldoc 0.807 0.9771343 0.614 0.9771343 0.935 0.511        NA
Cst3  0.793 0.7022215 0.586 0.7022215 0.997 0.848        NA
Ptn   0.746 0.6726631 0.492 0.6726631 0.918 0.576        NA
Fabp7 0.736 0.7561873 0.472 0.7561873 0.821 0.405        NA
Itm2b 0.736 0.6530243 0.472 0.6530243 0.905 0.588        NA

You can now plot distributions of expression for each cluster for specific genes in a violin plot:

VlnPlot(s1,features.plot = c("Aldoc","Hmgb2"))
Violin plot for two genes split by cluster number

Or you can plot a longer list overlaid on tSNE plots:

FeaturePlot(s1,features.plot = c("Ptgds","Aldoc","Tubb3","Top2a","Dbi","Hmgb2","Lhx1","Notch2"),cols.use=c("grey","blue"),reduction.use="tsne")
tSNE plot highlighting top genes per cluster

Finally, you can visualize the top gene markers per cluster and plot a heatmap across all cells:

top10=s1.markers %>% group_by(cluster) %>% top_n(10,avg_logFC)
DoHeatmap(s1,genes.use=top10$gene,slim.col.label=T,remove.key = T)
Heatmap for top genes per cluster, for each cell within each cluster

Next up, two samples…

 

 

Advertisement

7 thoughts on “Seurat Chapter 1: Analyzing Single Samples”

  1. hi! I’m a little new to this so I might be wrong, but I believe UMIs are actually used to tag unique mRNA molecules, not cells (they’re referred to as cells several times here, eg during the QC section). let me know if I have this mixed up. thanks!

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: