Running Seurat as an R Script in a Singularity Container Using SLURM

(Updated for Singularity v3, Ubuntu 18.04, and R 3.6.1)

A previous post provided a step-by-step example for setting up a Singularity container for use on the HPC (in my case, Perceval).  Here we’ll see how to build a more complex Singularity recipe, create a distributable container, and use it to run a few steps of Seurat as an Rscript batch file.  This approach lets you run any version of R and its packages that you need on the HPC in a secure container, and it frees you from the limitations of running R on your desktop.

Permissions

A key feature of Singularity is its control of user-level access and security.  This means you don’t need root privileges on the HPC.  However, you do need to be root to build your container on a local machine. To summarize:

  • All steps to build the container using the recipe file and any manual additions need to be run as root (sudo) on your local system.
  • Running the container on the HPC cannot be done as root.  You will use only your assigned username permissions.

The Container Recipe

I downloaded a few sample recipe files from the Singularity Hub, including one that did a nice job of setting up for running R or Rscript.  You can download my recipe file from this link.  I’ll highlight some of the sections to explain how each works.

Bootstrap: docker
From: ubuntu:18.04
IncludeCmd: yes

Use the Docker repository to download Ubuntu Linux v. 18.04 (Bionic), which is quite stable.

Next is the %environment section; some of these lines I copied as-is from the downloaded sample file.

%environment
R_VERSION=3.6.1
export R_VERSION
R_CONFIG_DIR=/etc/R/
export R_CONFIG_DIR
export LC_ALL=C
export PATH=$PATH

The %labels section is free-form, but it provides labels useful for tracking your versions, at least.

%labels
Author Ron Hart
Version v0.0.5
R_Version 3.6.1
build_date 2019 Aug 26
R_bioconductor True

Next are %apprun and %runscript.  After you’ve built the container, these describe the desired behavior when you run “singularity run” (run the %runscript command) or “singularity run --app Rscript” (run the %apprun Rscript command).  Here are the lines from the file:

%apprun R
exec R "$@"

%apprun Rscript
exec Rscript "$@"

%runscript
exec R "$@"

In each of these, the “$@” appends whatever arguments you pass to the singularity run command.  More about that later.
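
For example (the script name is just a placeholder here), this host command:

singularity run --app Rscript bioconductorBox.simg myScript.R

ends up executing this inside the container:

Rscript myScript.R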

The final section is the most important and useful one.  The %post section runs after the container has been bootstrapped with Ubuntu.  Here you issue standard Ubuntu commands to load all the packages you’ll need.  First the system packages:

%post
  apt-get update
  apt-get install -y apt-transport-https apt-utils software-properties-common
  apt-get install -y add-apt-key
  export DEBIAN_FRONTEND=noninteractive
  ln -fs /usr/share/zoneinfo/America/New_York /etc/localtime
  apt-get install -y tzdata
  dpkg-reconfigure --frontend noninteractive tzdata

  #add CRAN/Ubuntu repo, add key, then refresh
  apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
  add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/'
  apt-get update

  apt-get install -y wget nano
  apt-get install -y libblas3 libblas-dev liblapack-dev liblapack3 curl
  apt-get install -y gcc fort77 aptitude
  aptitude install -y g++
  aptitude install -y xorg-dev
  aptitude install -y libreadline-dev
  aptitude install -y gfortran
  gfortran --version
  apt-get install -y libssl-dev libxml2-dev libpcre3-dev liblzma-dev libbz2-dev libcurl4-openssl-dev 
  apt-get install -y libhdf5-dev hdf5-helpers libmariadb-client-lgpl-dev

  apt-get install -y r-base r-base-dev
  
  R --version
  
  # installing packages from cran
  R --slave -e 'install.packages("devtools",repos="https://cran.rstudio.com/")'
  R --slave -e 'install.packages("dplyr",repos="https://cran.rstudio.com/")'
  R --slave -e 'install.packages("rhdr5",repos="https://cran.rstudio.com/")'
  R --slave -e 'install.packages("Seurat",repos="https://cran.rstudio.com/")'

  # installing from bioc
  R --slave -e 'if (!requireNamespace("BiocManager",quietly=TRUE)) install.packages("BiocManager")'
  R --slave -e 'BiocManager::install()'
  R --slave -e 'BiocManager::install("tximport",ask=FALSE)'
  R --slave -e 'BiocManager::install("biomaRt",ask=FALSE)'
  R --slave -e 'BiocManager::install("DESeq2",ask=FALSE)'

 

It took me some work to figure out which steps were required and the correct order of execution.

  1. Start by refreshing the apt-get data from the pre-installed repositories.
  2. Due to the relatively new tzdata setup, you need to specify your time zone; otherwise you’ll be asked for it interactively in the middle of the build.  Modify the setting as needed.
  3. Because we will add a new repository over a secure connection (https), we need the secure transport tool.  We also need some utils and common tools so we can add repositories and keys from the command line.
  4. Once those are installed, we can add the CRAN/Ubuntu repository using a command (instead of editing it into the /etc/apt/sources.list system file).  After that, add the public key for this repository.  Then refresh the repositories.
  5. Next we need several packages that are required to install R (which involves compilation) plus some system libraries required by the specific R packages that we’ll load later.  The wget and nano packages are included for convenience but aren’t really needed.  The gfortran --version is just a safety check that gfortran installed correctly; it would throw an error if the install didn’t work.  We will need the hdf5 and mariadb packages.  If you don’t want all the packages I listed, you can delete some of these.
  6. The last apt-get installs R.
  7. Next we’ll use several steps to install packages in R.  Using --slave eliminates most unneeded output.  The -e flag says to execute the quoted string from the command line.  I specified which repos to use just in case.
  8. After some cran installs, I switch to BiocManager::install() from bioconductor for some packages.

With this recipe, you can first create a sandbox to test it (if you wish), using this command on a local Linux system:

sudo singularity build --sandbox biocBox/ Singularity.bioconductor-make.txt

This takes a while to run on my Linux desktop, maybe ~30 minutes.  When it’s done you can sudo singularity shell into the sandbox and check it.  If everything works, go ahead and convert to a non-writable image file:
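
A quick sanity check might look like this (R -e runs a one-line expression):

sudo singularity shell biocBox/
# then, at the container prompt:
R -e 'library(Seurat)'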

sudo singularity build bioconductorBox.simg biocBox/

This produced an image file that was only 569M!  Transfer this to your space on the HPC.  I used a “containers” folder under my /scratch space:

scp bioconductorBox.simg user@perceval.rutgers.edu:/scratch/user/containers/.

Rscript file

We’ll need a file containing R commands that can run non-interactively.  That is, if you’re used to working in RStudio (as I am) you need to consider every step to make sure it will work without you typing anything.  The most important part is to capture and explicitly save any graphics output into files.

For this example, I’ll take commands from the blog post using Seurat with single samples.  The Rscript file can be downloaded here.  (This is an old version that might not be compatible with the current Singularity build.)

The most important and possibly confusing part is where I specify the working directory.  Since I plan to mount my /scratch/user under the /mnt mountpoint, the location that is actually /scratch/user/sample1 needs to be called as /mnt/sample1 when running inside the container.
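
In the Rscript file itself, that mapping is just a setwd() call; a minimal sketch:

#/scratch/user/sample1 on the host is /mnt/sample1 inside the container
setwd("/mnt/sample1")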

I played around with using ggplot() for the ggplot2 objects but I found that the more basic pdf(), followed by the plot command, followed by dev.off() works more consistently.
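
As a sketch, assuming a Seurat object named s1 and the /mnt working directory from above:

#open a PDF device, draw the plot, then close the device
pdf("/mnt/sample1/violins.pdf")
VlnPlot(s1, features.plot = c("nGene", "nUMI", "percent.mito"), nCol = 3)
dev.off()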

Any text output will be collected into files that will be specified in my SLURM script.

At the end, I explicitly save the workspace into a standard .RData file.  This allows me to download the file to my desktop, open it with RStudio, and continue data exploration.  All the data are loaded, normalized and scaled.  When multiple samples are combined, you should consider also deleting (rm()) all the original single-sample objects to make a smaller workspace.
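
The end of the script is along these lines (the object names are illustrative):

#drop large raw objects, then save everything for later exploration in RStudio
rm(s1.data)
save.image(file = "/mnt/sample1/seurat_session.RData")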

SLURM Script

Here comes the best part: running the container with your data.  It’s possible to work interactively in the container (singularity shell) from the command line if you request a compute node; this lets you type commands as you go.  But this post is about batch processing.  The final SLURM script can be downloaded here.

Because we’ll be running Seurat, which is currently single-threaded, we only ask for one node and one task, but a good amount (64 GB) of RAM.  Remember to give the script a reasonable execution time estimate.  The STDOUT and STDERR output are redirected to files for later review.  Finally, it’s not necessary, but I like to get an email when it’s done.
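
The header of my script looks roughly like this (job name, time estimate, paths, and email are placeholders to adapt):

#!/bin/bash

#SBATCH -J singBio #job name
#SBATCH -n 1 #one task on one node
#SBATCH --mem=64000 #64 GB of RAM
#SBATCH -t 3:59:00 #estimated run time
#SBATCH -p main #partition
#SBATCH --export=ALL
#SBATCH -o /scratch/user/containers/output/singBio-%j.out #STDOUT
#SBATCH -e /scratch/user/containers/output/singBio-%j.err #STDERR
#SBATCH --mail-type=END,FAIL #email on end or failure
#SBATCH --mail-user=user@rutgers.edu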

The first step is to manually load the singularity module.  As before, the module is hidden from the module spider command because it’s currently considered to be “under development” on our system.  This is designated by the dot in front of the version.
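
The load command looks something like this (the version string here is hypothetical; check module avail singularity to see yours):

module load singularity/.3.1.0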

The srun command calls singularity with the run option, then specifies which --app to run.  Remember that the recipe file gives us three behaviors: two run R (either run or run --app R) and one runs Rscript.

At this point we also mount the filespace we plan to use with the --bind option.  This lets us use /mnt to find all the files under /scratch/user.

Next is the name of the container to run.

Last is the argument we want to pass to the exec Rscript “$@” command.  By the time this command runs, we’re working inside the container, so we need the path as it appears after mounting the /scratch space; I put the R script file in my containers folder.  This took me several tries to figure out.

Here’s the full srun command:

srun singularity run --app Rscript --bind /scratch/user:/mnt bioconductorBox.simg /mnt/containers/autoSeurat.R

Run it!

Once the data are in place (in my example, under /scratch/user/sample1) and the container image, the R script file, and the SLURM script are uploaded (under /scratch/user/containers), you’re ready to go.  Be sure to create an output folder under containers to catch the STDOUT and STDERR files.

sbatch singBio.sh

You should get an email saying that it ran without errors.  Your STDOUT and STDERR output are in the containers/output folder.  The R script saved files are all in the working directory you specified in the R script file (/scratch/user/sample1/outs/filtered_gene_bc_matrices/mm10).


Seurat Chapter 2: Two Samples

We’ve already seen how to load data into a Seurat object and explore sub-populations of cells within a sample, but often we’ll want to compare two samples, such as drug-treated vs. control.  In this example we’ll use one sample of proliferating neuronal precursor cells (“Prolif”) and one that’s been differentiated into post-mitotic neurons (“Neurons”).  A key aspect of doing this with single-cell RNAseq is that we won’t assume that either population is uniform.

As with previous posts, much of this follows vignettes posted on the Seurat website very closely.

Computing Resources

Before beginning, we should consider that these analyses require substantial computing resources.  I use a Windows 10 desktop with 10 physical processors (20 logical processors, although most R functions are not multi-threaded by default) and 64 GB of RAM.  Some of the steps here took more than 10 minutes to run on this system, and saving the final data environment to an .RData file can take a while.  So think about where to run this kind of work before you start on an old Mac laptop!

Load Data and Create Objects

As with the single-sample example, the steps are to load the 10X Genomics cellranger output into a data object, create a Seurat object, add metadata, filter, normalize, and scale.  But before combining two objects, we need to add a sample-specific identifier to each UMI.

First, load the libraries and the data:

library(Seurat)
library(dplyr)

#create samples objects for samples 2 and 4 
s2.data=Read10X(data.dir="../../sample2/outs/filtered_gene_bc_matrices/mm10/")
s4.data=Read10X(data.dir="../../sample4/outs/filtered_gene_bc_matrices/mm10/")

Seurat provides a function, RenameCells, but I could never get it to work as expected.  So I used a simple trick with standard R functions (paste) to add a sample-specific string to each cell barcode.

#prepend cell barcodes with a sample identifier to distinguish samples after merging
colnames(x = s2.data) <- paste('s2', colnames(x = s2.data), sep = '_')
colnames(x = s4.data) <- paste('s4', colnames(x = s4.data), sep = '_')

Now each cell barcode (for example, “AAACCTGAGCGAGAAA”, which is likely found in every sample) is replaced by one that distinguishes cells from each sample (such as “s2_AAACCTGAGCGAGAAA”).

Now we’re ready to create and pre-process our two Seurat objects:

#for each, create object, add metadata, filter, normalize, and scale
s2=CreateSeuratObject(raw.data=s2.data,project="iMOP",min.cells=5)
s2@meta.data$group="Neurons"
s2=FilterCells(s2,subset.names="nGene",low.thresholds = 500,high.thresholds = Inf)
s2=NormalizeData(s2)
s2=ScaleData(s2,display.progress = T)

s4=CreateSeuratObject(raw.data=s4.data,project="iMOP",min.cells=5)
s4@meta.data$group="Prolif"
s4=FilterCells(s4,subset.names="nGene",low.thresholds = 500,high.thresholds = Inf)
s4=NormalizeData(s4)
s4=ScaleData(s4,display.progress = T)

Combine Samples with CCA

Before proceeding to a canonical correlation analysis (CCA, which also combines the two samples), let’s find the 1000 genes from each sample with the highest dispersion.  Then we’ll combine the two lists and confirm that they are found in both samples.

#select variable genes common to both samples
s2=FindVariableGenes(s2,do.plot=T)
s4=FindVariableGenes(s4,do.plot=T)
g.2=head(rownames(s2@hvg.info),1000)
g.4=head(rownames(s4@hvg.info),1000)
genes.use=unique(c(g.2,g.4))
genes.use=intersect(genes.use,rownames(s2@scale.data))
genes.use=intersect(genes.use,rownames(s4@scale.data))

We can now use RunCCA to combine the two samples and also to identify common sources of variation between the two datasets.

#do CCA
agg=RunCCA(s2,s4,genes.use=genes.use,num.cc=30)

To visualize the overlap of the two samples in CCA space and also to check distribution of expression signals, we can create two diagnostic plots:

p1=DimPlot(object=agg,reduction.use="cca",group.by="group",pt.size=0.5,do.return=T)
p2=VlnPlot(object=agg,features.plot="CC1",group.by="group",do.return=T)
plot_grid(p1,p2)
[Figure: CCA space plot and violin plot of abundance per sample.]

The PrintDim function outputs the top distinguishing genes in each CCA dimension.

PrintDim(object=agg,reduction.type="cca",dims.print=1:2,genes.print=10)

Finally, this plot shows the smoothed shared correlation strength versus CCA dimension number, to evaluate how many dimensions are useful.

MetageneBicorPlot(agg,grouping.var="group",dims.eval=1:30,display.progress=T)

[Figure: MetageneBicorPlot output.]

A heatmap is plotted to associate the most variable genes with each CCA dimension.  For this example, I only plotted the first 9 CCs.

DimHeatmap(object=agg,reduction.type="cca",cells.use=500,dim.use=1:9,do.balanced=T)

 

With this you can now align the data to the CCA subspace–choose the number of CC dimensions that make sense for your sample.  Note that each of these dimension reduction steps produces a new set of data under the @dr slot, so you can refer to this for clustering.  After this point, you will have both “cca” and “cca.aligned” under this slot.

agg=AlignSubspace(agg,reduction.type="cca",grouping.var="group",dims.align=1:16)

Let’s plot distributions, as violin plots, for each of the first two CC dimensions.

p1=VlnPlot(object=agg,features.plot="ACC1",group.by="group",do.return=T)
p2=VlnPlot(object=agg,features.plot="ACC2",group.by="group",do.return=T)
plot_grid(p1,p2)

[Figure: Violin plots of the first two aligned CC dimensions.]

tSNE

One visualization method is to project the data into tSNE space.  We can also use the cca.aligned data to find clusters.  Note that you can specify how many CC dimensions to use for clustering and also specify the “resolution.”  A resolution greater than one favors more clusters, less than one favors fewer clusters.  Plot the tSNE space showing the sample identifier (“group”) and the clusters.

agg=RunTSNE(agg,reduction.use="cca.aligned",dims.use=1:16,do.fast=T)
agg=FindClusters(agg,reduction.type="cca.aligned",resolution=0.6,dims.use=1:16)
p1=TSNEPlot(agg,do.return=T,pt.size=0.5,group.by="group")
p2=TSNEPlot(agg,do.label=T,do.return=T,pt.size=0.5)
plot_grid(p1,p2)

[Figure: tSNE plots colored by sample (left) and by cluster (right).]

Interestingly, with this dataset, tSNE did not turn out to separate the proliferating cells well from the neurons.  There’s also a new @dr dataset named “tsne”.  There are 8 clusters and some clear overlap with samples, but it’s kind of a mess.

Principal Components Analysis

So next I tried principal components.  For this I used only the @var.genes slot of the combined object, which has fewer genes than the genes.use list created above.  I ask for a list of 5 distinguishing genes for each of the first 5 principal components.

agg=RunPCA(agg,pc.genes=agg@var.genes,do.print=T,pcs.print = 1:5,genes.print = 5)

Here’s the output:

[1] "PC1"
[1] "Malat1" "Cst3"   "mt-Co1" "mt-Co3" "Itm2b" 
[1] ""
[1] "Eif5a" "Lyar"  "H2afz" "Ncl"   "Ran"  
[1] ""
[1] ""
[1] "PC2"
[1] "Rpl32"  "Rps5"   "Rps4x"  "Eef1a1" "Rps7"  
[1] ""
[1] "Dbi"     "Lgals1"  "Adh7"    "Igfbp2"  "mt-Atp6"
[1] ""
[1] ""
[1] "PC3"
[1] "Phgdh" "Aldoc" "Eef1d" "Ptn"   "Eif1" 
[1] ""
[1] "mt-Nd1"  "Sox11"   "Elavl3"  "mt-Nd2"  "mt-Atp6"
[1] ""
[1] ""
[1] "PC4"
[1] "Rps27a" "Rpl37"  "Rps23"  "Rpl32"  "Rpl26" 
[1] ""
[1] "Tubb3"  "Rtn1"   "Elavl3" "Stmn1"  "Tuba1a"
[1] ""
[1] ""
[1] "PC5"
[1] "Tubb3"  "Tuba1a" "Stmn2"  "Elavl3" "Calm2" 
[1] ""
[1] "Ybx3"     "Mtdh"     "mt-Nd2"   "Kcnq1ot1" "Rplp1"   
[1] ""
[1] ""

This looks promising (based on the genes).  Note that it also adds another dataset in the @dr slot named “pca”.  Try more visualizations.

#various ways to show output from PCA
VizPCA(agg,pcs.use = 1:2) #plots component of each of top genes per PC

[Figure: VizPCA output.]

PCAPlot(agg,dim.1=1,dim.2=2) #all cells plotted on first two PCs

[Figure: PCA plot colored by cluster.]

PCAPlot(agg,dim.1=1,dim.2=2,group.by="group") #show source samples

[Figure: PCA plot colored by sample group.]

This shows a very clear distinction between the starting proliferating cells and the resulting neurons.  That’s the kind of display I was looking for.  Let’s see which genes distinguish a few PCs.

PCHeatmap(agg,pc.use=1,cells.use=500,do.balanced = T,label.columns = F) #first PC only

[Figure: Heatmap for PC1.]

#try running more PCs to visualize how many explain variance
PCHeatmap(agg,pc.use=1:6,cells.use = 500,do.balanced = T,label.columns = F,use.full=F)

[Figure: Heatmaps for PCs 1-6.]

Looks good.  Next, project all the data onto PC space for differential expression analysis.

agg=ProjectPCA(agg,do.print = F)

Differential Expression by Sample

Before re-clustering in PCA space, let’s get lists of genes that are differentially expressed by input sample.  To do this, we’ll overwrite the @ident slot (which contains the cluster identities from the first clustering attempt) with sample group names (from the metadata).

agg=SetAllIdent(agg,id="group")
cell.markers=FindMarkers(agg,ident.1="Neurons",ident.2="Prolif",test.use="wilcox")

With this list of DE genes, we can also visualize results as though we had standard RNAseq samples, by averaging the cells within a group and plotting a scatterplot. The authors of Seurat posted a few nice functions for adding labels to a few gene dots on this plot, which you can download from this page.  I stored the R code for the functions in a separate file, named immune_alignment_functions.R, which I source to load the functions.

prolif.mrkrs=rownames(head(cell.markers[order(-cell.markers$pct.2 + cell.markers$pct.1),],5))
neu.mrkrs=rownames(head(cell.markers[order(cell.markers$pct.2 - cell.markers$pct.1),],5))

#create averaged data model
avg.cells=log1p(AverageExpression(agg,show.progress = F))
avg.cells$gene=rownames(avg.cells)

#load scripts for labeling scatterplot
source('immune_alignment_functions.R')

#plot averaged data highlighting greatest differences
p1=ggplot(avg.cells,aes(Neurons,Prolif))+geom_point(size=.75)
p1=LabelUR(p1,genes=neu.mrkrs,avg.cells,adj.u.t=.3,adj.u.s=.23)
p1=LabelUL(p1,genes=prolif.mrkrs,avg.cells,adj.u.t=.5,adj.u.s=.4,adj.l.t=.25,adj.l.s=.25)
plot(p1)

[Figure: Averaged-expression scatterplot with top DE genes labeled.]

You can also view the top DE genes as an array of PCA space plots:

FeaturePlot(agg,features.plot = c(neu.mrkrs,prolif.mrkrs),cols.use=c("grey","blue"),reduction.use="pca")

[Figure: FeaturePlot of top DE genes in PCA space.]

This shows a nice partition of neuron markers extending to the left and proliferative cell markers to the right.

JackStraw PCA

Seurat includes a more robust function for finding statistically significant PCs through the jackStraw algorithm.  Try this and plot the output.

agg=JackStraw(agg,num.replicate=100,display.progress = T)
JackStrawPlot(agg,PCs=1:18) #to find how many are significant
PCElbowPlot(agg) #another, simpler way to visualize

[Figure: PC elbow plot.]

For me, the elbow plot is the most useful.  It seems to say that the first 9 or 10 PCs really capture the majority of the variance.

Clustering and DE in PCA Space

So using only PCs 1:9, let’s try clustering in PCA space.  I set a small k (k.param=4, intending to find clusters focused on my two samples) and a very low resolution of 0.1 to push towards fewer clusters at this point.

agg=FindClusters(agg,genes.use=agg@var.genes,reduction.type="pca",dims.use=1:9,k.param=4,save.SNN=T,force.recalc=T,plot.SNN=F,resolution=.1)
p1=PCAPlot(agg,dim.1=1,dim.2=2,group.by="group")
p2=PCAPlot(agg,dim.1=1,dim.2=2,group.by="ident")
plot_grid(p1,p2)

[Figure: PCA plots colored by sample group (left) and by cluster (right).]

Looks good: there’s a clear distinction between neurons and proliferating cells, but also a separation within each group.  Let’s find which genes distinguish all four clusters.

all.markers=FindAllMarkers(agg,only.pos=T,min.pct=0.25,thresh.use=0.25)
all.markers %>% group_by(cluster) %>% top_n(5,avg_logFC)

This is the output showing the top 5 genes per cluster:

# A tibble: 20 x 7
# Groups:   cluster [4]
   p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene  
   <dbl>     <dbl> <dbl> <dbl>     <dbl> <fct>   <chr> 
 1     0      2.47 0.986 0.514         0 0       Cst3  
 2     0      2.04 0.864 0.246         0 0       Itm2b 
 3     0      2.02 1     0.911         0 0       Malat1
 4     0      1.79 0.787 0.249         0 0       Ramp1 
 5     0      1.73 0.857 0.484         0 0       Cd81  
 6     0      1.30 0.998 0.617         0 1       Eif5a 
 7     0      1.28 0.779 0.223         0 1       Vps8  
 8     0      1.26 0.925 0.209         0 1       Ddx21 
 9     0      1.21 0.977 0.416         0 1       Tomm5 
10     0      1.20 0.936 0.264         0 1       Srm   
11     0      1.52 0.608 0.088         0 2       Tubb3 
12     0      1.18 0.878 0.474         0 2       Sox11 
13     0      1.13 0.755 0.325         0 2       Stmn1 
14     0      1.10 0.398 0.02          0 2       Elavl3
15     0      1.04 0.664 0.345         0 2       Mllt11
16     0      1.16 0.961 0.602         0 3       Eif2s2
17     0      1.13 0.991 0.773         0 3       Eif4a1
18     0      1.12 0.979 0.595         0 3       Ldha  
19     0      1.11 1     0.922         0 3       Npm1  
20     0      1.05 0.731 0.346         0 3       Ppa1  

Make a table of the top 2 genes per cluster and plot dots showing which genes best characterize which cluster, split by sample group.

top2=all.markers %>% group_by(cluster) %>% top_n(2,avg_logFC)
SplitDotPlotGG(agg,genes.plot=as.character(top2$gene),cols.use = c("blue","red"),x.lab.rot = T,plot.legend = T,dot.scale = 8,do.return = T,grouping.var = "group")

[Figure: SplitDotPlotGG of the top 2 genes per cluster, split by sample group.]

FeaturePlot(agg,features.plot = as.character(top2$gene),cols.use=c("grey","blue"),reduction.use="pca")

[Figure: FeaturePlot of the top 2 genes per cluster in PCA space.]

I think that clusters 1 and 3 best represent the proliferating cells, so let’s re-draw the scatterplot, labeling these top cluster genes appropriately (left/right labeling).

p1=ggplot(avg.cells,aes(Neurons,Prolif))+geom_point(size=.75)
p1=LabelUR(p1,genes=subset(top2,cluster==0 | cluster==2 )$gene,avg.cells,adj.u.t=.3,adj.u.s=.23)
p1=LabelUL(p1,genes=subset(top2,cluster==1 | cluster==3 )$gene,avg.cells,adj.u.t=.5,adj.u.s=.4,adj.l.t=.25,adj.l.s=.25)
plot(p1)

[Figure: Scatterplot labeled with the top genes per cluster.]

There’s more to do: you can dump the all.markers object to a file, and you can consider more clusters by increasing the resolution value in the FindClusters step.  Let the biology guide you!
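
For example, the marker table exports cleanly with write.csv:

#dump the full marker table to a file for use outside R
write.csv(all.markers, file = "all_markers.csv")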

 

Seurat Chapter 1: Analyzing Single Samples

As I’ve learned more about the power of Seurat, I think it’ll be clearest if I split posts into three examples:

  1. Analyzing a single sample
  2. Combining and analyzing two samples
  3. Analyzing multiple (>2) samples

Each has a slightly novel way of dealing with the data and each builds on the previous example.

Single Sample

Based on my earlier post on running raw 10X Genomics sequencing output (fastq files) through a cluster to count transcripts and interpret barcodes, this post will start with the standard directory and file structure output by the cellranger count command.

You should already have installed R and RStudio.  Install Seurat using the RStudio Packages pane.  Click “Install” and start typing “Seurat.”  The Seurat version available in CRAN should be v.2.3.3 and should load automatically along with any other required packages.
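
If you prefer the console, the equivalent is a single command:

install.packages("Seurat")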

In RStudio, use the Files pane to find a convenient location for your working files and output.  Choose the “More/Set as working directory” command.  For all of the following example commands, you can also download my R script file.

Load Seurat and dplyr packages into the workspace:

library(Seurat)
library(dplyr)

For my example, I’m going to rely heavily on a vignette from the Seurat authors.  That vignette does a nice job explaining the algorithms behind each step but I’m going to focus only on the procedure and outcomes.

Load Data

To load your sample, determine the location of the directory named “filtered_gene_bc_matrices.”  Under that should be a folder named with your reference genome–in my case it’s “mm10”.  Using this location (relative to the current working directory–my working directory is adjacent to the sample directory), read the 10X Genomics output into an object.

s1.data=Read10X(data.dir="../sample1/outs/filtered_gene_bc_matrices/mm10/")

Next, create a Seurat data model from this raw data.  Seurat wants a project name (I used “iMOP”) and a filter to include only genes expressed in a minimum number of cells; here I chose 5 cells.  There are many more options you can add at this stage, but for now we’ll take our analysis stepwise through normalization and scaling to see how this works.

s1=CreateSeuratObject(raw.data=s1.data,project="iMOP",min.cells=5)

Count mitochondrial genes expressed

Both as a QC step and for scaling (below), let’s count the number of mitochondrial genes we saw per cell.  First we’ll get a list of gene symbols (for mouse, all start with “mt-”).

mito.genes=grep("^mt-",rownames(x=s1.data),value=T)

I found 13 of the 37 mitochondrial genes in my sample, so this produces a vector of those 13 gene symbols.  Use the summed counts of these genes per barcode, divided by the total count for all genes, to get the percent mitochondrial for each cell.

percent.mito=Matrix::colSums(s1@raw.data[mito.genes,])/Matrix::colSums(s1@raw.data)

To get a sense of the distribution of values, use the summary command:

summary(percent.mito)
Min.     1st Qu.  Median   Mean     3rd Qu.  Max.
0.005147 0.044402 0.054328 0.057863 0.066207 0.733269

My sample has a reasonably low overall rate (mean & median ~5%) but a few cells with a high rate (max 73%).

Seurat has a convenient slot named @meta.data that you can use to store things like this.  This slot will come up again later when we add more samples in future posts.

s1=AddMetaData(s1,metadata = percent.mito,col.name="percent.mito")

Visualize cells and genes as distributions

You can now visualize the overall distributions of detected genes (“nGene”), numbers of unique molecules (UMIs; “nUMI”), and the percent mitochondrial genes per cell, in the form of jittered dots overlaid on a violin plot:

VlnPlot(s1,features.plot = c("nGene","nUMI","percent.mito"),nCol=3)
[Figure: Violin plots showing distributions of genes, UMIs, and percent mitochondrial genes per cell.]

Note that the scales for each plot are different.

[Figure: Relationship between nUMI and (left) the percent mitochondrial genes per cell and (right) the number of detected genes per cell.]

Here’s a plot (with two panels) to show the relationships between the number of molecules (nUMI) and the (left) percent mitochondrial genes (from the percent.mito slot we added to the metadata) and (right) the total number of genes (nGene).

par(mfrow=c(1,2))
GenePlot(s1,gene1="nUMI",gene2="percent.mito")
GenePlot(s1,gene1="nUMI",gene2="nGene")

Filter, normalize, and scale

Now let’s clean up the data with filtering and normalization. Remove cells with low gene counts (here, 500).  You can also choose to exclude cells with unusually high counts, or, as I’ve done here, set the threshold to infinity.

s1=FilterCells(s1,subset.names="nGene",low.thresholds = 500,high.thresholds = Inf)

Normalize (using default parameters):

s1=NormalizeData(s1,normalization.method = "LogNormalize", scale.factor = 10000)

Before scaling, let’s find the genes with the greatest variance by cell:

s1=FindVariableGenes(s1,do.plot=T)
[Figure: Variable genes dispersion plot.]

Then scale, using the percent.mito as part of the regression:

s1=ScaleData(s1,vars.to.regress = c("nUMI","percent.mito"))

Principal Components Analysis

Using the most variable genes (from the FindVariableGenes function, stored in the var.genes slot), calculate principal components:

s1=RunPCA(s1,pc.genes=s1@var.genes,do.print=T,pcs.print = 1:5,genes.print = 5)

Seurat includes a number of visualization tools.  Here are example commands for each of them:

#lists top 5 genes per PC
PrintPCA(s1,pcs.print = 1:5,genes.print = 5,use.full=F) 

#plots component of each of top genes per PC
VizPCA(s1,pcs.use = 1:2) 

#all cells plotted on first two PCs
PCAPlot(s1,dim.1=1,dim.2=2) 

#heatmap showing top genes for first PC only
PCHeatmap(s1,pc.use=1,cells.use=500,do.balanced = T,label.columns = F)
[Figure: VizPCA output for the first two PCs.]

The heatmap command can also be useful to visualize how many PCs explain variance–try 6, 12, or 18.

PCHeatmap(s1,pc.use=1:6,cells.use = 500,do.balanced = T,label.columns = F,use.full=F)
[Figure: Heatmap of the first 6 PCs.  Note how the heatmaps become less distinct with higher-numbered PCs.]

Seurat also includes a method to evaluate statistically significant PCs using jackStraw:

s1=JackStraw(s1,num.replicate=100,display.progress = T)

#to find how many are significant
JackStrawPlot(s1,PCs=1:18)
#another, simpler way to visualize 
PCElbowPlot(s1)
[Figure: PCA elbow plot.]

Clustering

The graph-based clustering method in Seurat relies on the PCA space for data reduction and uses methods similar to KNN and SLM–see the Seurat website for details.  Choose how many PC dimensions you want to include based on the elbow plot.

s1=FindClusters(s1,reduction.type = "pca",dims.use=1:10,resolution=0.6,print.output = 0,save.SNN = T)

#same PCA plot as above, now colored by cluster
PCAPlot(s1,dim.1=1,dim.2=2)

A better way to visualize clusters is to use the tSNE plot:

s1=RunTSNE(s1,dims.use=1:10,do.fast=T)
TSNEPlot(s1) # plot in tSNE space
[Figure: tSNE plot colored by cluster number.]

Genes distinguishing clusters

You can then search for genes that distinguish clusters.  In this example, I chose cluster 1 because I knew it expressed a number of characteristic neuronal markers.  Here I asked for genes distinguishing cluster 1 from all other clusters, but you can also do binary comparisons if you like, using “ident.2” as another argument.

cluster1.markers=FindMarkers(s1,ident.1=1,min.pct=0.25)
print(head(cluster1.markers,n=5))
      p_val avg_logFC pct.1 pct.2 p_val_adj
Aldoc     0 0.9771343 0.935 0.511         0
Hes5      0 0.8220392 0.494 0.179         0
Fabp7     0 0.7561873 0.821 0.405         0
Sparc     0 0.7385394 0.780 0.372         0
Gstm1     0 0.7359628 0.726 0.386         0

Extend this to search for markers for all clusters:

s1.markers=FindAllMarkers(s1,only.pos=T,min.pct=0.25,thresh.use=0.25)
#Display a short table showing top 2 genes per cluster
s1.markers %>% group_by(cluster) %>% top_n(2,avg_logFC)
# A tibble: 16 x 7
# Groups:   cluster [8]
       p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene    
       <dbl>     <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>   
 1 7.62e- 38     0.352 0.306 0.212 1.07e- 33 0       Ptgds   
 2 4.24e- 30     0.325 0.413 0.331 5.93e- 26 0       Selenbp1
 3 0.            0.977 0.935 0.511 0.        1       Aldoc   
 4 6.58e-298     0.829 0.499 0.189 9.21e-294 1       Apoe    
 5 0.            1.36  0.757 0.298 0.        2       Tubb3   
 6 0.            1.16  0.729 0.337 0.        2       Mllt11  
 7 0.            1.37  0.686 0.145 0.        3       Top2a   
 8 0.            1.23  0.851 0.326 0.        3       Hmgb2   
 9 0.            1.51  0.978 0.647 0.        4       Dbi     
10 0.            1.37  0.603 0.123 0.        4       Ntrk2   
11 3.49e- 74     0.512 0.707 0.38  4.88e- 70 5       Hmgb2   
12 7.08e- 67     0.489 0.473 0.204 9.91e- 63 5       Top2a   
13 0.            1.29  0.61  0.103 0.        6       Lhx1    
14 1.15e-222     1.12  0.712 0.242 1.61e-218 6       Cks1b   
15 0.            1.67  0.766 0.137 0.        7       Notch2  
16 1.72e-174     0.944 0.317 0.052 2.40e-170 7       Slc39a1 

This is only one of the possible differential expression tests available–see this vignette to get a list of all of them.  For one more example, we’ll use the “roc” method to identify differential genes:

cluster1.roc=FindMarkers(s1,ident.1 = 1,thresh.use=0.25,test.use="roc",only.pos=T)
print(head(cluster1.roc,n=5))
      myAUC  avg_diff power avg_logFC pct.1 pct.2 p_val_adj
Aldoc 0.807 0.9771343 0.614 0.9771343 0.935 0.511        NA
Cst3  0.793 0.7022215 0.586 0.7022215 0.997 0.848        NA
Ptn   0.746 0.6726631 0.492 0.6726631 0.918 0.576        NA
Fabp7 0.736 0.7561873 0.472 0.7561873 0.821 0.405        NA
Itm2b 0.736 0.6530243 0.472 0.6530243 0.905 0.588        NA

You can now plot distributions of expression for each cluster for specific genes in a violin plot:

VlnPlot(s1,features.plot = c("Aldoc","Hmgb2"))
[Figure: Violin plots for two genes, split by cluster number.]

Or you can plot a longer list overlaid on tSNE plots:

FeaturePlot(s1,features.plot = c("Ptgds","Aldoc","Tubb3","Top2a","Dbi","Hmgb2","Lhx1","Notch2"),cols.use=c("grey","blue"),reduction.use="tsne")
[Figure: tSNE plots highlighting top genes per cluster.]

Finally, you can visualize the top gene markers per cluster and plot a heatmap across all cells:

top10=s1.markers %>% group_by(cluster) %>% top_n(10,avg_logFC)
DoHeatmap(s1,genes.use=top10$gene,slim.col.label=T,remove.key = T)
[Figure: Heatmap of top genes per cluster, for each cell within each cluster.]

Next up, two samples…

 

 

Tuxedo2 Pipeline on Perceval

The new Tuxedo RNAseq pipeline (HISAT2->StringTie->Ballgown) can be set up to take good advantage of the parallelization available on the Perceval high performance computing cluster.  Several steps in the process are run independently for each RNA sample.  This document gives an example of one efficient method for using the power of SLURM arrays to run these parallel steps with a set of simple, generalizable scripts.

Recent presentation covering HPC and basic SLURM steps.

Required Software

The Perceval cluster has a pre-built HISAT2 module that is up to date (at this writing).  Samtools is also available as a module.  However, the Picard java package and the StringTie executable must be downloaded and installed within the user’s account.  This can be done by creating a ~/bin directory, placing desired executables in that location, and adding it to your PATH.
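
If you haven’t set up a personal bin directory before, it’s along these lines:

mkdir -p ~/bin
echo 'export PATH=$HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc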

For this workflow, we’ll take advantage of SLURM arrays.  A script is created with the appropriate #SBATCH directives specifying the resources required for each job.  The array feature uses a variable to specify a range of input files.  Given these components, SLURM will create a separate cluster job for each member of the array.  Using this feature you can write a single script file to execute as many jobs as you have samples.

On the Perceval cluster, use the /scratch/user space for all data and transfers within your jobs.  This means loading your raw fastq files here, saving a genome index file here, and saving output here.  The /scratch space is much faster than /home but it is not backed up so be sure to copy your data off the cluster when you’re done (or at least to your /home space).

One important note that is unique to the Perceval cluster: modules must be loaded while you are logged into the headnode, before submitting jobs to the cluster.  So, before you start, make sure you run “module load HISAT2” and “module load samtools” — see the Perceval Quick Start Guide for more information on modules and versions.
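
That is, from the headnode, before any sbatch:

module load HISAT2
module load samtools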

Steps

The trick to using the array feature of SLURM is to have a set of filenames with an easily-coded numeric value that can be referenced by the script.  My strategy is as follows:

    1.  Transfer fastq files to Perceval. From Windows this is simply accomplished using FileZilla.  From Linux or Mac, use rsync or scp for an SSH-based file transfer.  Store all fastq files in a convenient location—I always use a directory named “fastq”.
    2. Extract the names of every fastq file into a text file. For example:
       find $(pwd) -name "*.fastq.gz" > fastq.txt

      The $(pwd) captures the full path for each file.  Not strictly needed, but often convenient.

    3. Transfer the contents of the fastq.txt file to Excel. This is a simple way to create a custom ln command for each file.  There are lots of easy ways to do this (shell script, python); choose your favorite.  See the attached Excel file and the resulting shell script for examples, or the shell sketch at the end of this post.  The result is a shell script with a single ln command on each line, one for each fastq file.
    4. Run the ln script from the directory where you want the links to be—the fastq folder for example. This creates a soft link for each fastq file with a numbered name.
    5. Now you’re ready to run HISAT2. See the enclosed sbatch script named “arrayHisat.sh” (below).  Important parts of the script:
      -J jobname – not required (the default is the name of the script)
      -c 8 – this tells SLURM how many threads your program will run. It should match the -p value in the hisat2 command line.
      --array=1-15 – this is the key element for setting up the array of commands.  In my case I want to run 15 individual commands, replacing the "${SLURM_ARRAY_TASK_ID}" variable with each number in the range.
      --mail-type/--mail-user – If you use these two options the system will email you when the job is done.  I generally issue the execution command and then log out and wait for the email.  If you don’t want this, just delete these two lines.

      #!/bin/bash
      
      #SBATCH -J hisat #jobname (optional) 
      #SBATCH -n 1 #tasks
      #SBATCH -c 8 #cpu threads per task
      #SBATCH -t 3:59:00 #estimated run time
      #SBATCH -p main #partition--just use main
      #SBATCH --export=ALL
      #SBATCH -o /home/user/output/hisat-%A-%a.out #where to put output if any
      #SBATCH -e /home/user/output/hisat-%A-%a.err #where to put error output
      #SBATCH --array=1-15  #range of numbers to use for the array.  In my case I have 15 samples.
      #SBATCH --mail-type=END,FAIL  #send an email on end or if failure
      #SBATCH --mail-user=user@rutgers.edu #where to send email
      
      hisat2 -t -p 8 --known-splicesite-infile /scratch/user/data/hg19-splicesites.txt -x /scratch/user/genomes/hisat2/hg19/genome -1  /scratch/user/data/lfastq/fastq"${SLURM_ARRAY_TASK_ID}"_R1.fastq.gz -2 /scratch/user/data/lfastq/fastq"${SLURM_ARRAY_TASK_ID}"_R2.fastq.gz -S /scratch/user/data/aligned"${SLURM_ARRAY_TASK_ID}".sam
    6. You may wish to edit the hisat2 command line to suit your preferences. Here are the key features that I chose:
      -t – Report execution time in output.  Not needed for anything. I just like to know.
      -p 8 – Use 8 threads per node.  I haven’t tested more or fewer threads.  See the Perceval partitions to see how many CPUs per node you can use.
      --known-splicesite-infile – This is optional, but I think you’ll probably want to use this feature.  It allows you to load a text file containing every known splice site from the reference GTF file for your genome assembly.  As opposed to TopHat, using this reference splice map is useful whether you wish to search for novel assemblies or not.  You create this file using the hisat2_extract_splice_sites.py python script that comes with HISAT2.  See the HISAT2 documentation.
      -x – location of the index files.  For most genomes there are pre-built indices on the HISAT2 website.  Download them to Perceval and store them in your scratch space.
      -1 and -2 point to the names of the paired-end fastq.gz files (R1 and R2, respectively).  Since we’ve previously created soft links varying only by the sample number, we can specify the full range of names using the "${SLURM_ARRAY_TASK_ID}" variable.
      -S – location and name for the output SAM file created by HISAT2.
    7. If you want to use HISAT2 with Cufflinks instead of StringTie, add this option to your command:
      --dta-cufflinks
    8. Store the arrayHisat.sh script someplace convenient. Since I modify these scripts for every experiment I don’t put them in my ~/bin directory.  I keep a “scripts” directory just below my working directory for the project.
    9. Before running, remember to load the HISAT2 module with the “module load HISAT2” command.  Then issue the command to send the script to SLURM:
       sbatch scripts/arrayHisat.sh
    10. You can monitor progress with:
       squeue -u <username>

      or

       squeue -j <jobnumber>
    11. When HISAT2 is complete for every sample, you should have one SAM file for each sample (or pair of fastq files).  These need to be converted to BAM files, sorted, and indexed.  Use the arraySamtools.sh script (below).  Similar in elements to the arrayHisat.sh script, its SBATCH commands differ only in the job name and the elimination of the -c flag, because samtools is not multi-threaded.  Each of the three steps is run as a separate command inside the same script.  Again, array-numbered file names are used to keep each sample separate.  New to samtools 1.3.1 is the requirement to specify the temp file name.  This is important to keep individual sample temp files separate, so use the sample number (from the array variable) in each name.  Use sbatch to issue the script to SLURM and wait for the job to be completed.  When you’re satisfied that the sorted, indexed BAM files have been created, you can delete the SAM files and the original, unsorted BAM files.  Don’t forget to “module load samtools” before submitting this.
      #!/bin/bash
      
      #SBATCH -J samtools
      #SBATCH -n 1
      #SBATCH -t 3:59:00
      #SBATCH -p main
      #SBATCH --export=ALL
      #SBATCH -o /home/user/data/output/samtools-%A-%a.out
      #SBATCH -e /home/user/data/output/samtools-%A-%a.err
      #SBATCH --array=1-15
      #SBATCH --mail-type=END,FAIL
      #SBATCH --mail-user=user@rutgers.edu
      
      samtools view -b -S /scratch/user/data/results/aligned"${SLURM_ARRAY_TASK_ID}".sam > /scratch/user/data/results/aligned"${SLURM_ARRAY_TASK_ID}".bam
      samtools sort -T /scratch/user/xxx"${SLURM_ARRAY_TASK_ID}" -o /scratch/user/data/results/aligned"${SLURM_ARRAY_TASK_ID}".sorted.bam /scratch/user/data/results/aligned"${SLURM_ARRAY_TASK_ID}".bam 
      samtools index /scratch/user/data/results/aligned"${SLURM_ARRAY_TASK_ID}".sorted.bam
      
    12. In some cases, if there is a concern that the RNAseq library has been over-amplified, you can eliminate duplicate reads using the Picard java package. See the example arrayPicard.sh script.  Script elements are similar to previous ones.  Don’t forget to “module load java” before running this script if you haven’t already.  I run this step as a separate script because it is not always part of the standard pipeline, but it could easily be added to the previous script.
      #!/bin/bash
      
      #SBATCH -J picard
      #SBATCH -n 1
      #SBATCH -t 3:59:00
      #SBATCH -p main
      #SBATCH --export=ALL
      #SBATCH -o /home/user/data/picard-%A-%a.out
      #SBATCH -e /home/user/data/picard-%A-%a.err
      #SBATCH --array=1-15
      #SBATCH --mail-type=END,FAIL
      #SBATCH --mail-user=user@rutgers.edu
      
      java -Xmx4g -jar /home/user/bin/picard.jar MarkDuplicates INPUT=/scratch/user/data/aligned"${SLURM_ARRAY_TASK_ID}".sorted.bam OUTPUT=/scratch/user/data/aligned"${SLURM_ARRAY_TASK_ID}".uniq.bam METRICS_FILE=/scratch/user/data/aligned"${SLURM_ARRAY_TASK_ID}".metrics REMOVE_DUPLICATES=true ASSUME_SORTED=true
      
    13. Now we’re ready to assemble transcripts with StringTie. In this first instance of StringTie, each of the BAM files (in this case the sorted, duplicates-removed versions) is used, along with a reference GTF file, to assemble aligned reads into transcripts.  This is done whether you intend to find novel transcripts or not.  The arrayStringtie.sh script includes the standard form of the command—you may choose other flags as needed.  In this example, I’ve specified 8 CPUs per job (and this must match the SBATCH -c 8 setting).  The output from this step is a GTF file derived from the sample-specific alignments.
      #!/bin/bash
      
      #SBATCH -J stringtie
      #SBATCH -n 1
      #SBATCH -c 8
      #SBATCH -t 3:59:00
      #SBATCH -p main
      #SBATCH --export=ALL
      #SBATCH -o /home/user/data/stringtie-%A-%a.out
      #SBATCH -e /home/user/data/stringtie-%A-%a.err
      #SBATCH --array=1-15
      #SBATCH --mail-type=END,FAIL
      #SBATCH --mail-user=user@rutgers.edu
      
      /home/user/bin/stringtie/stringtie /scratch/user/data/aligned"${SLURM_ARRAY_TASK_ID}".uniq.bam -o /scratch/user/data/aligned"${SLURM_ARRAY_TASK_ID}".uniq.gtf -p 8 -G /scratch/user/data/hg19-genes.gtf 
      
    14. Now we re-run StringTie with the --merge flag. This is only done once for each experiment, not once for each sample, so formally it doesn’t need to be run on the cluster, but it’s convenient to use the same computing power and file locations in a simple script.  My sample is named stMerge.sh.  Create a text file with the names of all the custom GTF files and use this as input (use the find command as we did for the fastq files above).  Output is a single, merged GTF file.
      #!/bin/bash
      
      #SBATCH -J stMerge
      #SBATCH -n 1
      #SBATCH -c 8
      #SBATCH -t 3:59:00
      #SBATCH -p main
      #SBATCH --export=ALL
      #SBATCH -o /home/user/data/stMerge-%A-%a.out
      #SBATCH -e /home/user/data/stMerge-%A-%a.err
      #SBATCH --mail-type=END,FAIL
      #SBATCH --mail-user=user@rutgers.edu
      
      /home/user/bin/stringtie/stringtie --merge -o /scratch/user/data/merged.uniq.gtf -p 8 -G /scratch/user/data/hg19-genes.gtf /scratch/user/data/gtfList.txt
      
    15. The third and final use of StringTie takes this merged GTF file and each BAM file to output quantity measurements. Since we plan to pick up these output files in Ballgown, we’ll send results to a directory named “ballgown” and place each sample’s output in a sub-directory of this.  To tell StringTie to do this step, use the -e and -B flags.  Again, other flags are available—see the StringTie documentation.  See my sample ebStringtie.sh script.
      #!/bin/bash
      
      #SBATCH -J ebStringtie
      #SBATCH -n 1
      #SBATCH -c 8
      #SBATCH -t 3:59:00
      #SBATCH -p main
      #SBATCH --export=ALL
      #SBATCH -o /home/user/data/ebStringtie-%A-%a.out
      #SBATCH -e /home/user/data/ebStringtie-%A-%a.err
      #SBATCH --array=1-15
      #SBATCH --mail-type=END,FAIL
      #SBATCH --mail-user=user@rutgers.edu
      
      
      /home/user/bin/stringtie/stringtie -e -B -p 8 -G /scratch/user/data/merged.uniq.gtf -o /scratch/user/data/aligned"${SLURM_ARRAY_TASK_ID}"/aligned"${SLURM_ARRAY_TASK_ID}".gtf /scratch/user/data/aligned"${SLURM_ARRAY_TASK_ID}".uniq.bam 
      
    16. When this step is complete you can use FileZilla to transfer the entire ballgown directory back to your desktop. Fire up R and load the ballgown package.
    17. Alternatively, you can extract count data and proceed directly to edgeR or DESeq2.  Use the prepDE.py script provided on the StringTie web page.  I’ve successfully loaded the output from this script into DESeq2 for differential analysis and modeling.
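
As promised in step 3, here’s a shell sketch of the link-numbering trick as an alternative to the Excel approach.  It assumes paired-end files whose names differ only by _R1/_R2 (the fastq path is illustrative); run it from the directory where the links should live:

#!/bin/bash
#create numbered soft links: fastq1_R1.fastq.gz, fastq1_R2.fastq.gz, fastq2_R1.fastq.gz, ...
i=0
for r1 in $(find /scratch/user/data/fastq -name "*_R1*.fastq.gz" | sort); do
  i=$((i+1))
  ln -s "$r1" fastq${i}_R1.fastq.gz
  ln -s "${r1/_R1/_R2}" fastq${i}_R2.fastq.gz
done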