Blog

R Script Seurat with a Singularity Container Using SLURM

containerA previous post provide a step-by-step example for setting up a singularity container for use on the HPC (in my case, Perceval).  Here we’ll see how to build a more complex singularity recipe, create a distributable container, and use it to run a few steps of Seurat as an Rscript batch file. This approach allows you to run any version of R and its packages that you need on the HPC in a secure container and also to break free of the limitations of running R on your desktop.

Permissions

A key feature of singularity is the control of user-level access and security.  This means that you don’t need to have any root privileges on the HPC.  However, you do need to be root to build your container on a local machine. To summarize:

  • All steps to build the container using the recipe file and any manual additions need to be run as root (sudo) on your local system.
  • Running the container on the HPC cannot be done as root.  You will use only your assigned username permissions.

The Container Recipe

I downloaded a few sample recipe files from the Singularity Hub, including one that did a nice job of setting up for running R or Rscript.  You can download my recipe file from this link.  I’ll highlight some of the sections to explain how each works.

Bootstrap: docker
From: ubuntu:16.04
IncludeCmd: yes

Use the Docker repository to download Ubuntu Linux v. 16.04 (Xenial), which is quite stable.

Next is the %environment section which I won’t cover here–some of the lines I just copied as is from the downloaded sample file.

The %labels section seems to be somewhat arbitrary but indicates labels useful for tracking your versions, at least.

Next is %apprun and %runscript.  After you’ve built the container, these describe your desired behaviors when you run “singularity run” (run the %runscript command) or “singularity run –app Rscript” (run the %apprun Rscript command).  Here’s the lines from the file:

%apprun R
exec R "$@"

%apprun Rscript
exec Rscript "$@"

%runscript
exec R "$@"

In each of these the “$@” means to append this command with arguments passed from the singularity run command.  More about that later.

The final section is the more important and useful one.  The %post section is run after you’ve bootstrapped the container with Ubuntu loaded.  Here you issue standard Ubuntu commands to load all the packages you’ll need.  First the system packages:

%post
apt-get update
apt-get install -y apt-transport-https apt-utils software-properties-common

#add CRAN/Ubuntu repo, add key, then refresh
apt-add-repository "deb https://cloud.r-project.org/bin/linux/ubuntu xenial/"
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 51716619E084DAB9
apt-get update

apt-get install -y wget nano
apt-get install -y libblas3 libblas-dev liblapack-dev liblapack3 curl
apt-get install -y gcc fort77 aptitude
aptitude install -y g++
aptitude install -y xorg-dev
aptitude install -y libreadline-dev
aptitude install -y gfortran
gfortran --version
apt-get install -y libssl-dev libxml2-dev libpcre3-dev liblzma-dev libbz2-dev libcurl4-openssl-dev 
apt-get install -y libhdf5-dev hdf5-helpers libmariadb-client-lgpl-dev

apt-get install -y r-base r-base-dev
R --version

# installing packages from cran
R --slave -e 'install.packages("devtools",repos="https://cran.rstudio.com/")'
R --slave -e 'install.packages("dplyr",repos="https://cran.rstudio.com/")'
R --slave -e 'install.packages("rhdr5",repos="https://cran.rstudio.com/")'
R --slave -e 'install.packages("Seurat",repos="https://cran.rstudio.com/")'

# installing from bioc
R --slave -e 'source("https://bioconductor.org/biocLite.R"); biocLite("pachterlab/sleuth")'
R --slave -e 'source("https://bioconductor.org/biocLite.R"); biocLite("cummeRbund")'

# installing from 10xgenomics repo
R --slave -e 'source("http://cf.10xgenomics.com/supp/cell-exp/rkit-install-2.0.0.R") '

This took me some work to figure out what steps were required and the correct order of execution.

  1. Start by refreshing the apt-get data from the pre-installed repositories.
  2. Because we will add a new repository with a secure connection (https), we need to grab the secure transport tool.  Also, we’ll need to install some utils and common tools to be able to use command-line to add repositories and keys.
  3. Once those are installed, we can add the CRAN/Ubuntu repository using a command (instead of editing it into the /etc/apt/sources.list system file).  After that, add the public key for this repository.  Then refresh the repositories.
  4. Next we need several package that are required for R installation (which requires compilation) and some extraneous packages required by the specific R packages that we’ll load later.  The wget and nano packages are included for convenience but aren’t really needed.  The gfortran –version is just a safety check that fortan got installed correctly–it would throw an error if the install didn’t work.  We will need the hdf5 and mariadb packages for Seurat and cummeRbund, respectively.  If you don’t want all those packages you can delete some of these.
  5. The last apt-get installs R.
  6. Next we’ll use several steps to install packages in R.  Using –slave eliminates most unneeded output.  The -e flag says to execute the string in quotes from the command-line.  I specified which repos to use just in case.
  7. After some cran installs, I switch to biocLite() from bioconductor for some packages. Note that you need to source (download) biocLite() each time you run an R –slave command.
  8. Finally, I use the 10XGenomics repository to install their cellrangerRkit package.

With this recipe, you can first create a sandbox to test it (if you wish), using this command on a local Linux system:

sudo singularity build --sandbox biocBox/ Singularity.bioconductor-make.txt

This takes a while to run on my Linux desktop, maybe ~30 minutes.  When it’s done you can sudo singularity shell into the sandbox and check it.  If everything works, go ahead and convert to a non-writable image file:

sudo singularity build bioconductorBox.simg biocBox/

This produced an image file that was only 569M!  Transfer this to your space on the HPC.  I used a “containers” folder under my /scratch space:

scp bioconductorBox.simg user@perceval.rutgers.edu:/scratch/user/containers/.

Rscript file

We’ll need a file containing R commands that can run non-interactively.  That is, if you’re used to working in RStudio (as I am) you need to consider every step to make sure it will work without you typing anything.  The most important part is to capture and explicitly save any graphics output into files.

For this example, I’ll take commands used from the blog post using Seurat with single samples.  The Rscript file can be downloaded here.

The most important and possibly confusing part is where I specify the working directory.  Since I plan to mount my /scratch/user under the /mnt mountpoint, the location that is actually /scratch/user/sample1 needs to be called as /mnt/sample1 when running inside the container.

I played around with using ggplot() for the ggplot2 objects but I found that the more basic pdf(), followed by the plot command, followed by dev.off() works more consistently.

Any text output will be collected into files that will be specified in my SLURM script.

At the end, I explicitly save the workspace into a standard .RData file.  This allows me to download the file to my desktop, open it with RStudio, and continue data exploration.  All the data are loaded, normalized and scaled.  When multiple samples are combined, you should consider also deleting (rm()) all the original single-sample objects to make a smaller workspace.

SLURM Script

Here comes the best part–running the container with your data.  It’s possible to use singularity script or run from command-line if you request a compute node–this allows you to type commands as you go.  But this post is about batch processing.  The final SLURM script can be downloaded here.

Because we’ll be running Seurat, which is currently single-threaded, we only ask for one node, one task, but a good amount (64 Gb) of RAM.  Remember to give the script a reasonable execution time estimate.  The STDOUT and STDERR output are redirected to files for later review.  Finally, it’s not necessary but I like to get an email when it’s done.

The first step is to manually load the singularity module.  As before, the module is hidden from the module spider command because it’s currently considered to be “under development” on our system.  This is designated by the dot in front of the version.

The srun command calls singularity with the run option, but then specifies which –app to run.  Remember that the recipe file gives us three behaviors, two run R (either run or run –app R) and one runs Rscript.

At this point we can also mount the filespace we plan to use with a –bind command.  This allows us to use /mnt to find all the files under /scratch/user.

Next is the name of the container to run.

Last is the argument we want to pass to the exec Rscript “$@” command.  By the time this command is run, we’re working inside the container so we need the address after mounting the /scratch space, where I put the R script file in my containers folder.  This took me several tries to figure out.

Here’s the full srun command:

srun singularity run --app Rscript --bind /scratch/user:/mnt bioconductorBox.simg /mnt/containers/autoSeurat.R

Run it!

Once the data are loaded (in my example, under /scratch/user/sample1), the container image, the R script file, and the SLURM script are loaded (under /scratch/user/containers), you’re ready to go.  Be sure to create an output folder under containers to catch the STDOUT and STDERR files.

sbatch singBio.sh

You should get an email saying that it ran without errors.  Your STDOUT and STDERR output are in the containers/output folder.  The R script saved files are all in the working directory you specified in the R script file (/scratch/user/sample1/outs/filtered_gene_bc_matrices/mm10).

Advertisements

Singularity to run Seurat

Based on my previous posts about using Seurat for single-cell RNAseq data (single sample or two samples), it started to become clear to me that many people will have trouble with their computing resources.  My desktop is Windows 10 with 64 Gb of RAM and I was reaching my limits (with a few other programs running in background) when I tried to combine four 10XGenomics datasets!

It seemed reasonable to use our HPC (Perceval or Amarel at Rutgers, run by OARC) environment to take advantage of a much more robust hardware environment.  I quickly learned that there was no way to install all of the required system libraries or even R packages that I needed.  I wish I had a fully-customizable “desktop-like” environment on my HPC!

Naturally, I’m not the only person who had this thought.  The HPC staff recommended Singularity, which is a “container” technology designed exactly for my needs–to run computationally-intensive jobs on a HPC while keeping my custom installation environment isolated from the system.  Singularity has the unique property of maintaining user identity and security.  But it provides an environment which acts like a combined version of your Linux desktop and the HPC system.

I am only a beginner to the use of Singularity.  There’s much more to it than I’m going to describe.  This description will be limited to how I got Singularity to allow me to run R and Seurat on a single compute node on my cluster.  You can also submit batch jobs using SLURM and use it for many other customized workflows, but I’ll skip all that for now.

Install Singularity on a local computer

Following the “Quick Start” directions on the Sylabs.io site, I used git to clone singularity into a new directory on my Ubuntu 16.04 desktop.  Follow the Quick-Start directions to make and install.  You’ll need sudo permission to install into a local system location.

Build a Sandbox Container

The first step is to build an empty container and “install” an OS. I’m going to use Ubuntu, since this matches my local Linux computer’s system.  After trying a few things and reviewing the documentation, I chose to create a very basic recipe file first.  Here’s my initial recipe file:

Bootstrap: docker
From: ubuntu:16.04

%runscript
echo "This is what happens when you run the container.."

%post
echo "Hello from inside the container"
apt-get -y update
apt-get -y install vim

%environment
export LC_ALL=C
export PATH=$PATH

All this was saved into a file named “Singularity” in a new directory that I named “containers“.  My instructions say to pull Ubuntu 16.04 from the docker repository and then to give me a message (“Hello from inside the container”) and run some apt-get command (just to prove to myself that this works).  I copied some environment settings from an example I found–I still need to research these and perhaps modify them.

To start, call singularity as root and tell it you want a writable “sandbox” container, which is implemented as a new subdirectory named “myubuntu/“.

sudo singularity build --sandbox myubuntu/ Singularity

Once this runs, you should have a new subdirectory in your containers folder.  You can start an interactive shell in this new environment to install anything you want.  Make sure to start it using sudo and use the –writable flag (if you leave this out, you can “test-drive” the shell but anything you do won’t be saved).

sudo singularity shell --writable myubuntu/

After giving your sudo password, you should see:

Singularity: Invoking an interactive shell within container...
Singularity myubuntu:~>

Since you pre-installed vim, you should be able to run it now from command-line.  You can now use apt-get to install anything you want to use within your container.

Fix your Recipe

Before moving on to R, we’ll need to install several libraries we’re going to need.  One of these is for installing R from the secure CRAN repository, others are needed for Seurat prerequisites.  You can run these lines manually from the shell command-prompt:

apt-get -y install libssl-dev
apt-get -y install libcurl4-openssl-dev
apt-get -y install libhdf5-dev
apt-get -y install apt-transport-https

Or, if you prefer, you can add these commands to your Singularity recipe file (after the vim install command).  Just delete your myubuntu container, and re-run the sudo singularity build command with the new recipe file.

Installing R Inside Your Container

This gets a little tricky, since the Ubuntu 16.04 default r-base package is too old to work with Seurat.  To use a CRAN/Ubuntu repository for installation, we need to do some customization.  First, edit your /etc/apt/sources.list file (meaning the file inside your singularity container–make sure you have already used the “sudo singularity shell…” command from above).  Open it with the nano editor and add this line at the end:

deb https://cloud.r-project.org/bin/linux/ubuntu xenial/

Before you can use this repository, however, you need to download and install the public key for this repository.  I found that there’s lots of websites listing the older, outdated key; here’s the correct command that works:

apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 51716619E084DAB9

If that seems to work (no error messages) you’re ready to start.  Update your install repositories:

apt-get update

And then install R:

apt-get install r-base r-base-dev

This’ll take a while.  When it’s done, you should be able to open R:

Singularity myubuntu:~> R

R version 3.4.4 (2018-03-15) -- "Someone to Lean On"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

Install R Packages

Just to be sure everything is working, start with an easy one.

install.packages("dplyr")

You should be able to install dplyr into your local user library without any problem.  If everything works correctly, go on to the big one:

install.packages("Seurat")

This will run for a while.  There’s a lot of dependencies and everything needs to be downloaded and compiled.  If everything is working, you will be able to load the library:

library(Seurat)

If that works, you’re ready to finalize your container.

Convert your Container

Quit R (“q()”, save history if you like) and leave your container’s shell (control-d).  Now you’re back at the prompt for your local Linux system.  Convert your sandbox to a single file.

sudo singularity build --writable production.simg myubuntu/

Doing this maintains your sandbox (myubuntu/) but creates a new ext3 image file named production.simg — mine is ~1.5Gb. If you remove the –writable flag from this command, you get a smaller, squashfs file that works but cannot be modified.  Once you’re done testing everything this is a good way to make a static, working container.

Move this file to your home space on the HPC:

scp production.simg username@perceval.rutgers.edu:/home/username/.

Singularity on Perceval

Next, ssh into your account on Perceval.  Singularity is installed as a module but it does not show up on the module spider command list because it is still in “test” mode and they have it hidden.  Load it like this:

module load singularity/.2.5.1

The dot in front of the version number indicates that it’s hidden.  Now, before starting to run your container, request a node from the cluster so you’re not using compute resources on the headnode:

srun --partition=main --nodes=1 --ntasks=1 --cpus-per-task=1 --mem=8g --time=00:30:00 --export=ALL --pty bash -i

It may take a minute until you’re assigned to an available node, but when you do you’ll see a change in your prompt similar to this:

[username@node073 ~] $

Start up your container:

singularity shell production.simg

You’ll see the same prompt as before.  You can now start R and load library(Seurat) as before on your desktop!

Dynamically Adding Mount Points

One issue I noticed was that I didn’t have access to the /scratch/username volume in my container.  With some input from Josh and some trial-and-error, I found the solution.  You can specify additional mount when you invoke the container but they must be bound to an existing mount-point.  I realized that my barebones Ubuntu already had a /mnt in the directory tree and it was empty.  So here’s the new command:

singularity shell --bind /scratch/username:/mnt production.simg

Now your /scratch/username space is found at /mnt.  Easy!

A Smaller Production Container

Once you’ve got everything built and you’ve tested your container (made with the –writable flag), you can make a smaller version of the same container for production.  For example, go back to your local system (where you built the myubuntu/ sandbox) and do this:

sudo singularity build final.simg myubuntu/

The file size for this is about half of the writable one.  Copy it to the HPC system and run it or start a shell as before.  The only difference is that you can’t modify this container.

Headnode Error

If you try to start your singularity container on the HPC and see this error:

ERROR : Failed to resolve path to /var/singularity/mnt/container
ABORT : Retval = 255

You’re probably trying to run from the headnode, which isn’t allowed.  Use the srun command to gain access to a compute node.

Notes

  1. Notice the settings on the srun command above for requesting resources.  To test the container, I specified only 1 node, 1 CPU and 8g RAM, as well as setting a time limit of 30 minutes.  Seurat is single-threaded so there’s probably no reason to set more CPUs, but certainly you’ll want to increase the RAM and maybe the time.  Asking for more resources may delay granting your request.
  2. This example allows for interactive shell usage of R only.  Once you set up an R script file with all your commands, there’s no reason why you can’t automate the process with an Rscript file.R command.  You can even put that in your recipe file and use “singularity run” instead of “shell” (put it under %runscript).  But that’s for another post…

 

 

 

 

 

Seurat Chapter 2: Two Samples

We’ve already seen how to load data into a Seurat object and explore sub-populations of cells within a sample, but often we’ll want to compare two samples, such as drug-treated vs. control.  In this example we’ll use one sample made from a proliferating neuronal precursor cells (“Prolif”) and one that’s been differentiated into post-mitotic neurons (“Neurons”).  A key aspect of doing this with single-cell RNAseq is that we won’t assume that either population is uniform.

As with previous posts, much of this follows vignettes posted on the Seurat website very closely.

Computing Resources

Before beginning, we should consider that these analyses require substantial computing resources.  I use a Windows 10 desktop with 10 physical processors (20 logical processors; although most R functions are not multi-threaded by default) and 64 Gb of RAM.  Some of the steps here took more than 10 minutes to run on this system.  Saving the final data environment to an .RData file can take a while.  So think about where to run this kind of work before you start on an old Mac laptop!

Load Data and Create Objects

As with the single-sample example, the steps are to load the 10X Genomics cellranger output into a data object, create a Seurat object, add metadata, filter, normalize, and scale.  But before combining two objects, we need to add a sample-specific identifier to each UMI.

First, load the libraries and the data:

library(Seurat)
library(dplyr)

#create samples objects for samples 2 and 4 
s2.data=Read10X(data.dir="../../sample2/outs/filtered_gene_bc_matrices/mm10/")
s4.data=Read10X(data.dir="../../sample4/outs/filtered_gene_bc_matrices/mm10/")

Seurat provides a function “RenameCells” but I could never get that to work as expected.  So I found a simple trick to use standard R functions (paste) to add a sample-specific string to each UMI string.

#prepend cell IDs (UMIs) with sample identifier to distinguish samples after merging
colnames(x = s2.data) <- paste('s2', colnames(x = s2.data), sep = '_')
colnames(x = s4.data) <- paste('s4', colnames(x = s4.data), sep = '_')

Now each UMI (for example, “AAACCTGAGCGAGAAA,” which is likely found in every sample) is replaced by one that distinguishes cells from each sample (such as “s2_AAACCTGAGCGAGAAA“).

Now we’re ready to create and pre-process our two Seurat objects:

#for each, create object, add metadata, filter, normalize, and scale
s2=CreateSeuratObject(raw.data=s2.data,project="iMOP",min.cells=5)
s2@meta.data$group="Neurons"
s2=FilterCells(s2,subset.names="nGene",low.thresholds = 500,high.thresholds = Inf)
s2=NormalizeData(s2)
s2=ScaleData(s2,display.progress = T)

s4=CreateSeuratObject(raw.data=s4.data,project="iMOP",min.cells=5)
s4@meta.data$group="Prolif"
s4=FilterCells(s4,subset.names="nGene",low.thresholds = 500,high.thresholds = Inf)
s4=NormalizeData(s4)
s4=ScaleData(s4,display.progress = T)

Combine Samples with CCA

Before proceeding to a canonical correlation analysis (CCA, which also combines the two samples), let’s find the 1000 genes from each sample with the highest dispersion.  Then we’ll combine the two lists and confirm that they are found in both samples.

#select variable genes common to both samples
s2=FindVariableGenes(s2,do.plot=T)
s4=FindVariableGenes(s4,do.plot=T)
g.2=head(rownames(s2@hvg.info),1000)
g.4=head(rownames(s4@hvg.info),1000)
genes.use=unique(c(g.2,g.4))
genes.use=intersect(genes.use,rownames(s2@scale.data))
genes.use=intersect(genes.use,rownames(s4@scale.data))

We can now use RunCCA to combine the two samples and also to identify common sources of variation between the two datasets.

#do CCA
agg=RunCCA(s2,s4,genes.use=genes.use,num.cc=30)

To visualize the overlap of the two samples in CCA space and also to check distribution of expression signals, we can create two diagnostic plots:

p1=DimPlot(object=agg,reduction.use="cca",group.by="group",pt.size=0.5,do.return=T)
p2=VlnPlot(object=agg,features.plot="CC1",group.by="group",do.return=T)
plot_grid(p1,p2)
CCA plots
CCA space plot and violin plot of abundance per sample.

The PrintDim function outputs the top distinguishing genes in each CCA dimension.

PrintDim(object=agg,reduction.type="cca",dims.print=1:2,genes.print=10)

Finally, this plot shows the smoothed shared correlation strength versus CCA dimension number, to evaluate how many dimensions are useful.

MetageneBicorPlot(agg,grouping.var="group",dims.eval=1:30,display.progress=T)

MetageneBicorplot

A heatmap is plotted to associate the most variable genes with each cluster.  For this example, I only plotted the first 9 CCs.

DimHeatmap(object=agg,reduction.type="cca",cells.use=500,dim.use=1:9,do.balanced=T)

 

With this you can now align the data to the CCA subspace–choose the number of CC dimensions that make sense for your sample.  Note that each of these dimension reduction steps produces a new set of data under the @dr slot, so you can refer to this for clustering.  After this point, you will have both “cca” and “cca.aligned” under this slot.

agg=AlignSubspace(agg,reduction.type="cca",grouping.var="group",dims.align=1:16)

Let’s plot distributions, as violin plots, for each of the first two CC dimensions.

p1=VlnPlot(object=agg,features.plot="ACC1",group.by="group",do.return=T)
p2=VlnPlot(object=agg,features.plot="ACC2",group.by="group",do.return=T)
plot_grid(p1,p2)

aligned CCA first two dims

tSNE

One visualization method is to project the data into tSNE space.  We can also use the cca.aligned data to find clusters.  Note that you can specify how many CC dimensions to use for clustering and also specify the “resolution.”  A resolution greater than one favors more clusters, less than one favors fewer clusters.  Plot the tSNE space showing the sample identifier (“group”) and the clusters.

agg=RunTSNE(agg,reduction.use="cca.aligned",dims.use=1:16,do.fast=T)
agg=FindClusters(agg,reduction.type="cca.aligned",resolution=0.6,dims.use=1:16)
p1=TSNEPlot(agg,do.return=T,pt.size=0.5,group.by="group")
p2=TSNEPlot(agg,do.label=T,do.return=T,pt.size=0.5)
plot_grid(p1,p2)

tSNE sample vs clusters

Interestingly, with this dataset, tSNE did not turn out to separate the proliferating cells well from the neurons.  There’s also a new @dr dataset named “tsne”.  There’s 8 clusters and some clear overlap with samples, but it’s kind of a mess.

Principal Components Analysis

So next I tried principal components.  For this I used only the @var.genes slot of the combined object, which has fewer genes than the genes.use list created above.  I ask for a list of 5 distinguishing genes for each of the first 5 principal components.

agg=RunPCA(agg,pc.genes=agg@var.genes,do.print=T,pcs.print = 1:5,genes.print = 5)

Here’s the output:

[1] "PC1"
[1] "Malat1" "Cst3"   "mt-Co1" "mt-Co3" "Itm2b" 
[1] ""
[1] "Eif5a" "Lyar"  "H2afz" "Ncl"   "Ran"  
[1] ""
[1] ""
[1] "PC2"
[1] "Rpl32"  "Rps5"   "Rps4x"  "Eef1a1" "Rps7"  
[1] ""
[1] "Dbi"     "Lgals1"  "Adh7"    "Igfbp2"  "mt-Atp6"
[1] ""
[1] ""
[1] "PC3"
[1] "Phgdh" "Aldoc" "Eef1d" "Ptn"   "Eif1" 
[1] ""
[1] "mt-Nd1"  "Sox11"   "Elavl3"  "mt-Nd2"  "mt-Atp6"
[1] ""
[1] ""
[1] "PC4"
[1] "Rps27a" "Rpl37"  "Rps23"  "Rpl32"  "Rpl26" 
[1] ""
[1] "Tubb3"  "Rtn1"   "Elavl3" "Stmn1"  "Tuba1a"
[1] ""
[1] ""
[1] "PC5"
[1] "Tubb3"  "Tuba1a" "Stmn2"  "Elavl3" "Calm2" 
[1] ""
[1] "Ybx3"     "Mtdh"     "mt-Nd2"   "Kcnq1ot1" "Rplp1"   
[1] ""
[1] ""

This looks promising (based on the genes).  Note that it also adds another dataset in the @dr slot named “pca”.  Try more visualizations.

#various ways to show output from PCA
VizPCA(agg,pcs.use = 1:2) #plots component of each of top genes per PC

vizpca

PCAPlot(agg,dim.1=1,dim.2=2) #all cells plotted on first two PCs

pca plot clusters

 PCAPlot(agg,dim.1=1,dim.2=2,group.by="group") #show source samples

pca plot groups

This shows a very clear distinction between the starting proliferating cells and the resulting neurons.  That’s the kind of display I was looking for.  Let’s see which genes distinguish a few PCs.

PCHeatmap(agg,pc.use=1,cells.use=500,do.balanced = T,label.columns = F) #first PC only

PC1 heatmap

#try running more PCs to visualize how many explain variance
PCHeatmap(agg,pc.use=1:6,cells.use = 500,do.balanced = T,label.columns = F,use.full=F)

PC heatmap 1-6

Looks good.  Next, project all the data onto PC space for differential expression analysis.

agg=ProjectPCA(agg,do.print = F)Differential Expression by Sample

Differential Expression by Sample

Before re-clustering in PCA space, let’s get lists of genes that are differentially expressed by input sample.  To do this, we’ll overwrite the @ident slot (which contains the cluster identities from the first clustering attempt) with sample group names (from the metadata).

agg=SetAllIdent(agg,id="group")
cell.markers=FindMarkers(agg,ident.1="Neurons",ident.2="Prolif",test.use="wilcox")

With this list of DE genes, we can also visualize results as though we had standard RNAseq samples, by averaging the cells within a group and plotting a scatterplot. The authors of Seurat posted a few nice functions for adding labels to a few gene dots on this plot, which you can download from this page.  I stored the R code for the functions in a separate file, named immune_alignment_functions.R, which I source to load the functions.

prolif.mrkrs=rownames(head(cell.markers[order(-cell.markers$pct.2 + cell.markers$pct.1),],5))
neu.mrkrs=rownames(head(cell.markers[order(cell.markers$pct.2 - cell.markers$pct.1),],5))

#create averaged data model
avg.cells=log1p(AverageExpression(agg,show.progress = F))
avg.cells$gene=rownames(avg.cells)

#load scripts for labeling scatterplot
source('immune_alignment_functions.R')

#plot averaged data highlighting greatest differences
p1=ggplot(avg.cells,aes(Neurons,Prolif))+geom_point(size=.75)
p1=LabelUR(p1,genes=neu.mrkrs,avg.cells,adj.u.t=.3,adj.u.s=.23)
p1=LabelUL(p1,genes=prolif.mrkrs,avg.cells,adj.u.t=.5,adj.u.s=.4,adj.l.t=.25,adj.l.s=.25)
plot(p1)

DE scatterplot

You can also view the top DE genes as an array of PCA space plots:

FeaturePlot(agg,features.plot = c(neu.mrkrs,prolif.mrkrs),cols.use=c("grey","blue"),reduction.use="pca")

DE genes PCA plots

This shows a nice partition of neuron markers extending to the left and proliferative cell markers to the right.

JackStraw PCA

Seurat includes a more robust function for finding statistically significant PCs through the jackStraw algorithm.  Try this an plot output.

agg=JackStraw(agg,num.replicate=100,display.progress = T)
JackStrawPlot(agg,PCs=1:18) #to find how many are significant
PCElbowPlot(agg) #another, simpler way to visualize

pc elbow plot

For me, the elbow plot is the most useful.  It seems to say that the first 9 or 10 PCs really capture the majority of the variance.

Clustering and DE in PCA Space

So using only PCs 1:9, let’s try clustering in PCA space.  I set the k to 2 (intending to find clusters focused on my two samples) and very low resolution of 0.1 to push towards fewer clusters at this point.

agg=FindClusters(agg,genes.use=agg@var.genes,reduction.type="pca",dims.use=1:9,k.param=4,save.SNN=T,force.recalc=T,plot.SNN=F,resolution=.1)
p1=PCAPlot(agg,dim.1=1,dim.2=2,group.by="group")
p2=PCAPlot(agg,dim.1=1,dim.2=2,group.by="ident")
plot_grid(p1,p2)

pca cluster plots

Looks good–there’s a clear distinction between neurons and proliferating cells but a separation within each group.  Let’s find which genes distinguish all four clusters.

all.markers=FindAllMarkers(agg,only.pos=T,min.pct=0.25,thresh.use=0.25)
all.markers %>% group_by(cluster) %>% top_n(5,avg_logFC)

This is the output showing the top 5 genes per cluster:

# A tibble: 20 x 7
# Groups:   cluster [4]
   p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene  
   <dbl>     <dbl> <dbl> <dbl>     <dbl> <fct>   <chr> 
 1     0      2.47 0.986 0.514         0 0       Cst3  
 2     0      2.04 0.864 0.246         0 0       Itm2b 
 3     0      2.02 1     0.911         0 0       Malat1
 4     0      1.79 0.787 0.249         0 0       Ramp1 
 5     0      1.73 0.857 0.484         0 0       Cd81  
 6     0      1.30 0.998 0.617         0 1       Eif5a 
 7     0      1.28 0.779 0.223         0 1       Vps8  
 8     0      1.26 0.925 0.209         0 1       Ddx21 
 9     0      1.21 0.977 0.416         0 1       Tomm5 
10     0      1.20 0.936 0.264         0 1       Srm   
11     0      1.52 0.608 0.088         0 2       Tubb3 
12     0      1.18 0.878 0.474         0 2       Sox11 
13     0      1.13 0.755 0.325         0 2       Stmn1 
14     0      1.10 0.398 0.02          0 2       Elavl3
15     0      1.04 0.664 0.345         0 2       Mllt11
16     0      1.16 0.961 0.602         0 3       Eif2s2
17     0      1.13 0.991 0.773         0 3       Eif4a1
18     0      1.12 0.979 0.595         0 3       Ldha  
19     0      1.11 1     0.922         0 3       Npm1  
20     0      1.05 0.731 0.346         0 3       Ppa1  

Make a table of the top 2 genes per cluster and plot dots showing which genes best characterize which cluster, split by sample group.

top2=all.markers %>% group_by(cluster) %>% top_n(2,avg_logFC)
SplitDotPlotGG(agg,genes.plot=as.character(top2$gene),cols.use = c("blue","red"),x.lab.rot = T,plot.legend = T,dot.scale = 8,do.return = T,grouping.var = "group")

splitdotplot pca clusters

FeaturePlot(agg,features.plot = as.character(top2$gene),cols.use=c("grey","blue"),reduction.use="pca")

pca feature top2 genes per cluster

I think that clusters 1 and 3 best represents the proliferating cells, so let’s re-draw the scatterplot, labeling these top cluster genes appropriately (left/right labeling).

p1=ggplot(avg.cells,aes(Neurons,Prolif))+geom_point(size=.75)
p1=LabelUR(p1,genes=subset(top2,cluster==0 | cluster==2 )$gene,avg.cells,adj.u.t=.3,adj.u.s=.23)
p1=LabelUL(p1,genes=subset(top2,cluster==1 | cluster==3 )$gene,avg.cells,adj.u.t=.5,adj.u.s=.4,adj.l.t=.25,adj.l.s=.25)
plot(p1)

scatter pca clusters top2

There’s more to do–you can dump the all.markers object to a file.  We can consider more clusters by increasing the resolution value in the FindAllMarkers step.  Let the biology guide you!

 

Seurat Chapter 1: Analyzing Single Samples

As I’ve learned more about the power of Seurat, I think it’ll be clearest if I split posts into three examples:

  1. Analyzing a single sample
  2. Combining and analyzing two samples
  3. Analyzing multiple (>2) samples

Each has a slightly novel way of dealing with the data and each builds on the previous example.

Single Sample

Based on my earlier post to run raw 10X Genomics sequencing output (fastq files) on a cluster to count transcripts and interpret barcodes (UMIs), this post will start with the standard directory and file structure output by the cellranger count command.

You should already have installed R and RStudio.  Install Seurat using the RStudio Packages pane.  Click “Install” and start typing “Seurat.”  The Seurat version available in CRAN should be v.2.3.3 and should load automatically along with any other required packages.

In RStudio, use the Files pane to find a convenient location for your working files and output.  Choose the “More/Set as working directory” command.  For all of the following example commands, you can also download my R script file.

Load Seurat and dplyr packages into the workspace:

library(Seurat)
library(dplyr)

For my example, I’m going to rely heavily on a vignette from the Seurat authors.  That vignette does a nice job explaining the algorithms behind each step but I’m going to focus only on the procedure and outcomes.

Load Data

To load your sample, determine the location of the directory named “filtered_gene_bc_matrices.”  Under that should be a folder named with your reference genome–in my case it’s “mm10”.  Using this location (relative to the current working directory–my working directory is adjacent to the sample directory), read the 10X Genomics output into an object.

s1.data=Read10X(data.dir="../sample1/outs/filtered_gene_bc_matrices/mm10/")

Next, create a Seurat data model from this raw data.  Seurat wants a project name (I used “iMOP”) and a filter to include only genes expressed in a minimum number of cells, here I chose 5 cells.  There are many more options you can add at this stage but for now we’ll take our analysis stepwise through normalization and scaling to see how this works.

s1=CreateSeuratObject(raw.data=s1.data,project="iMOP",min.cells=5)

Count mitochondrial genes expressed

Both as a QC step and for scaling (below), let’s count the number of mitochondrial genes we saw per cell.  First we’ll get a list of gene symbols (for mouse, all start with “mt-“).

mito.genes=grep("^mt-",rownames(x=s1.data),value=T)

I found 13 of the 37 mitochondrial genes in my sample, so this produces a vector of those 13 gene symbols.  Use the summed counts of these genes per barcode, divided by the total numbers of counts for all genes, to get percent mitochondrial for each cell.

percent.mito=Matrix::colSums(s1@raw.data[mito.genes,])/Matrix::colSums(s1@raw.data)

To get a sense of the distribution of values, use the summary command:

summary(percent.mito)
Min.     1st Qu.  Median   Mean     3rd Qu.  Max.
0.005147 0.044402 0.054328 0.057863 0.066207 0.733269

My sample has a reasonably low overall rate (mean & median ~5%) but a few cells with a high rate (max 73%).

Seurat has a convenient slot named metadata that you can use to store things like this.  This slot will come up again later when we add more samples in future posts.

s1=AddMetaData(s1,metadata = percent.mito,col.name="percent.mito")

Visualize cells and genes as distributions

You can now visualize the overall distributions of detected genes (“nGene”), numbers of cells (as UMI barcodes; “nUMI”), and the percent mitochondrial genes per cell, in the form of jittered dots overlaid on a violin plot:

VlnPlot(s1,features.plot = c("nGene","nUMI","percent.mito"),nCol=3)
Violin plot showing distributions of genes, cells, and percent mitochondrial genes per cell.

Note that the scales for each plot are different.

Relationship between the numbers of cells and (left) the percent mitochondrial genes per cell and (right) the number of detected genes per cell.

 

Here’s a plot (with two panels) to show the relationships between the number of cells (nUMI) and the (left) percent mitochondrial genes (from the percent.mito slot we added to the metadata) and (right) the total number of genes (nGene).

par(mfrow=c(1,2))
GenePlot(s1,gene1="nUMI",gene2="percent.mito")
GenePlot(s1,gene1="nUMI",gene2="nGene")

Filter, normalize, and scale

Now let’s clean up the data with filtering and normalization. Remove cells with low gene counts (here, 500).  You can also choose to exclude cells with unusually high counts, or, as I’ve done here, set the threshold to infinity.

s1=FilterCells(s1,subset.names="nGene",low.thresholds = 500,high.thresholds = Inf)

Normalize (using default parameters):

s1=NormalizeData(s1,normalization.method = "LogNormalize", scale.factor = 10000)

Before scaling, let’s find the genes with the greatest variance by cell:

s1=FindVariableGenes(s1,do.plot=T)
Variable genes dispersion plot

Then scale, using the percent.mito as part of the regression:

s1=ScaleData(s1,vars.to.regress = c("nUMI","percent.mito"))

Principal Components Analysis

Using the most variable genes (from the FindVariableGenes function, stored in the var.genes slot), calculate principal components:

s1=RunPCA(s1,pc.genes=s1@var.genes,do.print=T,pcs.print = 1:5,genes.print = 5)

Seurat includes a number of visualization tools.  Here’s example commands for each of them:

#lists top 5 genes per PC
PrintPCA(s1,pcs.print = 1:5,genes.print = 5,use.full=F) 

#plots component of each of top genes per PC
VizPCA(s1,pcs.use = 1:2) 

#all cells plotted on first two PCs
PCAPlot(s1,dim.1=1,dim.2=2) 

#heatmap showing top genes for first PC only
PCHeatmap(s1,pc.use=1,cells.use=500,do.balanced = T,label.columns = F)
VizPlot output for first two PCs

 

 

 

 

The heatmap command can also be useful to visualize how many PCs explain variance–try 6, 12, or 18.

PCHeatmap(s1,pc.use=1:6,cells.use = 500,do.balanced = T,label.columns = F,use.full=F)
Heatmap of first 6 PCs. Note how the heatmaps become less distinct with higher-numbered PCs.

 

 

 

 

Seurat also includes a method to evaluate statistically significant PCs using jackStraw:

s1=JackStraw(s1,num.replicate=100,display.progress = T)

#to find how many are significant
JackStrawPlot(s1,PCs=1:18)
#another, simpler way to visualize 
PCElbowPlot(s1)
PCA Elbow Plot.

Clustering

The graph-based clustering method in Seurat relies on the PCA space for data reduction and uses methods similar to KNN and SLM–see the Seurat website for details.  Choose how many PC dimensions you want to include based on the elbow plot.

s1=FindClusters(s1,reduction.type = "pca",dims.use=1:10,resolution=0.6,print.output = 0,save.SNN = T)

#same PCA plot as above, now colored by cluster
PCAPlot(s1,dim.1=1,dim.2=2)

A better way to visualize clusters is to use the tSNE plot:

s1=RunTSNE(s1,dims.use=1:10,do.fast=T)
TSNEPlot(s1) # plot in tSNE space
tSNE Plot Colored by Cluster Number

Genes distinguishing clusters

You can then search for genes that distinguish clusters.  In this example, I chose cluster 1 because I knew it expressed a number of characteristic neuronal markers.  In this example I asked for genes distinguishing cluster 1 from all other clusters but you can also do binary comparisons if you like, using “ident.2” as another argument.

cluster1.markers=FindMarkers(s1,ident.1=1,min.pct=0.25)
print(head(cluster1.markers,n=5))
      p_val avg_logFC pct.1 pct.2 p_val_adj
Aldoc     0 0.9771343 0.935 0.511         0
Hes5      0 0.8220392 0.494 0.179         0
Fabp7     0 0.7561873 0.821 0.405         0
Sparc     0 0.7385394 0.780 0.372         0
Gstm1     0 0.7359628 0.726 0.386         0

Extend this to search for markers for all clusters:

s1.markers=FindAllMarkers(s1,only.pos=T,min.pct=0.25,thresh.use=0.25)
#Display a short table showing top 2 genes per cluster
s1.markers %>% group_by(cluster) %>% top_n(2,avg_logFC)
# A tibble: 16 x 7
# Groups:   cluster [8]
       p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene    
       <dbl>     <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>   
 1 7.62e- 38     0.352 0.306 0.212 1.07e- 33 0       Ptgds   
 2 4.24e- 30     0.325 0.413 0.331 5.93e- 26 0       Selenbp1
 3 0.            0.977 0.935 0.511 0.        1       Aldoc   
 4 6.58e-298     0.829 0.499 0.189 9.21e-294 1       Apoe    
 5 0.            1.36  0.757 0.298 0.        2       Tubb3   
 6 0.            1.16  0.729 0.337 0.        2       Mllt11  
 7 0.            1.37  0.686 0.145 0.        3       Top2a   
 8 0.            1.23  0.851 0.326 0.        3       Hmgb2   
 9 0.            1.51  0.978 0.647 0.        4       Dbi     
10 0.            1.37  0.603 0.123 0.        4       Ntrk2   
11 3.49e- 74     0.512 0.707 0.38  4.88e- 70 5       Hmgb2   
12 7.08e- 67     0.489 0.473 0.204 9.91e- 63 5       Top2a   
13 0.            1.29  0.61  0.103 0.        6       Lhx1    
14 1.15e-222     1.12  0.712 0.242 1.61e-218 6       Cks1b   
15 0.            1.67  0.766 0.137 0.        7       Notch2  
16 1.72e-174     0.944 0.317 0.052 2.40e-170 7       Slc39a1 

This is only one of the possible differential expression tests available–see this vignette to get a list of all of them.  For one more example, we’ll use the “roc” method to identify differential genes:

cluster1.roc=FindMarkers(s1,ident.1 = 1,thresh.use=0.25,test.use="roc",only.pos=T)
print(head(cluster1.roc,n=5))
      myAUC  avg_diff power avg_logFC pct.1 pct.2 p_val_adj
Aldoc 0.807 0.9771343 0.614 0.9771343 0.935 0.511        NA
Cst3  0.793 0.7022215 0.586 0.7022215 0.997 0.848        NA
Ptn   0.746 0.6726631 0.492 0.6726631 0.918 0.576        NA
Fabp7 0.736 0.7561873 0.472 0.7561873 0.821 0.405        NA
Itm2b 0.736 0.6530243 0.472 0.6530243 0.905 0.588        NA

You can now plot distributions of expression for each cluster for specific genes in a violin plot:

VlnPlot(s1,features.plot = c("Aldoc","Hmgb2"))
Violin plot for two genes split by cluster number

Or you can plot a longer list overlaid on tSNE plots:

FeaturePlot(s1,features.plot = c("Ptgds","Aldoc","Tubb3","Top2a","Dbi","Hmgb2","Lhx1","Notch2"),cols.use=c("grey","blue"),reduction.use="tsne")
tSNE plot highlighting top genes per cluster

Finally, you can visualize the top gene markers per cluster and plot a heatmap across all cells:

top10=s1.markers %>% group_by(cluster) %>% top_n(10,avg_logFC)
DoHeatmap(s1,genes.use=top10$gene,slim.col.label=T,remove.key = T)
Heatmap for top genes per cluster, for each cell within each cluster

Next up, two samples…

 

 

Analysis of single-cell RNAseq data with CellrangerRkit

Now that you’ve run cellranger count and maybe even cellranger aggr on your single-cell RNAseq samples, you’re ready to start exploring.

As discussed previously, you have results to explore without firing up your RStudio.  This page describes many of the output files.  The cellranger output includes the following useful files:

  • sample/outs/web_summary.html – Open with your web browser to see basic QC on the library, any warnings about low-quality results, and simple summary statistics.  If you used aggr, it lists counts by library. Click on “Analysis” and you’ll see some basic tSNE plots (can be colored by Kmeans clustering or library id) along with sortable lists of genes distinguishing Kmeans clusters.
  • sample/outs/cloupe.cloupe – As mentioned previously, this file can be opened with 10X Genomics Loupe Cell Browser.
  • sample/outs/analysis/diffexp/kmeans_n_clusters (n being 2 through 10) – Each folder contains a csv file listing differential expression of genes by cluster, along with p-values.
  • sample/outs/filtered_gene_bc_matrices_mex/mm10 (obviously I used mm10 genome) – this contains the files needed for loading into Seurat.  More on that in the next post.

This is only a basic sampling.  Explore the folders to find more.

Setting up R for cellrangerRkit

Start by downloading R from a CRAN mirror, and the free desktop version of RStudio.  Follow directions for your operating system for downloading and installing cellrangerRkit from 10X Genomics.

10X Genomics provides a nice vignette using public PBMC data.  My outline will follow this vignette with some explanations and variations.

The full R script file containing the commands shown here can be downloaded from this link.

Load your data

Fire up RStudio and load the library:

library(cellrangerRkit)

Use the Files tab in RStudio to change directories to where you have stored your sample results.  Choose “Set as working directory” from the More menu.  The system will enter this command for you (with the example location from my system):

setwd("C:/scRNAseq/agg")

The “agg” folder is the one I used to collect output from my cellranger aggr job from the previous post.  It contains the “outs” folder along with some other stuff.

Now you can use this same directory address to specify your “pipestance_path” (cellranger’s term for it):

cellranger_pipestance_path="C:/scRNAseq/agg"
gbm=load_cellranger_matrix(cellranger_pipestance_path)
analysis_results=load_cellranger_analysis_results(cellranger_pipestance_path)

At this point you’ve got a giant object for the GeneBCMatrix (gbm, 21.3 Mb in my experiment) and a list of analysis results (68.1 Mb).

Check expression levels

Let’s extract the tSNE plot data and visualize the RNA read depth for each cell.

tsne_proj=analysis_results$tsne
visualize_umi_counts(gbm,tsne_proj[c("TSNE.1","TSNE.2")],limits=c(3,4),marker_size = 0.05)
numUMIs
tSNE plot with each dot representing a single cell, colored by number of detected sequencing reads.

This will draw a standard tSNE plot with the total number of UMIs (unique molecular identifiers – the tag specific for each cell) for each cell.  Each dot is one cell.  The scale is in log10 of the UMI number per cell.

Looks good so far, let’s move on to choosing only expressed genes.

Nonzero genes and normalization

use_genes=get_nonzero_genes(gbm)
gbm_bcnorm=normalize_barcode_sums_to_median(gbm[use_genes,])
gbm_log=log_gene_bc_matrix(gbm_bcnorm,base=10)

This gives you an array of indices for the genes observed in any cell, and uses only these genes to normalize and log10 convert expression levels.

print(dim(gbm_log))
[1] 17627 57575

This gives you the number of non-zero genes (17,627) and cells (57,575) in your project.

Check expression of selected genes

If you already know a few genes that likely distinguish various sub-groups of cells in your experiment, you can plot them now.  Manually create an array of gene symbols first.

genes=c("Ddx21","Ldha","Sox11","Fabp7")
visualize_gene_markers(gbm_log,genes,tsne_proj[c("TSNE.1","TSNE.2")],limits=c(0,1.5))
gene tSNE plot
tSNE plots depicting detectable levels of indicated genes per cell.

This produces a faceted plot with each showing expression level by cell for that gene.

If you’re lucky, key cell types can be identified rapidly using this method.  If not, you’ll need to continue on to clustering to identify genes distinguishing clusters.

Plot libraries

Another simple example is to color the dots in the tSNE plot by library identifier.  Here we’re going to use a trick from this post at the 10X Genomics help site, where we strip the library id number from the grouped cell barcode identifier.  When cellranger aggr runs to combine multiple libraries, each one contains the same set of UMI barcodes.  So to distinguish them, it adds a “_1”, “_2” and so on for each library.  This code retrieves and extracts that to create a “gem_group” vector.

gem_group=sapply(strsplit(as.character(pData(gbm)$barcode),"-"), '[[',2)
visualize_clusters(gem_group,tsne_proj[c("TSNE.1","TSNE.2")])
tSNE by library
tSNE plot colored by sample number (library ID value).

The result is the same tSNE plot, now colored by library id value.  This should match up the order of samples in your agg_samples.csv file when you ran cellranger aggr.

Clearly in my case samples 3 and 4 are distinct from samples 1 and 2, which overlap nicely.

Clusters

Now let’s load the pre-calculated cluster data and explore a little.  Cellranger automatically creates Kmeans clusters with k ranging from 2 to 10.

n_clu=2:10
km_res=analysis_results$clustering
clu_res=sapply(n_clu,function(x) km_res[[paste("kmeans",x,"clusters",sep="_")]]$Cluster)
colnames(clu_res)=sapply(n_clu,function(x) paste("kmeans",x,sep="."))
visualize_clusters(clu_res,tsne_proj[c("TSNE.1","TSNE.2")])

This will plot panels for each k value, showing how the tSNE plot is divided by that number of clusters.

kmeans tSNE
tSNE plots split by k value (number of clusters) and colored by k-means cluster number.

I’ll choose k=5 to examine more closely.

example_K=5
example_col=rev(brewer.pal(example_K,"Set2"))
cluster_result=analysis_results$clustering[[paste("kmeans",example_K,"clusters",sep="_")]]
visualize_clusters(cluster_result$Cluster,tsne_proj[c("TSNE.1","TSNE.2")],colour=example_col)
5k cluster tSNE
tSNE plot colored by cluster number with k=5 clusters.

 

We’ll next extract genes that best distinguish the 5 clusters and draw a heatmap of the top 3 genes to compare among clusters.

cells_to_plot=order_cell_by_clusters(gbm,cluster_result$Cluster)
prioritized_genes=prioritize_top_genes(gbm,cluster_result$Cluster,"sseq",min_mean=0.5)gbm_pheatmap(log_gene_bc_matrix(gbm),prioritized_genes,cells_to_plot,n_genes=3,colour=example_col,limits=c(-1,2))
heatmap k5 clusters
Heatmap of top 3 genes per cluster, with vertical slices representing individual cells.

The numbers of cells per cluster and the proportion of cells in each cluster is displayed with this function:

cell_composition(cluster_result$Cluster)
Cell composition: 
                    1          2          3          4          5
num_cells  1.6848e+04 1.2031e+04 1.1501e+04 9348.00000 7847.00000
proportion 2.9263e-01 2.0896e-01 1.9976e-01    0.16236    0.13629

Finally, you can output gene lists by cluster.  First be sure to create a “gene_sets” folder within your working directory, then:

output_folder="I:/scRNAseq/agg/gene_sets"
write_cluster_specific_genes(prioritized_genes,output_folder,n_genes=20)

Next step…

We’ll compare with SeuratChapter 1: Single sample analysis.

Single-Cell RNAseq with CellRanger on the Perceval Cluster

The 10X Chromium system has become the gold standard for single-cell sequencing so it’s time to learn how to use 10X Genomics’ Cell Ranger software for processing results.  They’ve made the pipeline pretty easy.  The main limitation is that larger amounts of RAM (>64 Gb) are required for a reasonable analysis time.  I was able to install and run Cell Ranger on a 24 Gb Linux desktop but it took over a day to process a single sample.  The Rutgers Perceval cluster is a much better solution.  Most all nodes have at least 128 Gb RAM and usually 24 CPUs per node.

Samples were prepared and run on a 10X Genomics Chromium Controller.  Library prep followed 10X Genomics protocols.

We started by working with RUCDR Infinite Biologics to run the sequencing on an Illumina HiSeq system.  They correctly extracted the reads from the Illumina raw base call (BCL) files into one set of paired-end FASTQ files for us.

Fastq files and renaming

The problem was that the naming convention in the files we received did not match Cell Ranger’s preferences.  To fix this I used the Linux “rename” command.  This command is slightly different on different Linux installations.  In one form, you feed it a regex-style string.  On my system it used the older form like this:

rename <search> <replace> <files>

So my input files were named:

SampleName_R1_001.fastq.gz

(As well as a matching R2 file.) I needed them formatted like this:

SampleName_S1_L001_R1_001.fastq.gz

Where S1 is for sample 1, S2 for sample 2, etc.

Furthermore, it’s much easier to work with fastq files where the two files are in a single directory separated from other samples.  In my case I created four sample directories, each with a code name for the sample.  I moved the two appropriate fastq files into each sample directory. Then I renamed the files.  For each sample, I used this command:

rename SampleName SampleName_S1_L001 *

This was repeated for each sample.  I’m sure it would be easy to write a shell script to do all this but there’s seldom enough samples in a single-cell experiment to be worth the trouble.

Installing Cell Ranger

Go to the 10X Genomics Support site to download the current version of Cell Ranger.  Very conveniently, they post a curl or wget command to download the installer.  Copy one of these (I prefer wget) and login to the Perceval cluster.  Issue the wget comment to download.

Also download the appropriate reference dataset for your samples.  In my case I used the mm10 mouse reference.  I saved the archive to my /scratch/user/genomes folder and unpacked it.

To install, just unpack the archive and move the folder to a convenient location.  I used ~/bin.  Make sure to add this to your $PATH.  My preference is to add it to the .bash_profile.  Add a line like this:

PATH=$HOME/bin/cellranger-2.1.1:$PATH

and then re-load the profile like this:

source ~/.bash_profile

At this point you should be able to output the correct location with a which command:

which cellranger

The package is self-contained so merely unpacking it and adding it to your path should work.  To check, run the sitecheck command:

cellranger sitecheck > sitecheck.txt

This saves a bunch of installation-specific parameters to a file that you can review.  You can choose to upload the file to the 10X Genomics server and have them confirm your installation but that’s not necessary on Perceval (since we already know it works there).

Move files

Copy your renamed fastq files and directory structure to the /scratch/user space on Perceval using FileZilla.

SLURM Count Script

As with my earlier Perceval projects, I try to create a single batch script that can launch all samples in parallel using the array feature of SLURM.  At first I worked on a shell script to find all the sub-directories I had set up for the fastq files.  Then I decided to be lazy and just hard-code arrays of the required sample ID’s and the corresponding directory locations.  Here’s my working script, named CRcount.sh:

#!/bin/bash

#SBATCH -J CRcount
#SBATCH --nodes=1 
#SBATCH --cpus-per-task=16 #cpu threads per node
#SBATCH --mem=124000 #mem per node in MB
#SBATCH --time=6:00:00 
#SBATCH -p main
#SBATCH --export=ALL
#SBATCH --array=0-3 #range of numbers to use for the array. 
#SBATCH -o ./output/CRcount-%A-%a.out
#SBATCH -e ./output/CRcount-%A-%a.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=user@rutgers.edu

#here's my hard-coded samples lists
sampNames=(sample1 sample2 sample3 sample4)
dirNames=(/scratch/user/fastq/Sample1 /scratch/user/fastq/Sample2 /scratch/user/fastq/Sample3 /scratch/user/fastq/Sample4)

#use the SLURM array ID to pick one of the samples for processing
sampName=${sampNames[${SLURM_ARRAY_TASK_ID}]}
dirName=${dirNames[${SLURM_ARRAY_TASK_ID}]}
#grab the base sample name from the location
baseName=$(basename "${dirName}")

if [ ! -d ${dirName} ]
then
echo "${dirName} file not found! Stopping!"
exit 1
else
srun cellranger count --id=${sampName} --fastqs=${dirName} --sample=${baseName} --transcriptome=/scratch/user/genomes/Mus_musculus/refdata-cellranger-mm10-2.1.0 --expect-cells=10000
fi

This version takes my hard-coded ID names and directory locations, picks one per instance of the batch file (from the –array=0-3 line), checks that the directory exists, and then starts.  I manually entered the name of my mouse downloaded genome reference from 10X Genomics.  In my experiment, we loaded 20,000 cells and expect about 50% to be sequenced, so I manually entered 10,000 expected cells.  Your mileage may vary.

Note that I set a time limit of 6 hours.  This will depend on the number of reads in your library.  For my samples, 2 hours wasn’t long enough and even 4 hours failed for one sample.  If you do reach the end of your time limit, remember to delete the incomplete output folder so that cellranger doesn’t think there’s another job working on that output.

Issue the command:

sbatch scripts/CRcount.sh

Once all four libraries had finished running with the cellranger count command, the result is a set of four directories, each named with your “id” string from the command line.  There’s a file named web_summary.html in the outs subdirectory.  Load that into a web browser to view basic QC on your sample.

Similarly, there’s a file name cloupe.cloupe in the outs subdirectory that can be loaded into the 10X Genomics Loupe Cell Browser.

Aggregating libraries

To compare all samples side-by-side you need to re-run cellranger to combine the results into a single dataset.  This is done with the aggregate function of cellranger. First, create a CSV file containing the sample ID’s and the location of the molecule_info.h5 file from each sample.  Here’s mine, named agg_samples.csv:

library_id,molecule_h5
sample1,./sample1/outs/molecule_info.h5
sample2,./sample2/outs/molecule_info.h5
sample3,./sample3/outs/molecule_info.h5
sample4,./sample4/outs/molecule_info.h5

Now you’re ready to submit a single (non-array) SLURM script to aggregate the samples.  Here’s my SLURM script, named CRagg.sh:

#!/bin/bash

#SBATCH -J CRagg
#SBATCH --nodes=1 
#SBATCH --cpus-per-task=16 #cpu threads per node
#SBATCH --mem=124000 #mem per node in MB
#SBATCH --time=5:59:00 
#SBATCH -p main
#SBATCH --export=ALL
#SBATCH -o ./output/CRagg-%A-%a.out
#SBATCH -e ./output/CRagg-%A-%a.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=user@rutgers.edu


srun cellranger aggr --id=agg --csv=agg_samples.csv --normalize=mapped

If you’re confident of your cellranger count command array working you can even link the batch execution to successful completion of the earlier script.  Grab the job id from your CRcount.sh sbatch submission and issue this command:

sbatch --dependency=afterok:<jobid> scripts/CRagg.sh

Now you’ll see the CRcount jobs as well as the CRagg job in your squeue output, with (Dependency) listed for CRagg until all the count jobs are done.  No need to wait around.

When this is all done you’ll have a new subdirectory (agg, based on the id string in the command).  As before, there’s a web_summary.html and a cloupe.cloupe file to check results without further analysis.

Next, analyze results in R…

There are two excellent R packages that load cellranger output and allow customized analyses–cellrangerRkit and Seurat.

Acknowlegments

The Perceval cluster was supported in part by a grant from NIH (1S10OD012346-01A1) and is operated by the Rutgers Office of Advanced Research Computing.  Initial single-cell sequencing data for testing these scripts came from Dr. Kelvin Kwan.

Kallisto pipeline on a cluster

The Pachter lab recently released Kallisto, a novel strategy for RNAseq data analysis.  In this approach, RNAseq reads are not aligned with genome but instead matched to a KMER index based on the known transcripts for a specific organism.  Reads are only used to distinguish isoforms and count abundance.  This vastly reduces the run time and memory requirements–I found that it runs nicely on my Mac laptop!  However, for large numbers of samples, it is still simpler to take advantage of a parallel computing environment.  Furthermore, since I normally do my own primer filtering and deduplication, I had to find the appropriate methods to use on non-aligned fastq files.

I decided it made sense to deduplicate first.  I found a number of links for programs that can deduplicate fastq files, and even paired-end fastq files.  I settled on using ParDRe, which had a number of features and options that I liked.  The source code is found on Sourceforge.  The README file says to adjust the listing of libraries in the makefile to suit your environment but I found that I only had to load the mvapich2 module (v.2.2) on Perceval to compile without errors.  I also had to load the intel module (v. 16.0.3) to provide the runtime tools (mpirun).  (Don’t forget to load module intel before running the script!) The final executable was copied to my user’s bin directory (~/bin), which is in my standard $PATH.

For the filtering step, I decided to use fastp, which combines a trimmomatic-like trimmer with a quality evaluation output like FastQC.  This time, I cheated and downloaded the Linux binary file directly to my user’s bin directory.

Finally, for Kallisto, I downloaded the Linux archive, compiled, and copied the executable to my bin directory.  As with fastp,you can also use the git clone method.

Now we’re ready to build the SLURM scripts.  In my previous RNAseq pipeline, I had used linked aliases to the actual input files to simplify the use of SLURM arrays.  But I decided to instead use some simple bash commands to build an array of the input file names in the current working directory and then pick one file out with the $SLURM_ARRAY_TASK_ID variable.  This way, all the files retain original sample ID’s and naming conventions.  I should note that my sequencing service lab delivers FastQ file names in the format:  SAMPLEID_R1_001.fastq.gz so my script builds on that format.  If your input format is different you’ll need to adjust the filename editing commands.  Importantly, in each script I have to manually set the index array variable range to begin with 0 (since bash arrays begin with 0) and end at the number of samples minus one.  So, in my example scripts, I specify 0-11 for 12 samples.  You will need to manually adjust this for every set of samples.

I broke the pipeline into three steps so I could check intermediate results and optimize cluster parameters in each SLURM script separately.  You may prefer to combine all steps into one script or chain them together by making subsequent steps dependent on completion of the prior step (see SLURM examples to see how this is done).

The first step is for ParDRe deduplication.  First, use bash to create an array of all the R1 files:

R1=(`ls *R1_dedup_filt.fastq.gz`)

Then choose the current member of the array using the $SLURM_ARRAY_TASK_ID variable.  I combined this with removing the “.gz” suffix.  A second step removes the “.fastq” suffix.

fname=${R1[$SLURM_ARRAY_TASK_ID]%\.*}
fname=${fname%\.*}

Now we have the appropriate basename in the form “SAMPLEID_R1_001”.  I prepare the output file name using a substitution.

dedupName=${fname/001/dedup}

Before proceeding to the ParDre command, I check for the presence of the matching R2 command.  If it’s found, begin the mpirun command with the paired-end input files and the edited output file names.  Using -n 4 in the mpirun command specifies 4 nodes–this must match the –nodes directive in the SBATCH section.  The -z 6 specifies the default gzip compression in the input fastq.gz files.  Finally, the -t 8 option asks for 8 threads per node and again, this must match the SBATCH directives.

if [ ! -f ${fname/R1/R2}.fastq.gz ]
then
 echo "${fname/R1/R2}.fastq.gz file not found! Stopping!"
 exit 1
else
 #echo "${fname/R1/R2}.fastq.gz file found."
 mpirun -n 4 ~/bin/ParDRe -z 6 \
 -i ${fname}.fastq.gz \
 -p ${fname/R1/R2}.fastq.gz \
 -o ${dedupName}.fastq.gz \
 -r ${dedupName/R1/R2}.fastq.gz \
 -t 8
fi

Here’s a link to the entire script file.

When this is complete, the fastp step is run.  I did some testing and found that a single node per job and 4 threads (as used in the fastp examples) ran about as fast as any other combination.  I again input the R1 files (this time deduplicated versions), check for R2, and run the command.  The html and json files are given sample-specific names.  After fastp runs, I check for successful completion and then delete the intermediate deduplicated files.  Here’s the script file.

The last step uses Kallisto and a transcripts.idx index to output counts for quantitation.  (Note, a good place to download a cDNA FASTA file with which to build an index is at the Ensemble FTP Download page.)   Each sample writes the standard files into a new directory named according to the SAMPLEID portion of the input file name.  Note that the number of bootstraps can easily be increased to the recommended 1000 range.  Here’s the script.

When you’re done, transfer all the sample directories to your desktop and load into sleuth, running in R.  Here’s my steps to load all the data:

library(sleuth)

#grab sample names from folders under "results"
sample_id=dir(file.path(".","results"))

#grab subfolders containing kallisto output
kal_dirs=file.path(".","results",sample_id)

#assemble samples table manually
s2c=data.frame(sample=sample_id,genotype=c(rep("Flox",3),rep("KO",3),rep("Flox",3),rep("KO",3)),treatment=c(rep("noPMA",6),rep("PMA",6)),path=kal_dirs,stringsAsFactors = F)

In this example I have 12 samples with 3 replicates each of 4 groups, defined by two factors.  Adjust as needed.

Hope this post gets you started using SLURM to run Kallisto.  Naturally, you might choose to change the dedup and/or filtering steps, or leave them out altogether.

Update:  I recently had a dataset where I had to reverse the order of the fastp and ParDRe steps.  It seemed as if ParDRe complained about the fastq files but filtering them with fastp solved the problem.  The logic of doing the adapter trimming before or after the deduplication is a great topic for further discussion.