Single-Cell RNAseq with CellRanger on the Perceval Cluster

The 10X Chromium system has become the gold standard for single-cell sequencing, so it’s time to learn how to use 10X Genomics’ Cell Ranger software for processing results.  They’ve made the pipeline pretty easy.  The main limitation is that a large amount of RAM (>64 GB) is required for a reasonable analysis time.  I was able to install and run Cell Ranger on a 24 GB Linux desktop but it took over a day to process a single sample.  The Rutgers Perceval cluster is a much better solution.  Almost all nodes have at least 128 GB RAM and usually 24 CPUs per node.

Samples were prepared and run on a 10X Genomics Chromium Controller.  Library prep followed 10X Genomics protocols.

We started by working with RUCDR Infinite Biologics to run the sequencing on an Illumina HiSeq system.  They correctly extracted the reads from the Illumina raw base call (BCL) files into one set of paired-end FASTQ files for us.

Fastq files and renaming

The problem was that the naming convention in the files we received did not match Cell Ranger’s preferences.  To fix this I used the Linux “rename” command.  This command comes in two variants depending on the Linux installation.  In one form (the Perl version), you feed it a regex-style substitution string.  My system had the other, simpler form, which takes a search string and a replacement string:

rename <search> <replace> <files>

So my input files were named:

SampleName_R1_001.fastq.gz

(As well as a matching R2 file.) I needed them formatted like this:

SampleName_S1_L001_R1_001.fastq.gz

Where S1 is for sample 1, S2 for sample 2, etc.

Furthermore, it’s much easier to work with fastq files when the two files for each sample sit in their own directory, separate from other samples.  In my case I created four sample directories, each with a code name for the sample.  I moved the two appropriate fastq files into each sample directory.  Then I renamed the files.  For each sample, I used this command:

rename SampleName SampleName_S1_L001 *

This was repeated for each sample, changing S1 to S2, S3, and so on.  I’m sure it would be easy to write a shell script to do all this but there are seldom enough samples in a single-cell experiment to be worth the trouble.
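
For anyone who does want to script it, here is a minimal sketch of how the renaming could be done in one pass.  It is not what I actually ran.  It assumes one sub-directory per sample under a root folder, input files ending in _R1_001.fastq.gz or _R2_001.fastq.gz, and S numbers assigned in alphabetical directory order.  The function name rename_fastqs is made up for illustration, and it uses plain mv so it doesn't matter which variant of rename is installed.

```shell
#!/bin/bash
# Sketch: rename <name>_R1_001.fastq.gz -> <name>_S<n>_L001_R1_001.fastq.gz
# Assumes one sub-directory per sample under the root directory given as $1.
# rename_fastqs is a hypothetical helper, not part of Cell Ranger.
rename_fastqs() {
    local root=$1 n=0 dir f base prefix suffix
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue
        n=$((n + 1))                      # S1 for the first directory, S2 for the next...
        for f in "$dir"*_R[12]_001.fastq.gz; do
            [ -e "$f" ] || continue       # no matching files in this directory
            base=$(basename "$f")
            case "$base" in
                *_S[0-9]*_L00[0-9]_R[12]_001.fastq.gz) continue ;;  # already renamed
            esac
            prefix=${base%_R[12]_001.fastq.gz}   # e.g. SampleName
            suffix=${base#"$prefix"}             # e.g. _R1_001.fastq.gz
            mv -- "$f" "${dir}${prefix}_S${n}_L001${suffix}"
        done
    done
}
```

Calling rename_fastqs with the folder that holds the sample directories (for example, rename_fastqs ./fastq) renames everything in place.  Try it on a copy first.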

Installing Cell Ranger

Go to the 10X Genomics Support site to download the current version of Cell Ranger.  Very conveniently, they post a curl or wget command to download the installer.  Copy one of these (I prefer wget) and log in to the Perceval cluster.  Issue the wget command to download.

Also download the appropriate reference dataset for your samples.  In my case I used the mm10 mouse reference.  I saved the archive to my /scratch/user/genomes folder and unpacked it.

To install, just unpack the archive and move the folder to a convenient location.  I used ~/bin.  Make sure to add this to your $PATH.  My preference is to add it to the .bash_profile.  Add a line like this:

PATH=$HOME/bin/cellranger-2.1.1:$PATH

and then re-load the profile like this:

source ~/.bash_profile

At this point you should be able to output the correct location with a which command:

which cellranger

The package is self-contained so merely unpacking it and adding it to your path should work.  To check, run the sitecheck command:

cellranger sitecheck > sitecheck.txt

This saves a bunch of installation-specific parameters to a file that you can review.  You can choose to upload the file to the 10X Genomics server and have them confirm your installation but that’s not necessary on Perceval (since we already know it works there).

Move files

Copy your renamed fastq files and directory structure to the /scratch/user space on Perceval using FileZilla.

SLURM Count Script

As with my earlier Perceval projects, I tried to create a single batch script that can launch all samples in parallel using the array feature of SLURM.  At first I worked on a shell script to find all the sub-directories I had set up for the fastq files.  Then I decided to be lazy and just hard-code arrays of the required sample IDs and the corresponding directory locations.  Here’s my working script, named CRcount.sh:

#!/bin/bash

#SBATCH -J CRcount
#SBATCH --nodes=1 
#SBATCH --cpus-per-task=16 #cpu threads per node
#SBATCH --mem=124000 #mem per node in MB
#SBATCH --time=6:00:00 
#SBATCH -p main
#SBATCH --export=ALL
#SBATCH --array=0-3 #range of numbers to use for the array. 
#SBATCH -o ./output/CRcount-%A-%a.out
#SBATCH -e ./output/CRcount-%A-%a.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=user@rutgers.edu

#here's my hard-coded samples lists
sampNames=(sample1 sample2 sample3 sample4)
dirNames=(/scratch/user/fastq/Sample1 /scratch/user/fastq/Sample2 /scratch/user/fastq/Sample3 /scratch/user/fastq/Sample4)

#use the SLURM array ID to pick one of the samples for processing
sampName=${sampNames[${SLURM_ARRAY_TASK_ID}]}
dirName=${dirNames[${SLURM_ARRAY_TASK_ID}]}
#grab the base sample name from the location
baseName=$(basename "${dirName}")

#stop if the fastq directory is missing, otherwise run the count
if [ ! -d "${dirName}" ]
then
    echo "${dirName} directory not found! Stopping!"
    exit 1
else
    srun cellranger count --id="${sampName}" --fastqs="${dirName}" --sample="${baseName}" --transcriptome=/scratch/user/genomes/Mus_musculus/refdata-cellranger-mm10-2.1.0 --expect-cells=10000
fi

This version takes my hard-coded ID names and directory locations, picks one per instance of the batch file (from the --array=0-3 line), checks that the directory exists, and then starts.  I manually entered the path to the mouse genome reference I downloaded from 10X Genomics.  In my experiment, we loaded 20,000 cells and expect about 50% to be sequenced, so I manually entered 10,000 expected cells.  Your mileage may vary.
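
If you would rather not hard-code the lists, the directory-finding approach I abandoned could look something like this sketch.  It assumes every sub-directory of the fastq folder is one sample and that the directory name is acceptable as the cellranger id (in my script above the IDs were typed separately).  The function name find_samples is made up for illustration.

```shell
#!/bin/bash
# Sketch: fill the sampNames and dirNames arrays from sub-directories
# instead of hard-coding them. find_samples is a hypothetical helper.
find_samples() {
    local root=$1 dir
    sampNames=()
    dirNames=()
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue         # skip if the glob matched nothing
        dir=${dir%/}                      # drop the trailing slash
        dirNames+=("$dir")
        sampNames+=("$(basename "$dir")") # assumption: directory name doubles as the id
    done
}
```

In CRcount.sh this would replace the two hard-coded lines with a call such as find_samples /scratch/user/fastq.  The --array range is fixed when the job is submitted, so it still has to match the number of directories.  Options given on the sbatch command line override the #SBATCH lines, so something like sbatch --array=0-3 scripts/CRcount.sh works without editing the script.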

Note that I set a time limit of 6 hours.  This will depend on the number of reads in your library.  For my samples, 2 hours wasn’t long enough and even 4 hours failed for one sample.  If you do reach the end of your time limit, remember to delete the incomplete output folder so that cellranger doesn’t think there’s another job working on that output.
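
A quick way to spot which output folders need deleting is to look for the summary file that a finished run produces.  This is only a sketch.  It assumes that a missing outs/web_summary.html means the run did not finish, which is my own shortcut and not an official Cell Ranger check.  The function name list_incomplete is made up.

```shell
#!/bin/bash
# Sketch: print the output folders that exist but have no web summary yet.
# Assumption: a missing outs/web_summary.html marks an unfinished run.
# list_incomplete is a hypothetical helper, not a cellranger command.
list_incomplete() {
    local id
    for id in "$@"; do
        if [ -d "$id" ] && [ ! -f "$id/outs/web_summary.html" ]; then
            echo "$id"
        fi
    done
}
```

Run it from the directory where the jobs were submitted, for example list_incomplete sample1 sample2 sample3 sample4, and make sure no count jobs are still running before removing anything it lists.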

Issue the command:

sbatch scripts/CRcount.sh

Once all four libraries have finished running with the cellranger count command, the result is a set of four directories, each named with your “id” string from the command line.  There’s a file named web_summary.html in the outs subdirectory.  Load that into a web browser to view basic QC on your sample.

Similarly, there’s a file named cloupe.cloupe in the outs subdirectory that can be loaded into the 10X Genomics Loupe Cell Browser.

Aggregating libraries

To compare all samples side-by-side you need to re-run cellranger to combine the results into a single dataset.  This is done with the aggr command of cellranger.  First, create a CSV file containing the sample IDs and the location of the molecule_info.h5 file from each sample.  Here’s mine, named agg_samples.csv:

library_id,molecule_h5
sample1,./sample1/outs/molecule_info.h5
sample2,./sample2/outs/molecule_info.h5
sample3,./sample3/outs/molecule_info.h5
sample4,./sample4/outs/molecule_info.h5
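
With only four samples I typed the file by hand, but it could also be generated.  Here is a minimal sketch, assuming each sample ID matches its cellranger count output folder and the folders sit in the current directory.  The function name write_agg_csv is made up for illustration.

```shell
#!/bin/bash
# Sketch: write the aggregation CSV from a list of sample IDs.
# Assumes each ID is also the name of its count output folder.
# write_agg_csv is a hypothetical helper.
write_agg_csv() {
    local out=$1 id
    shift
    echo "library_id,molecule_h5" > "$out"
    for id in "$@"; do
        echo "${id},./${id}/outs/molecule_info.h5" >> "$out"
    done
}
```

For my samples the call would be write_agg_csv agg_samples.csv sample1 sample2 sample3 sample4, which reproduces the file shown above.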

Now you’re ready to submit a single (non-array) SLURM script to aggregate the samples.  Here’s my SLURM script, named CRagg.sh:

#!/bin/bash

#SBATCH -J CRagg
#SBATCH --nodes=1 
#SBATCH --cpus-per-task=16 #cpu threads per node
#SBATCH --mem=124000 #mem per node in MB
#SBATCH --time=5:59:00 
#SBATCH -p main
#SBATCH --export=ALL
#SBATCH -o ./output/CRagg-%j.out
#SBATCH -e ./output/CRagg-%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=user@rutgers.edu


srun cellranger aggr --id=agg --csv=agg_samples.csv --normalize=mapped

If you’re confident of your cellranger count command array working you can even link the batch execution to successful completion of the earlier script.  Grab the job id from your CRcount.sh sbatch submission and issue this command:

sbatch --dependency=afterok:<jobid> scripts/CRagg.sh

Now you’ll see the CRcount jobs as well as the CRagg job in your squeue output, with (Dependency) listed for CRagg until all the count jobs are done.  No need to wait around.
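
To avoid copying the job id by hand, the two submissions can be chained.  This sketch relies on the --parsable option of sbatch, which prints just the job id (possibly followed by a semicolon and the cluster name).  The function name submit_count_then_agg is made up, and the script paths are the ones used above.

```shell
#!/bin/bash
# Sketch: submit the count array, then the aggregation job that waits on it.
# submit_count_then_agg is a hypothetical wrapper around the two sbatch calls.
submit_count_then_agg() {
    local count_id
    # --parsable prints only the job id, optionally as "jobid;cluster"
    count_id=$(sbatch --parsable scripts/CRcount.sh) || return 1
    count_id=${count_id%%;*}              # keep the part before any semicolon
    # afterok on the array's job id waits for every array task to succeed
    sbatch --dependency=afterok:"${count_id}" scripts/CRagg.sh
}
```

Keep in mind that if any count task fails, the aggregation job will never start and will sit in the queue until you cancel it with scancel.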

When this is all done you’ll have a new subdirectory (agg, based on the id string in the command).  As before, there’s a web_summary.html and a cloupe.cloupe file to check results without further analysis.

Next, analyze results in R…

There are two excellent R packages that load cellranger output and allow customized analyses: cellrangerRkit and Seurat.

Acknowledgments

The Perceval cluster was supported in part by a grant from NIH (1S10OD012346-01A1) and is operated by the Rutgers Office of Advanced Research Computing.  Initial single-cell sequencing data for testing these scripts came from Dr. Kelvin Kwan.
