Blog

Single-Cell RNAseq with CellRanger on the Perceval Cluster

The 10X Chromium system has become the gold standard for single-cell sequencing so it’s time to learn how to use 10X Genomics’ Cell Ranger software for processing results.  They’ve made the pipeline pretty easy.  The main limitation is that larger amounts of RAM (>64 Gb) are required for a reasonable analysis time.  I was able to install and run Cell Ranger on a 24 Gb Linux desktop but it took over a day to process a single sample.  The Rutgers Perceval cluster is a much better solution.  Most all nodes have at least 128 Gb RAM and usually 24 CPUs per node.

Samples were prepared and run on a 10X Genomics Chromium Controller.  Library prep followed 10X Genomics protocols.

We started by working with RUCDR Infinite Biologics to run the sequencing on an Illumina HiSeq system.  They correctly extracted the reads from the Illumina raw base call (BCL) files into one set of paired-end FASTQ files for us.

Fastq files and renaming

The problem was that the naming convention in the files we received did not match Cell Ranger’s preferences.  To fix this I used the Linux “rename” command.  This command is slightly different on different Linux installations.  In one form, you feed it a regex-style string.  On my system it used the older form like this:

rename <search> <replace> <files>

So my input files were named:

SampleName_R1_001.fastq.gz

(As well as a matching R2 file.) I needed them formatted like this:

SampleName_S1_L001_R1_001.fastq.gz

Where S1 is for sample 1, S2 for sample 2, etc.

Furthermore, it’s much easier to work with fastq files where the two files are in a single directory separated from other samples.  In my case I created four sample directories, each with a code name for the sample.  I moved the two appropriate fastq files into each sample directory. Then I renamed the files.  For each sample, I used this command:

rename SampleName SampleName_S1_L001 *

This was repeated for each sample.  I’m sure it would be easy to write a shell script to do all this but there’s seldom enough samples in a single-cell experiment to be worth the trouble.

Installing Cell Ranger

Go to the 10X Genomics Support site to download the current version of Cell Ranger.  Very conveniently, they post a curl or wget command to download the installer.  Copy one of these (I prefer wget) and login to the Perceval cluster.  Issue the wget comment to download.

Also download the appropriate reference dataset for your samples.  In my case I used the mm10 mouse reference.  I saved the archive to my /scratch/user/genomes folder and unpacked it.

To install, just unpack the archive and move the folder to a convenient location.  I used ~/bin.  Make sure to add this to your $PATH.  My preference is to add it to the .bash_profile.  Add a line like this:

PATH=$HOME/bin/cellranger-2.1.1:$PATH

and then re-load the profile like this:

source ~/.bash_profile

At this point you should be able to output the correct location with a which command:

which cellranger

The package is self-contained so merely unpacking it and adding it to your path should work.  To check, run the sitecheck command:

cellranger sitecheck > sitecheck.txt

This saves a bunch of installation-specific parameters to a file that you can review.  You can choose to upload the file to the 10X Genomics server and have them confirm your installation but that’s not necessary on Perceval (since we already know it works there).

Move files

Copy your renamed fastq files and directory structure to the /scratch/user space on Perceval using FileZilla.

SLURM Count Script

As with my earlier Perceval projects, I try to create a single batch script that can launch all samples in parallel using the array feature of SLURM.  At first I worked on a shell script to find all the sub-directories I had set up for the fastq files.  Then I decided to be lazy and just hard-code arrays of the required sample ID’s and the corresponding directory locations.  Here’s my working script, named CRcount.sh:

#!/bin/bash

#SBATCH -J CRcount
#SBATCH --nodes=1 
#SBATCH --cpus-per-task=16 #cpu threads per node
#SBATCH --mem=124000 #mem per node in MB
#SBATCH --time=6:00:00 
#SBATCH -p main
#SBATCH --export=ALL
#SBATCH --array=0-3 #range of numbers to use for the array. 
#SBATCH -o ./output/CRcount-%A-%a.out
#SBATCH -e ./output/CRcount-%A-%a.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=user@rutgers.edu

#here's my hard-coded samples lists
sampNames=(sample1 sample2 sample3 sample4)
dirNames=(/scratch/user/fastq/Sample1 /scratch/user/fastq/Sample2 /scratch/user/fastq/Sample3 /scratch/user/fastq/Sample4)

#use the SLURM array ID to pick one of the samples for processing
sampName=${sampNames[${SLURM_ARRAY_TASK_ID}]}
dirName=${dirNames[${SLURM_ARRAY_TASK_ID}]}
#grab the base sample name from the location
baseName=$(basename "${dirName}")

if [ ! -d ${dirName} ]
then
echo "${dirName} file not found! Stopping!"
exit 1
else
srun cellranger count --id=${sampName} --fastqs=${dirName} --sample=${baseName} --transcriptome=/scratch/user/genomes/Mus_musculus/refdata-cellranger-mm10-2.1.0 --expect-cells=10000
fi

This version takes my hard-coded ID names and directory locations, picks one per instance of the batch file (from the –array=0-3 line), checks that the directory exists, and then starts.  I manually entered the name of my mouse downloaded genome reference from 10X Genomics.  In my experiment, we loaded 20,000 cells and expect about 50% to be sequenced, so I manually entered 10,000 expected cells.  Your mileage may vary.

Note that I set a time limit of 6 hours.  This will depend on the number of reads in your library.  For my samples, 2 hours wasn’t long enough and even 4 hours failed for one sample.  If you do reach the end of your time limit, remember to delete the incomplete output folder so that cellranger doesn’t think there’s another job working on that output.

Issue the command:

sbatch scripts/CRcount.sh

Once all four libraries had finished running with the cellranger count command, the result is a set of four directories, each named with your “id” string from the command line.  There’s a file named web_summary.html in the outs subdirectory.  Load that into a web browser to view basic QC on your sample.

Similarly, there’s a file name cloupe.cloupe in the outs subdirectory that can be loaded into the 10X Genomics Loupe Cell Browser.

Aggregating libraries

To compare all samples side-by-side you need to re-run cellranger to combine the results into a single dataset.  This is done with the aggregate function of cellranger. First, create a CSV file containing the sample ID’s and the location of the molecule_info.h5 file from each sample.  Here’s mine, named agg_samples.csv:

library_id,molecule_h5
sample1,./sample1/outs/molecule_info.h5
sample2,./sample2/outs/molecule_info.h5
sample3,./sample3/outs/molecule_info.h5
sample4,./sample4/outs/molecule_info.h5

Now you’re ready to submit a single (non-array) SLURM script to aggregate the samples.  Here’s my SLURM script, named CRagg.sh:

#!/bin/bash

#SBATCH -J CRagg
#SBATCH --nodes=1 
#SBATCH --cpus-per-task=16 #cpu threads per node
#SBATCH --mem=124000 #mem per node in MB
#SBATCH --time=5:59:00 
#SBATCH -p main
#SBATCH --export=ALL
#SBATCH -o ./output/CRagg-%A-%a.out
#SBATCH -e ./output/CRagg-%A-%a.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=user@rutgers.edu


srun cellranger aggr --id=agg --csv=agg_samples.csv --normalize=mapped

If you’re confident of your cellranger count command array working you can even link the batch execution to successful completion of the earlier script.  Grab the job id from your CRcount.sh sbatch submission and issue this command:

sbatch --dependency=afterok:<jobid> scripts/CRagg.sh

Now you’ll see the CRcount jobs as well as the CRagg job in your squeue output, with (Dependency) listed for CRagg until all the count jobs are done.  No need to wait around.

When this is all done you’ll have a new subdirectory (agg, based on the id string in the command).  As before, there’s a web_summary.html and a cloupe.cloupe file to check results without further analysis.

Next, analyze results in R…

There are two excellent R packages that load cellranger output and allow customized analyses–cellrangerRkit and Seurat.

Acknowlegments

The Perceval cluster was supported in part by a grant from NIH (1S10OD012346-01A1) and is operated by the Rutgers Office of Advanced Research Computing.  Initial single-cell sequencing data for testing these scripts came from Dr. Kelvin Kwan.

Advertisement

Kallisto pipeline on a cluster

The Pachter lab recently released Kallisto, a novel strategy for RNAseq data analysis.  In this approach, RNAseq reads are not aligned with genome but instead matched to a KMER index based on the known transcripts for a specific organism.  Reads are only used to distinguish isoforms and count abundance.  This vastly reduces the run time and memory requirements–I found that it runs nicely on my Mac laptop!  However, for large numbers of samples, it is still simpler to take advantage of a parallel computing environment.  Furthermore, since I normally do my own primer filtering and deduplication, I had to find the appropriate methods to use on non-aligned fastq files.

I decided it made sense to deduplicate first.  I found a number of links for programs that can deduplicate fastq files, and even paired-end fastq files.  I settled on using ParDRe, which had a number of features and options that I liked.  The source code is found on Sourceforge.  The README file says to adjust the listing of libraries in the makefile to suit your environment but I found that I only had to load the mvapich2 module (v.2.2) on Perceval to compile without errors.  I also had to load the intel module (v. 16.0.3) to provide the runtime tools (mpirun).  (Don’t forget to load module intel before running the script!) The final executable was copied to my user’s bin directory (~/bin), which is in my standard $PATH.

For the filtering step, I decided to use fastp, which combines a trimmomatic-like trimmer with a quality evaluation output like FastQC.  This time, I cheated and downloaded the Linux binary file directly to my user’s bin directory.

Finally, for Kallisto, I downloaded the Linux archive, compiled, and copied the executable to my bin directory.  As with fastp,you can also use the git clone method.

Now we’re ready to build the SLURM scripts.  In my previous RNAseq pipeline, I had used linked aliases to the actual input files to simplify the use of SLURM arrays.  But I decided to instead use some simple bash commands to build an array of the input file names in the current working directory and then pick one file out with the $SLURM_ARRAY_TASK_ID variable.  This way, all the files retain original sample ID’s and naming conventions.  I should note that my sequencing service lab delivers FastQ file names in the format:  SAMPLEID_R1_001.fastq.gz so my script builds on that format.  If your input format is different you’ll need to adjust the filename editing commands.  Importantly, in each script I have to manually set the index array variable range to begin with 0 (since bash arrays begin with 0) and end at the number of samples minus one.  So, in my example scripts, I specify 0-11 for 12 samples.  You will need to manually adjust this for every set of samples.

I broke the pipeline into three steps so I could check intermediate results and optimize cluster parameters in each SLURM script separately.  You may prefer to combine all steps into one script or chain them together by making subsequent steps dependent on completion of the prior step (see SLURM examples to see how this is done).

The first step is for ParDRe deduplication.  First, use bash to create an array of all the R1 files:

R1=(`ls *R1_dedup_filt.fastq.gz`)

Then choose the current member of the array using the $SLURM_ARRAY_TASK_ID variable.  I combined this with removing the “.gz” suffix.  A second step removes the “.fastq” suffix.

fname=${R1[$SLURM_ARRAY_TASK_ID]%\.*}
fname=${fname%\.*}

Now we have the appropriate basename in the form “SAMPLEID_R1_001”.  I prepare the output file name using a substitution.

dedupName=${fname/001/dedup}

Before proceeding to the ParDre command, I check for the presence of the matching R2 command.  If it’s found, begin the srun command with the paired-end input files and the edited output file names.  Using -n 4 in the srun command specifies 4 nodes–this must match the –nodes directive in the SBATCH section.  Using –mpi=pmi2 tells srun which MPI library to use.  The -z 6 specifies the default gzip compression in the input fastq.gz files.  Finally, the -t 8 option asks for 8 threads per node and again, this must match the SBATCH directives.

if [ ! -f ${fname/R1/R2}.fastq.gz ]
then
 echo "${fname/R1/R2}.fastq.gz file not found! Stopping!"
 exit 1
else
 #echo "${fname/R1/R2}.fastq.gz file found."
 srun --mpi=pmi2 -n 4 ~/bin/ParDRe -z 6 \
 -i ${fname}.fastq.gz \
 -p ${fname/R1/R2}.fastq.gz \
 -o ${dedupName}.fastq.gz \
 -r ${dedupName/R1/R2}.fastq.gz \
 -t 8
fi

Here’s a link to the entire script file.

When this is complete, the fastp step is run.  I did some testing and found that a single node per job and 4 threads (as used in the fastp examples) ran about as fast as any other combination.  I again input the R1 files (this time deduplicated versions), check for R2, and run the command.  The html and json files are given sample-specific names.  After fastp runs, I check for successful completion and then delete the intermediate deduplicated files.  Here’s the script file.

The last step uses Kallisto and a transcripts.idx index to output counts for quantitation.  (Note, a good place to download a cDNA FASTA file with which to build an index is at the Ensemble FTP Download page.)   Each sample writes the standard files into a new directory named according to the SAMPLEID portion of the input file name.  Note that the number of bootstraps can easily be increased to the recommended 1000 range.  Here’s the script.

When you’re done, transfer all the sample directories to your desktop and load into sleuth, running in R.  Here’s my steps to load all the data:

library(sleuth)

#grab sample names from folders under "results"
sample_id=dir(file.path(".","results"))

#grab subfolders containing kallisto output
kal_dirs=file.path(".","results",sample_id)

#assemble samples table manually
s2c=data.frame(sample=sample_id,genotype=c(rep("Flox",3),rep("KO",3),rep("Flox",3),rep("KO",3)),treatment=c(rep("noPMA",6),rep("PMA",6)),path=kal_dirs,stringsAsFactors = F)

In this example I have 12 samples with 3 replicates each of 4 groups, defined by two factors.  Adjust as needed.

Hope this post gets you started using SLURM to run Kallisto.  Naturally, you might choose to change the dedup and/or filtering steps, or leave them out altogether.

Update:  I recently had a dataset where I had to reverse the order of the fastp and ParDRe steps.  It seemed as if ParDRe complained about the fastq files but filtering them with fastp solved the problem.  The logic of doing the adapter trimming before or after the deduplication is a great topic for further discussion.

Bowtie2 on Perceval

Following up on my previous post about using HISAT/StringTie on Perceval, I thought it would be useful to include a post about a simple Bowtie2 alignment, such as one for ChIP-seq.

Modules required

Perceval requires that specific software modules be loaded in the user’s environment prior to submitting jobs to the cluster.  For this pipeline, you’ll need bowtie2, samtools, and java.

module load bowtie2
module load samtools
module load java

(or)

module load bowtie2 samtools jav

SLURM Arrays

When running many parallel alignments it’s convenient to take advantage of SLURM’s array capability.  Before starting, make sure your input fastq files are named in such a way that the names are identical except for an index number, from 1 to the number of samples.  In this case, all my files are named AE1.fastq.gz through AE6.fastq.gz

Steps

  1. Prepare input fastq files.  I like to put all this in one subdirectory named “fastq”.  If multiple lanes were used, you can merely concatenate the gzipped files like this:
    cat AE1_AACCAG.R1.L001.fastq.gz AE1_AACCAG.R1.L002.fastq.gz > AE1.fastq.gz
  2. Edit a script to run the bowtie2 command.  Save this into a file named something like “arrayBowtie2.sh” Here’s my example:
    #!/bin/bash
    #SBATCH -J bt2 #jobname (optional)
    #SBATCH -n 1 #nodes
    #SBATCH -c 8 #cpu threads per node
    #SBATCH -t 3:59:00 #estimated run time
    #SBATCH -p main #partition name
    #SBATCH --export=ALL
    #SBATCH -o /home/user/data/output/bt2-%A-%a.out #where to put output if any
    #SBATCH -e /home/user/data/output/bt2-%A-%a.err #error output
    #SBATCH --array=1-6 #I have 6 samples
    #SBATCH --mail-type=END,FAIL  #send an email on end or if failure
    #SBATCH --mail-user=me@a.com #where to send email
    
    #align using default parameters with unpaired reads
    bowtie2 -p 8 -x /home/user/genomes/bowtie2/mm10 \
    -U /home/user/data/fastq/AE"${SLURM_ARRAY_TASK_ID}".fastq.gz \
    -S /home/user/data/results/AE"${SLURM_ARRAY_TASK_ID}".sam
    
    #note - I split the command with continue symbols ("\") but I normally don't
    #do this in my scripts
  3. Submit the sbatch command like this:
    sbatch arrayBowtie2.sh
  4. You can check progress with:
    squeue -u user
  5. Next you’ll want to run samtools to convert sam to bam, sort, and index the files.  Create a script file named something like “arraySamtools.sh” Submit using sbatch as before.  Here’s an example:
    #SBATCH -J samtools
    #SBATCH -n 1
    #SBATCH -t 3:59:00
    #SBATCH -p main
    #SBATCH --export=ALL
    #SBATCH -o /home/user/data/output/samtools-%A-%a.out
    #SBATCH -e /home/user/data/output/samtools-%A-%a.err
    #SBATCH --array=1-6 
    #SBATCH --mail-type=END,FAIL  
    #SBATCH --mail-user=me@a.com 
    
    #convert sam to bam
    samtools view -b -S /home/user/data/results/AE"${SLURM_ARRAY_TASK_ID}".sam > \
    /home/user/data/results/AE"${SLURM_ARRAY_TASK_ID}".bam
    
    #remove unneeded sam
    rm /home/user/data/results/AE"${SLURM_ARRAY_TASK_ID}".sam
    
    #sort using scratch disk space (faster) for work files
    samtools sort -T /scratch/user/xxx"${SLURM_ARRAY_TASK_ID}" \
    -o /home/user/data/results/AE"${SLURM_ARRAY_TASK_ID}".sorted.bam \
    /home/user/data/results/AE"${SLURM_ARRAY_TASK_ID}".bam
    
    #remove unneeded pre-sort bam files
    rm /home/user/data/results/AE"${SLURM_ARRAY_TASK_ID}".bam
    
    #index the new sorted files
    samtools index /home/user/data/results/AE"${SLURM_ARRAY_TASK_ID}".sorted.bam
  6. Now you’ll have sorted and indexed bam files.  For most applications, it’s helpful to remove PCR duplicates.  Fire up your editor and create a script file for running Picard.  Submit using sbatch as before.  Here’s an example:
    #!/bin/bash
    #SBATCH -J picard
    #SBATCH -n 1
    #SBATCH -t 3:59:00
    #SBATCH -p main
    #SBATCH --export=ALL
    #SBATCH -o /home/user/data/output/picard-%A-%a.out
    #SBATCH -e /home/user/data/output/picard-%A-%a.err
    #SBATCH --array=1-6 
    #SBATCH --mail-type=END,FAIL  
    #SBATCH --mail-user=me@a.com 
    
    #use java to run Picard's MarkDuplicates function
    java -Xmx4g -jar /home/user/bin/picard.jar MarkDuplicates \
    INPUT=/home/user/data/results/AE"${SLURM_ARRAY_TASK_ID}".sorted.bam \
    OUTPUT=/home/user/data/results/AE"${SLURM_ARRAY_TASK_ID}".uniq.bam \
    METRICS_FILE=/home/user/data/results/AE"${SLURM_ARRAY_TASK_ID}".metrics \
    REMOVE_DUPLICATES=true ASSUME_SORTED=true
    
    #index
    samtools index /home/user/data/results/AE"${SLURM_ARRAY_TASK_ID}".uniq.bam
  7. Now you’ll have deduplicated (“uniq”) versions and bai indexes.  I usually retain the original sorted bams as well as the deduplicated ones just in case.  Now you’re ready to start peak finding.

 

 

Modifying cummeRbund Plots

Don’t know why it took me so long but I realized that the RNAseq plots in cummeRbund are merely ggplot2 objects that can be modified.  Here’s an example of how you can clean up and alter the appearance of cummeRbund plots.

Say I have a list of genes and I want to draw an expression barplot for each gene and each sample.  There’s a nice command in cummeRbund, expressionBarplot(), that does this for you.  You can even feed as input a getGenes() command to extract only a subset of genes.  So if I have a vector of gene symbols named ctxList and a vector giving the order I want the samples displayed, named sampleIdList, I can issue this command:

expressionBarplot(getGenes(cuff,ctxList,sampleIdList=sampOrder))

I get this plot:

rplotexample1

But I want to fix a number of things.  I don’t like the default grey background.  I want to control how the error bars are drawn–in this case I want them based on confidence interval.  And I wanted to clean up colors and fonts.  So I wrote this function in R:

myPlotStyle=function(p, leg=T){
  if(leg){
    pos=c(.9,.8)
  } else {
    pos="none"
  }
  p+theme(axis.title.y=element_text(family="Arial",size=18),
          axis.text.y=element_text(family="Arial",size=16),
          axis.line.y=element_line(color="black",size=1),
          axis.text.x=element_text(family="Arial",size=18,color="black",vjust=.5),
          axis.line.x=element_line(color="black",size=1),
          axis.ticks.x=element_blank(),
          legend.title=element_blank(),
          legend.text=element_text(family="Arial",size=18),
          legend.position=pos,
          panel.background=element_rect(fill="white"))+
    scale_fill_brewer(palette="Set1")+
    geom_errorbar(mapping=aes(ymin=conf_lo,ymax=conf_hi),size=.2,na.rm = F,
                  width=.5,stat="identity",position=position_dodge(width=.9))
}

I wanted:

  • An option whether to include a legend or not, and to place it in the upper right corner if present
  • The ability to use the system font “Arial” (this requires the library extrafont, and I’ve already imported Arial into R before this step)
  • Set sizes of font and choices to exclude ticks and legend title
  • Use a white background
  • Bars colored using the color brewer palette
  • Finally, errorbars based on confidence intervals.  I figured out how the expressionBarplot function stored alternate variation metrics and picked the one I wanted.  There are others if you look. For example, you could use standard deviation by specifying fpkm-stdev and fpkm+stdev for ymin and ymax in the geom_errorbar() function

After loading the function into your environment, just re-issue your barplot command like this:

myPlotStyle(p=expressionBarplot(getGenes(cuff,ctxList,sampleIdList=sampOrder),
 showErrorbars=F),leg=T)

rplotexample2

This gives me something publishable.  I tell expressionBarplot not to plot error bars and add my own.  I can easily re-use the function for other plots with other lists of genes.  The colors are much more vibrant than the defaults.

The data are taken from Oni et al., 2016 — See Supplemental Figure 3.

 

Class Attendance Experiment

On Monday, I conducted a simple experiment in my Advanced Cell Biology class.  My hypothesis was that the sub-population of students who attend class is likely to score higher on exams.  To test my hypothesis, I looked up the frequency of scoring an “A” on Exam 1.  53 students scored an “A” out of 117 enrolled (yes, that seems high, but our CBN majors are really smart!).  During class I took a survey using PollEverywhere (below).  16 out of 26 students who responded reported that they scored an “A” on the same exam. More than 26 students attended the class but I figured that 26 was a reasonable sample of those who attended. What’s the probability that this happened by chance alone?exam1surveyclassattendance

This is an example of the classic “marble and urn” problem.  You have an urn containing 100 marbles–80 of them are black and 20 of them are white.  If you randomly draw 10 marbles from the urn, what’s the chance of drawing 8 black and 2 white?  Or of drawing 3 black and 7 white?  The probability can be calculated based on the hypergeometric distribution.

The “null hypothesis” is that the students attending class are a random subset of all those enrolled in the class (that is, no particularly enrichment of A’s).  Among my sample of 26 votes, I counted 16 A’s.  The total contents of the “urn” was 53 A’s and 117-53 grades that were not an A.

I calculated this using the dhyper function in the statistical programming environment R.  Here’s the command:

> sum(dhyper(16:26,53,117-53,26))
[1] 0.04829361

The arguments are:

  • 16:26 (the number of sampled students who got an A up to the total number sampled; I use the range to find out the probability of finding 16 or more A students in a sample of 26)
  • 53 (the total number of students in the class who got an A)
  • 117 – 53 (the total number of students in the class who didn’t get an A)
  • 26 (the sample size)

The result says the 0.048 probability of the hypergeometric distribution produces an enrichment of 16 A’s or more out of 26 samples.  So, at a threshold of p<0.05, I reject the null hypothesis and conclude that students attending class are more likely to score higher on exams.  This is not a cause-and-effect proof that attending class will get you a higher grade.  But it may be worth considering!  And certainly if you did not get an A and want to do better, it might help to attend class!