Tuxedo2 Pipeline on Perceval

The new Tuxedo RNAseq pipeline (HISAT2->StringTie->Ballgown) can be set up to take good advantage of the parallelization available on the Perceval high performance computing cluster.  Several steps in the process are run independently for each RNA sample.  This document gives an example of one efficient method for using the power of SLURM arrays to run these parallel steps with a set of simple, generalizable scripts.


Required Software

The Perceval cluster has a pre-built HISAT2 module that is up to date (at this writing).  Samtools is also available as a module.  However, the Picard java package and the StringTie executable must be downloaded and installed within the user’s account.  This can be done by creating a ~/bin directory, placing desired executables in that location, and adding it to your PATH.
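For a bash login shell, that setup is just a couple of commands (a sketch; adjust to taste):

```shell
# Create a personal bin directory for user-installed tools
# (picard.jar, the stringtie binary, etc.)
mkdir -p "$HOME/bin"

# Add it to PATH for the current session
export PATH="$HOME/bin:$PATH"

# To make the change permanent, append that export line to ~/.bashrc (once):
# echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc
```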

For this workflow, we’ll take advantage of SLURM arrays.  You create a script containing the appropriate commands, with #SBATCH directives specifying the resources required for each job.  The array feature uses a variable to specify a range of input files.  Given these components, SLURM will create a separate cluster job for each member of the array.  Using this feature you can write a single script file to execute as many jobs as you have samples.

On the Perceval cluster, use the /scratch/user space for all data and transfers within your jobs.  This means loading your raw fastq files here, saving a genome index file here, and saving output here.  The /scratch space is much faster than /home but it is not backed up so be sure to copy your data off the cluster when you’re done (or at least to your /home space).

One important note that is unique to the Perceval cluster: modules must be loaded while you are logged into the head node, before submitting jobs to the cluster.  So, before you start, make sure you run “module load HISAT2” and “module load samtools” (see the Perceval Quick Start Guide for more information on modules and versions).

Steps

The trick to using the array feature of SLURM is to have a set of filenames with an easily-coded numeric value that can be referenced by the script.  My strategy is as follows:

    1. Transfer fastq files to Perceval. From Windows this is simply accomplished using FileZilla.  From Linux or Mac, use rsync or scp for an SSH-based transfer.  Store all fastq files in a convenient location; I always use a directory named “fastq”.
    2. Extract the names of every fastq file into a text file. For example:
       find $(pwd) -name "*.fastq.gz" > fastq.txt

       The $(pwd) expands to the current directory, so each line in fastq.txt stores the full path to the file.  Not strictly needed, but often convenient.

    3. Transfer the contents of the fastq.txt file to Excel. This is a simple way to create a custom ln command for each file.  There are plenty of easy ways to do this (shell script, Python); choose your favorite.  See the attached Excel file and the resulting shell script for examples.  The result is a shell script with a single ln command on each line, one for each fastq file.
    4. Run the ln script from the directory where you want the links to be—the fastq folder for example. This creates a soft link for each fastq file with a numbered name.
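If you’d rather not round-trip through Excel, a small shell function can build the numbered links directly from fastq.txt.  This is only a sketch (the function name is mine); it assumes paired-end files whose sorted paths place each sample’s R1 line immediately before its R2:

```shell
# make_links LISTFILE: create numbered soft links (fastq1_R1.fastq.gz,
# fastq1_R2.fastq.gz, fastq2_R1.fastq.gz, ...) in the current directory,
# one pair per sample, from the full paths listed in LISTFILE.
make_links () {
    n=0
    sort "$1" | while read -r r1 && read -r r2; do
        n=$((n + 1))
        ln -s "$r1" "fastq${n}_R1.fastq.gz"
        ln -s "$r2" "fastq${n}_R2.fastq.gz"
    done
}

# Run it from the directory where the links should live, e.g.:
# make_links ../fastq.txt
```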
    5. Now you’re ready to run HISAT2. See the enclosed sbatch script named “arrayHisat.sh” (below).  Important parts of the script:
      -J jobname – not required (the default is the name of the script)
      -c 8 – this tells SLURM how many threads your program will run. It should match the -p value in the hisat2 command line.
      --array=1-15 – this is the key element for setting up the array of commands.  In my case I want to run 15 individual commands, with SLURM substituting each number in the range for the “${SLURM_ARRAY_TASK_ID}” variable.
      --mail-type and --mail-user – if you use these two directives the system will email you when the job is done.  I generally issue the execution command and then log out and wait for the email.  If you don’t want this, just delete these two lines.

      #!/bin/bash
      
      #SBATCH -J hisat #jobname (optional) 
      #SBATCH -n 1 #tasks (one per array element)
      #SBATCH -c 8 #cpu threads per task
      #SBATCH -t 3:59:00 #estimated run time
      #SBATCH -p main #partition--just use main
      #SBATCH --export=ALL
      #SBATCH -o /home/user/output/hisat-%A-%a.out #where to put output if any
      #SBATCH -e /home/user/output/hisat-%A-%a.err #where to put error output
      #SBATCH --array=1-15  #range of numbers to use for the array.  In my case I have 15 samples.
      #SBATCH --mail-type=END,FAIL  #send an email on end or if failure
      #SBATCH --mail-user=user@rutgers.edu #where to send email
      
      hisat2 -t -p 8 --known-splicesite-infile /scratch/user/data/hg19-splicesites.txt -x /scratch/user/genomes/hisat2/hg19/genome -1  /scratch/user/data/lfastq/fastq"${SLURM_ARRAY_TASK_ID}"_R1.fastq.gz -2 /scratch/user/data/lfastq/fastq"${SLURM_ARRAY_TASK_ID}"_R2.fastq.gz -S /scratch/user/data/aligned"${SLURM_ARRAY_TASK_ID}".sam
    6. You may wish to edit the hisat2 command line to suit your preferences. Here’s the key features that I chose:
      -t – Report execution time in output.  Not needed for anything. I just like to know.
      -p 8 – Use 8 threads.  I haven’t tested more or fewer threads.  See the Perceval partitions to see how many CPUs per node you can use.
      --known-splicesite-infile – This is optional, but I think you’ll probably want to use this feature.  It allows you to load a text file containing every known splice site from the reference GTF file for your genome assembly.  Unlike in TopHat, this reference splice map is useful whether you wish to search for novel transcripts or not.  You create this file using the hisat2_extract_splice_sites.py python script that comes with HISAT2.  See the HISAT2 documentation.
      -x – location of the index files.  For most genomes there are pre-built indices on the HISAT2 website.  Download them to Perceval and store them in your scratch space.
      -1 and -2 point to the names of the paired-end fastq.gz files (R1 and R2, respectively).  Since we’ve previously created soft links varying only by the sample number, we can specify the full range of names using the “${SLURM_ARRAY_TASK_ID}” variable.
      -S – location and name for the output SAM file created by HISAT2.
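The splice-site file only needs to be generated once per genome assembly.  A sketch, wrapped in a hypothetical helper function (the paths in the usage comment match the example hisat2 command above; hisat2_extract_splice_sites.py must be on your PATH):

```shell
# write_splice_sites GTF OUT: extract every known splice site from the
# reference GTF into OUT, using the hisat2_extract_splice_sites.py
# script that ships with HISAT2.
write_splice_sites () {
    python hisat2_extract_splice_sites.py "$1" > "$2"
}

# On the cluster, for example:
# write_splice_sites /scratch/user/data/hg19-genes.gtf /scratch/user/data/hg19-splicesites.txt
```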
    7. If you want to use HISAT2 with Cufflinks instead of StringTie, add this option to your command:
      --dta-cufflinks
    8. Store the arrayHisat.sh script someplace convenient. Since I modify these scripts for every experiment I don’t put them in my ~/bin directory.  I keep a “scripts” directory just below my working directory for the project.
    9. Before running, remember to load the HISAT2 module with the “module load HISAT2” command.  Then issue the command to send the script to SLURM:
       sbatch scripts/arrayHisat.sh
    10. You can monitor progress with:
       squeue -u <username>

      or

       squeue -j <jobnumber>
    11. When HISAT2 is complete for every sample, you should have one SAM file for each sample (or pair of fastq files).  These need to be converted to BAM files, sorted, and indexed.  Use the arraySamtools.sh script (below).  Its elements are similar to the arrayHisat.sh script; the SBATCH directives differ only in the job name and the elimination of the -c flag, because samtools as used here is not multi-threaded.  Each of the three steps is run as a separate command inside the same script.  Again, array-numbered file names are used to keep each sample separate.  New to samtools 1.3.1 is the requirement to specify a temp file prefix (-T).  This is important for keeping individual sample temp files separate, so use the sample number (from the array variable) in each prefix.  Use sbatch to issue the script to SLURM and wait for the job to be completed.  When you’re satisfied that the sorted, indexed BAM files have been created you can delete the SAM files and the original, unsorted BAM files.  Don’t forget to “module load samtools” before submitting this.
      #!/bin/bash
      
      #SBATCH -J samtools
      #SBATCH -n 1
      #SBATCH -t 3:59:00
      #SBATCH -p main
      #SBATCH --export=ALL
      #SBATCH -o /home/user/data/output/samtools-%A-%a.out
      #SBATCH -e /home/user/data/output/samtools-%A-%a.err
      #SBATCH --array=1-15
      #SBATCH --mail-type=END,FAIL
      #SBATCH --mail-user=user@rutgers.edu
      
      samtools view -b -S /scratch/user/data/results/aligned"${SLURM_ARRAY_TASK_ID}".sam > /scratch/user/data/results/aligned"${SLURM_ARRAY_TASK_ID}".bam
      samtools sort -T /scratch/user/xxx"${SLURM_ARRAY_TASK_ID}" -o /scratch/user/data/results/aligned"${SLURM_ARRAY_TASK_ID}".sorted.bam /scratch/user/data/results/aligned"${SLURM_ARRAY_TASK_ID}".bam 
      samtools index /scratch/user/data/results/aligned"${SLURM_ARRAY_TASK_ID}".sorted.bam
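
Before deleting anything, a quick integrity check on the sorted BAMs is cheap insurance.  A sketch using samtools quickcheck, available from samtools 1.3 on (the helper name and the path in the usage comment are mine):

```shell
# check_bams DIR: run `samtools quickcheck` over the sorted BAMs in DIR;
# quickcheck exits non-zero if any file is truncated or is missing its
# end-of-file marker.
check_bams () {
    samtools quickcheck "$1"/aligned*.sorted.bam && echo "all sorted BAMs look intact"
}

# On the cluster:
# check_bams /scratch/user/data/results
```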
      
    12. In some cases, if there is a concern that the RNAseq library has been over-amplified, you can eliminate duplicate reads using the Picard java package. See the example arrayPicard.sh script.  Script elements are similar to previous ones.  Don’t forget to “module load java” if you haven’t already before running this script.  I run this step as a separate script because it is not always part of the standard pipeline, but it could easily be added to the previous script.
      #!/bin/bash
      
      #SBATCH -J picard
      #SBATCH -n 1
      #SBATCH -t 3:59:00
      #SBATCH -p main
      #SBATCH --export=ALL
      #SBATCH -o /home/user/data/picard-%A-%a.out
      #SBATCH -e /home/user/data/picard-%A-%a.err
      #SBATCH --array=1-15
      #SBATCH --mail-type=END,FAIL
      #SBATCH --mail-user=user@rutgers.edu
      
      java -Xmx4g -jar /home/user/bin/picard.jar MarkDuplicates INPUT=/scratch/user/data/aligned"${SLURM_ARRAY_TASK_ID}".sorted.bam OUTPUT=/scratch/user/data/aligned"${SLURM_ARRAY_TASK_ID}".uniq.bam METRICS_FILE=/scratch/user/data/aligned"${SLURM_ARRAY_TASK_ID}".metrics REMOVE_DUPLICATES=true ASSUME_SORTED=true
      
    13. Now we’re ready to assemble transcripts with StringTie. In this first instance of StringTie, each of the BAM files (in this case the sorted, duplicates-removed versions) is used, along with a reference GTF file, to assemble aligned reads into transcripts.  This is done whether you intend to find novel transcripts or not.  The arrayStringtie.sh script includes the standard form of the command; you may choose other flags as needed.  In this example, I’ve specified 8 CPUs per job (the -p 8 flag, which must match the #SBATCH -c 8 directive).  The output from this step is a GTF file derived from the sample-specific alignments.
      #!/bin/bash
      
      #SBATCH -J stringtie
      #SBATCH -n 1
      #SBATCH -c 8
      #SBATCH -t 3:59:00
      #SBATCH -p main
      #SBATCH --export=ALL
      #SBATCH -o /home/user/data/stringtie-%A-%a.out
      #SBATCH -e /home/user/data/stringtie-%A-%a.err
      #SBATCH --array=1-15
      #SBATCH --mail-type=END,FAIL
      #SBATCH --mail-user=user@rutgers.edu
      
      /home/user/bin/stringtie/stringtie /scratch/user/data/aligned"${SLURM_ARRAY_TASK_ID}".uniq.bam -o /scratch/user/data/aligned"${SLURM_ARRAY_TASK_ID}".uniq.gtf -p 8 -G /scratch/user/data/hg19-genes.gtf 
      
    14. Now we re-run StringTie with the --merge flag. This is only done once for each experiment, not once for each sample, so formally it doesn’t need to be run on the cluster, but it’s convenient to use the same computing power and file locations in a simple script.  My sample script is named stMerge.sh.  Create a text file with the names of all the custom GTF files and use this as input (use the find command as we did for the fastq files above).  Output is a single, merged GTF file.
      #!/bin/bash
      
      #SBATCH -J stMerge
      #SBATCH -n 1
      #SBATCH -c 8
      #SBATCH -t 3:59:00
      #SBATCH -p main
      #SBATCH --export=ALL
      #SBATCH -o /home/user/data/stMerge-%A-%a.out
      #SBATCH -e /home/user/data/stMerge-%A-%a.err
      #SBATCH --mail-type=END,FAIL
      #SBATCH --mail-user=user@rutgers.edu
      
      /home/user/bin/stringtie/stringtie --merge -o /scratch/user/data/merged.uniq.gtf -p 8 -G /scratch/user/data/hg19-genes.gtf /scratch/user/data/gtfList.txt
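
Building the gtfList.txt input file is the same find trick we used for the fastq files.  A sketch, wrapped in a hypothetical helper (the path in the usage comment matches the example script above):

```shell
# build_gtf_list DIR: list every per-sample StringTie GTF under DIR
# in DIR/gtfList.txt, for use as the input to stringtie --merge.
build_gtf_list () {
    find "$1" -name "*.uniq.gtf" | sort > "$1/gtfList.txt"
}

# On the cluster:
# build_gtf_list /scratch/user/data
```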
      
    15. The third and final use of StringTie takes this merged GTF file and each BAM file to output quantity measurements. Since we plan to pick up these output files in Ballgown, we’ll send results to a directory named “ballgown” and place each sample’s output in a sub-directory of this.  To tell StringTie to do this step, use the -e and -B flags.  Again, other flags are available; see the StringTie documentation.  See my sample ebStringtie.sh script.
      #!/bin/bash
      
      #SBATCH -J ebStringtie
      #SBATCH -n 1
      #SBATCH -c 8
      #SBATCH -t 3:59:00
      #SBATCH -p main
      #SBATCH --export=ALL
      #SBATCH -o /home/user/data/ebStringtie-%A-%a.out
      #SBATCH -e /home/user/data/ebStringtie-%A-%a.err
      #SBATCH --array=1-15
      #SBATCH --mail-type=END,FAIL
      #SBATCH --mail-user=user@rutgers.edu
      
      
      /home/user/bin/stringtie/stringtie -e -B -p 8 -G /scratch/user/data/merged.uniq.gtf -o /scratch/user/data/aligned"${SLURM_ARRAY_TASK_ID}"/aligned"${SLURM_ARRAY_TASK_ID}".gtf /scratch/user/data/aligned"${SLURM_ARRAY_TASK_ID}".uniq.bam 
      
    16. When this step is complete you can use FileZilla to transfer the entire ballgown directory back to your desktop. Fire up R and load the ballgown package.
    17. Alternatively, you can extract count data and proceed directly to edgeR or DESeq2.  Use the prepDE.py script provided on the StringTie web page.  I’ve successfully loaded the output from this script into DESeq2 for differential analysis and modeling.
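A sketch of that prepDE.py step, wrapped in a hypothetical helper (prepDE.py itself must be downloaded from the StringTie page and be in the current directory or on your PATH):

```shell
# count_matrices DIR: generate gene- and transcript-level count matrices
# (CSV files loadable by edgeR or DESeq2) from the ballgown output tree
# in DIR, using the prepDE.py script from the StringTie web page.
count_matrices () {
    python prepDE.py -i "$1" -g gene_count_matrix.csv -t transcript_count_matrix.csv
}

# For example, from the directory above the ballgown output tree:
# count_matrices ballgown
```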