Kallisto pipeline on a cluster

The Pachter lab recently released Kallisto, a novel strategy for RNAseq data analysis.  In this approach, RNAseq reads are not aligned with genome but instead matched to a KMER index based on the known transcripts for a specific organism.  Reads are only used to distinguish isoforms and count abundance.  This vastly reduces the run time and memory requirements–I found that it runs nicely on my Mac laptop!  However, for large numbers of samples, it is still simpler to take advantage of a parallel computing environment.  Furthermore, since I normally do my own primer filtering and deduplication, I had to find the appropriate methods to use on non-aligned fastq files.

I decided it made sense to deduplicate first.  I found a number of links for programs that can deduplicate fastq files, and even paired-end fastq files.  I settled on using ParDRe, which had a number of features and options that I liked.  The source code is found on Sourceforge.  The README file says to adjust the listing of libraries in the makefile to suit your environment but I found that I only had to load the mvapich2 module (v.2.2) on Perceval to compile without errors.  I also had to load the intel module (v. 16.0.3) to provide the runtime tools (mpirun).  (Don’t forget to load module intel before running the script!) The final executable was copied to my user’s bin directory (~/bin), which is in my standard $PATH.

For the filtering step, I decided to use fastp, which combines a trimmomatic-like trimmer with a quality evaluation output like FastQC.  This time, I cheated and downloaded the Linux binary file directly to my user’s bin directory.

Finally, for Kallisto, I downloaded the Linux archive, compiled, and copied the executable to my bin directory.  As with fastp,you can also use the git clone method.

Now we’re ready to build the SLURM scripts.  In my previous RNAseq pipeline, I had used linked aliases to the actual input files to simplify the use of SLURM arrays.  But I decided to instead use some simple bash commands to build an array of the input file names in the current working directory and then pick one file out with the $SLURM_ARRAY_TASK_ID variable.  This way, all the files retain original sample ID’s and naming conventions.  I should note that my sequencing service lab delivers FastQ file names in the format:  SAMPLEID_R1_001.fastq.gz so my script builds on that format.  If your input format is different you’ll need to adjust the filename editing commands.  Importantly, in each script I have to manually set the index array variable range to begin with 0 (since bash arrays begin with 0) and end at the number of samples minus one.  So, in my example scripts, I specify 0-11 for 12 samples.  You will need to manually adjust this for every set of samples.

I broke the pipeline into three steps so I could check intermediate results and optimize cluster parameters in each SLURM script separately.  You may prefer to combine all steps into one script or chain them together by making subsequent steps dependent on completion of the prior step (see SLURM examples to see how this is done).

The first step is for ParDRe deduplication.  First, use bash to create an array of all the R1 files:

R1=(`ls *R1_dedup_filt.fastq.gz`)

Then choose the current member of the array using the $SLURM_ARRAY_TASK_ID variable.  I combined this with removing the “.gz” suffix.  A second step removes the “.fastq” suffix.


Now we have the appropriate basename in the form “SAMPLEID_R1_001”.  I prepare the output file name using a substitution.


Before proceeding to the ParDre command, I check for the presence of the matching R2 command.  If it’s found, begin the srun command with the paired-end input files and the edited output file names.  Using -n 4 in the srun command specifies 4 nodes–this must match the –nodes directive in the SBATCH section.  Using –mpi=pmi2 tells srun which MPI library to use.  The -z 6 specifies the default gzip compression in the input fastq.gz files.  Finally, the -t 8 option asks for 8 threads per node and again, this must match the SBATCH directives.

if [ ! -f ${fname/R1/R2}.fastq.gz ]
 echo "${fname/R1/R2}.fastq.gz file not found! Stopping!"
 exit 1
 #echo "${fname/R1/R2}.fastq.gz file found."
 srun --mpi=pmi2 -n 4 ~/bin/ParDRe -z 6 \
 -i ${fname}.fastq.gz \
 -p ${fname/R1/R2}.fastq.gz \
 -o ${dedupName}.fastq.gz \
 -r ${dedupName/R1/R2}.fastq.gz \
 -t 8

Here’s a link to the entire script file.

When this is complete, the fastp step is run.  I did some testing and found that a single node per job and 4 threads (as used in the fastp examples) ran about as fast as any other combination.  I again input the R1 files (this time deduplicated versions), check for R2, and run the command.  The html and json files are given sample-specific names.  After fastp runs, I check for successful completion and then delete the intermediate deduplicated files.  Here’s the script file.

The last step uses Kallisto and a transcripts.idx index to output counts for quantitation.  (Note, a good place to download a cDNA FASTA file with which to build an index is at the Ensemble FTP Download page.)   Each sample writes the standard files into a new directory named according to the SAMPLEID portion of the input file name.  Note that the number of bootstraps can easily be increased to the recommended 1000 range.  Here’s the script.

When you’re done, transfer all the sample directories to your desktop and load into sleuth, running in R.  Here’s my steps to load all the data:


#grab sample names from folders under "results"

#grab subfolders containing kallisto output

#assemble samples table manually
s2c=data.frame(sample=sample_id,genotype=c(rep("Flox",3),rep("KO",3),rep("Flox",3),rep("KO",3)),treatment=c(rep("noPMA",6),rep("PMA",6)),path=kal_dirs,stringsAsFactors = F)

In this example I have 12 samples with 3 replicates each of 4 groups, defined by two factors.  Adjust as needed.

Hope this post gets you started using SLURM to run Kallisto.  Naturally, you might choose to change the dedup and/or filtering steps, or leave them out altogether.

Update:  I recently had a dataset where I had to reverse the order of the fastp and ParDRe steps.  It seemed as if ParDRe complained about the fastq files but filtering them with fastp solved the problem.  The logic of doing the adapter trimming before or after the deduplication is a great topic for further discussion.


One thought on “Kallisto pipeline on a cluster”

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.