Blog

Kallisto pipeline on a cluster

The Pachter lab recently released Kallisto, a novel strategy for RNAseq data analysis.  In this approach, RNAseq reads are not aligned to a genome but are instead matched against a k-mer index built from the known transcripts of a specific organism.  Reads are used only to distinguish isoforms and count abundance.  This vastly reduces the run time and memory requirements; I found that it runs nicely on my Mac laptop!  However, for large numbers of samples, it is still simpler to take advantage of a parallel computing environment.  Furthermore, since I normally do my own primer filtering and deduplication, I had to find appropriate methods that work on non-aligned fastq files.

I decided it made sense to deduplicate first.  I found a number of links for programs that can deduplicate fastq files, including paired-end fastq files.  I settled on ParDRe, which had a number of features and options that I liked.  The source code is found on SourceForge.  The README file says to adjust the listing of libraries in the makefile to suit your environment, but I found that I only had to load the mvapich2 module (v. 2.2) on Perceval to compile without errors.  I also had to load the intel module (v. 16.0.3) to provide the runtime tools (mpirun).  (Don’t forget to run “module load intel” before running the script!)  The final executable was copied to my user’s bin directory (~/bin), which is in my standard $PATH.
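
For reference, the build boiled down to a few commands like these (a rough sketch; module names/versions and the makefile output may differ on your cluster):

module load mvapich2 intel   #MPI headers/libraries for compiling; intel provides mpirun
make                         #run in the unpacked ParDRe source directory
cp ParDRe ~/bin/             #put the executable somewhere on your $PATH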

For the filtering step, I decided to use fastp, which combines a trimmomatic-like trimmer with a quality evaluation output like FastQC.  This time, I cheated and downloaded the Linux binary file directly to my user’s bin directory.

Finally, for Kallisto, I downloaded the Linux archive, compiled, and copied the executable to my bin directory.  As with fastp, you can also use the git clone method.

Now we’re ready to build the SLURM scripts.  In my previous RNAseq pipeline, I had used linked aliases to the actual input files to simplify the use of SLURM arrays.  This time I decided to instead use some simple bash commands to build an array of the input file names in the current working directory and then pick one file out with the $SLURM_ARRAY_TASK_ID variable.  This way, all the files retain their original sample IDs and naming conventions.  I should note that my sequencing service lab delivers fastq files named in the format SAMPLEID_R1_001.fastq.gz, so my script builds on that format.  If your input format is different, you’ll need to adjust the filename editing commands.  Importantly, in each script I have to manually set the index array range to begin at 0 (since bash arrays are zero-based) and end at the number of samples minus one.  So, in my example scripts, I specify 0-11 for 12 samples.  You will need to manually adjust this for every set of samples.
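
For example, with 12 samples the array directive in each script would be:

#SBATCH --array=0-11   #12 samples, indexed 0 through 11 to match the bash array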

I broke the pipeline into three steps so I could check intermediate results and optimize cluster parameters in each SLURM script separately.  You may prefer to combine all steps into one script, or chain them together by making each subsequent step dependent on completion of the prior step (a minimal sketch of this follows; see the SLURM documentation on job dependencies for details).
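
Here is a minimal sketch of that chaining approach using SLURM job dependencies (the script names are just placeholders for the three scripts described below):

jid1=$(sbatch --parsable step1_dedup.sh)                              #ParDRe step
jid2=$(sbatch --parsable --dependency=afterok:$jid1 step2_fastp.sh)   #fastp step, waits for step 1
sbatch --dependency=afterok:$jid2 step3_kallisto.sh                   #kallisto step, waits for step 2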

The first step is for ParDRe deduplication.  First, use bash to create an array of all the R1 files:

R1=(`ls *R1_001.fastq.gz`)

Then choose the current member of the array using the $SLURM_ARRAY_TASK_ID variable.  I combined this with removing the “.gz” suffix.  A second step removes the “.fastq” suffix.

fname=${R1[$SLURM_ARRAY_TASK_ID]%\.*}
fname=${fname%\.*}

Now we have the appropriate basename in the form “SAMPLEID_R1_001”.  I prepare the output file name using a substitution.

dedupName=${fname/001/dedup}

Before proceeding to the ParDRe command, I check for the presence of the matching R2 file.  If it’s found, the script runs the mpirun command with the paired-end input files and the edited output file names.  Using -n 4 in the mpirun command launches 4 MPI processes (one per node in this setup); this must match the --nodes directive in the SBATCH section.  The -z 6 option specifies gzip compression level 6 (the gzip default) for the fastq.gz files.  Finally, the -t 8 option asks for 8 threads per node and, again, this must match the SBATCH directives.

if [ ! -f ${fname/R1/R2}.fastq.gz ]
then
 echo "${fname/R1/R2}.fastq.gz file not found! Stopping!"
 exit 1
else
 #echo "${fname/R1/R2}.fastq.gz file found."
 mpirun -n 4 ~/bin/ParDRe -z 6 \
 -i ${fname}.fastq.gz \
 -p ${fname/R1/R2}.fastq.gz \
 -o ${dedupName}.fastq.gz \
 -r ${dedupName/R1/R2}.fastq.gz \
 -t 8
fi
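
For completeness, one way to write the matching resource requests in the SBATCH section would be (a sketch; check your partition’s limits):

#SBATCH --nodes=4            #matches mpirun -n 4
#SBATCH --ntasks-per-node=1  #one MPI process per node
#SBATCH --cpus-per-task=8    #matches the ParDRe -t 8 option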

Here’s a link to the entire script file.

When this is complete, the fastp step is run.  I did some testing and found that a single node per job and 4 threads (as used in the fastp examples) ran about as fast as any other combination.  I again input the R1 files (this time deduplicated versions), check for R2, and run the command.  The html and json files are given sample-specific names.  After fastp runs, I check for successful completion and then delete the intermediate deduplicated files.  Here’s the script file.
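
Stripped of the filename bookkeeping, the core fastp call looks roughly like this (a sketch using the SAMPLEID naming from above; the report file names are illustrative, not taken from the linked script):

fastp -w 4 -i SAMPLEID_R1_dedup.fastq.gz -I SAMPLEID_R2_dedup.fastq.gz \
 -o SAMPLEID_R1_dedup_filt.fastq.gz -O SAMPLEID_R2_dedup_filt.fastq.gz \
 -h SAMPLEID_fastp.html -j SAMPLEID_fastp.json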

The last step uses Kallisto and a transcripts.idx index to output counts for quantitation.  Each sample writes the standard files into a new directory named according to the SAMPLEID portion of the input file name.  Note that the number of bootstraps can easily be increased to the recommended 1000 range.  Here’s the script.
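
The quantification itself is a standard kallisto quant call; roughly (a sketch, with illustrative bootstrap and thread counts):

kallisto quant -i transcripts.idx -o results/SAMPLEID -b 100 -t 8 \
 SAMPLEID_R1_dedup_filt.fastq.gz SAMPLEID_R2_dedup_filt.fastq.gz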

When you’re done, transfer all the sample directories to your desktop and load them into sleuth, running in R.  Here are my steps to load all the data:

library(sleuth)

#grab sample names from folders under "results"
sample_id=dir(file.path(".","results"))

#grab subfolders containing kallisto output
kal_dirs=file.path(".","results",sample_id)

#assemble samples table manually
s2c=data.frame(sample=sample_id,
               genotype=c(rep("Flox",3),rep("KO",3),rep("Flox",3),rep("KO",3)),
               treatment=c(rep("noPMA",6),rep("PMA",6)),
               path=kal_dirs,
               stringsAsFactors = F)

In this example I have 12 samples with 3 replicates each of 4 groups, defined by two factors.  Adjust as needed.

Hope this post gets you started using SLURM to run Kallisto.  Naturally, you might choose to change the dedup and/or filtering steps, or leave them out altogether.


Bowtie2 on Perceval

Following up on my previous post about using HISAT/StringTie on Perceval, I thought it would be useful to include a post about a simple Bowtie2 alignment, such as one for ChIP-seq.

Modules required

Perceval requires that specific software modules be loaded in the user’s environment prior to submitting jobs to the cluster.  For this pipeline, you’ll need bowtie2, samtools, and java.

module load bowtie2
module load samtools
module load java

(or)

module load bowtie2 samtools java

SLURM Arrays

When running many parallel alignments it’s convenient to take advantage of SLURM’s array capability.  Before starting, make sure your input fastq files are named so that the names are identical except for an index number, from 1 to the number of samples.  In this case, all my files are named AE1.fastq.gz through AE6.fastq.gz.

Steps

  1. Prepare input fastq files.  I like to put all this in one subdirectory named “fastq”.  If multiple lanes were used, you can merely concatenate the gzipped files like this:
    cat AE1_AACCAG.R1.L001.fastq.gz AE1_AACCAG.R1.L002.fastq.gz > AE1.fastq.gz
  2. Edit a script to run the bowtie2 command.  Save this into a file named something like “arrayBowtie2.sh” Here’s my example:
    #!/bin/bash
    #SBATCH -J bt2 #jobname (optional)
    #SBATCH -n 1 #nodes
    #SBATCH -c 8 #cpu threads per node
    #SBATCH -t 3:59:00 #estimated run time
    #SBATCH -p main #partition name
    #SBATCH --export=ALL
    #SBATCH -o /home/user/data/output/bt2-%A-%a.out #where to put output if any
    #SBATCH -e /home/user/data/output/bt2-%A-%a.err #error output
    #SBATCH --array=1-6 #I have 6 samples
    #SBATCH --mail-type=END,FAIL  #send an email on end or if failure
    #SBATCH --mail-user=me@a.com #where to send email
    
    #align using default parameters with unpaired reads
    bowtie2 -p 8 -x /home/user/genomes/bowtie2/mm10 \
    -U /home/user/data/fastq/AE"${SLURM_ARRAY_TASK_ID}".fastq.gz \
    -S /home/user/data/results/AE"${SLURM_ARRAY_TASK_ID}".sam
    
    #note - I split the command with continue symbols ("\") but I normally don't
    #do this in my scripts
  3. Submit the sbatch command like this:
    sbatch arrayBowtie2.sh
  4. You can check progress with:
    squeue -u user
  5. Next you’ll want to run samtools to convert sam to bam, sort, and index the files.  Create a script file named something like “arraySamtools.sh”.  Submit using sbatch as before.  Here’s an example:
    #!/bin/bash
    #SBATCH -J samtools
    #SBATCH -n 1
    #SBATCH -t 3:59:00
    #SBATCH -p main
    #SBATCH --export=ALL
    #SBATCH -o /home/user/data/output/samtools-%A-%a.out
    #SBATCH -e /home/user/data/output/samtools-%A-%a.err
    #SBATCH --array=1-6 
    #SBATCH --mail-type=END,FAIL  
    #SBATCH --mail-user=me@a.com 
    
    #convert sam to bam
    samtools view -b -S /home/user/data/results/AE"${SLURM_ARRAY_TASK_ID}".sam > \
    /home/user/data/results/AE"${SLURM_ARRAY_TASK_ID}".bam
    
    #remove unneeded sam
    rm /home/user/data/results/AE"${SLURM_ARRAY_TASK_ID}".sam
    
    #sort using scratch disk space (faster) for work files
    samtools sort -T /scratch/user/xxx"${SLURM_ARRAY_TASK_ID}" \
    -o /home/user/data/results/AE"${SLURM_ARRAY_TASK_ID}".sorted.bam \
    /home/user/data/results/AE"${SLURM_ARRAY_TASK_ID}".bam
    
    #remove unneeded pre-sort bam files
    rm /home/user/data/results/AE"${SLURM_ARRAY_TASK_ID}".bam
    
    #index the new sorted files
    samtools index /home/user/data/results/AE"${SLURM_ARRAY_TASK_ID}".sorted.bam
  6. Now you’ll have sorted and indexed bam files.  For most applications, it’s helpful to remove PCR duplicates.  Fire up your editor and create a script file for running Picard.  Submit using sbatch as before.  Here’s an example:
    #!/bin/bash
    #SBATCH -J picard
    #SBATCH -n 1
    #SBATCH -t 3:59:00
    #SBATCH -p main
    #SBATCH --export=ALL
    #SBATCH -o /home/user/data/output/picard-%A-%a.out
    #SBATCH -e /home/user/data/output/picard-%A-%a.err
    #SBATCH --array=1-6 
    #SBATCH --mail-type=END,FAIL  
    #SBATCH --mail-user=me@a.com 
    
    #use java to run Picard's MarkDuplicates function
    java -Xmx4g -jar /home/user/bin/picard.jar MarkDuplicates \
    INPUT=/home/user/data/results/AE"${SLURM_ARRAY_TASK_ID}".sorted.bam \
    OUTPUT=/home/user/data/results/AE"${SLURM_ARRAY_TASK_ID}".uniq.bam \
    METRICS_FILE=/home/user/data/results/AE"${SLURM_ARRAY_TASK_ID}".metrics \
    REMOVE_DUPLICATES=true ASSUME_SORTED=true
    
    #index
    samtools index /home/user/data/results/AE"${SLURM_ARRAY_TASK_ID}".uniq.bam
  7. Now you’ll have deduplicated (“uniq”) versions and bai indexes.  I usually retain the original sorted bams as well as the deduplicated ones just in case.  Now you’re ready to start peak finding.


Modifying cummeRbund Plots

Don’t know why it took me so long but I realized that the RNAseq plots in cummeRbund are merely ggplot2 objects that can be modified.  Here’s an example of how you can clean up and alter the appearance of cummeRbund plots.

Say I have a list of genes and I want to draw an expression barplot for each gene and each sample.  There’s a nice command in cummeRbund, expressionBarplot(), that does this for you.  You can even feed as input a getGenes() command to extract only a subset of genes.  So if I have a vector of gene symbols named ctxList and a vector giving the order I want the samples displayed, named sampleIdList, I can issue this command:

expressionBarplot(getGenes(cuff,ctxList,sampleIdList=sampOrder))

I get this plot:

[Figure: rplotexample1, the default expressionBarplot output]

But I want to fix a number of things.  I don’t like the default grey background.  I want to control how the error bars are drawn; in this case I want them based on the confidence interval.  And I want to clean up the colors and fonts.  So I wrote this function in R:

myPlotStyle=function(p, leg=T){
  if(leg){
    pos=c(.9,.8)
  } else {
    pos="none"
  }
  p+theme(axis.title.y=element_text(family="Arial",size=18),
          axis.text.y=element_text(family="Arial",size=16),
          axis.line.y=element_line(color="black",size=1),
          axis.text.x=element_text(family="Arial",size=18,color="black",vjust=.5),
          axis.line.x=element_line(color="black",size=1),
          axis.ticks.x=element_blank(),
          legend.title=element_blank(),
          legend.text=element_text(family="Arial",size=18),
          legend.position=pos,
          panel.background=element_rect(fill="white"))+
    scale_fill_brewer(palette="Set1")+
    geom_errorbar(mapping=aes(ymin=conf_lo,ymax=conf_hi),size=.2,na.rm = F,
                  width=.5,stat="identity",position=position_dodge(width=.9))
}

I wanted:

  • An option whether to include a legend or not, and to place it in the upper right corner if present
  • The ability to use the system font “Arial” (this requires the library extrafont, and I’ve already imported Arial into R before this step)
  • Set sizes of font and choices to exclude ticks and legend title
  • Use a white background
  • Bars colored using the color brewer palette
  • Finally, errorbars based on confidence intervals.  I figured out how the expressionBarplot function stored alternate variation metrics and picked the one I wanted.  There are others if you look. For example, you could use standard deviation by specifying fpkm-stdev and fpkm+stdev for ymin and ymax in the geom_errorbar() function

After loading the function into your environment, just re-issue your barplot command like this:

myPlotStyle(p=expressionBarplot(getGenes(cuff,ctxList,sampleIdList=sampOrder),
 showErrorbars=F),leg=T)

[Figure: rplotexample2, the customized expressionBarplot output]

This gives me something publishable.  I tell expressionBarplot not to plot error bars and add my own.  I can easily re-use the function for other plots with other lists of genes.  The colors are much more vibrant than the defaults.

The data are taken from Oni et al., 2016 — See Supplemental Figure 3.

Class Attendance Experiment

On Monday, I conducted a simple experiment in my Advanced Cell Biology class.  My hypothesis was that the sub-population of students who attend class is likely to score higher on exams.  To test my hypothesis, I looked up the frequency of scoring an “A” on Exam 1.  53 students scored an “A” out of 117 enrolled (yes, that seems high, but our CBN majors are really smart!).  During class I took a survey using PollEverywhere (below).  16 out of 26 students who responded reported that they scored an “A” on the same exam.  More than 26 students attended the class, but I figured that 26 was a reasonable sample of those who attended.  What’s the probability that this happened by chance alone?

[Image: PollEverywhere survey of Exam 1 grades among class attendees]

This is an example of the classic “marble and urn” problem.  You have an urn containing 100 marbles–80 of them are black and 20 of them are white.  If you randomly draw 10 marbles from the urn, what’s the chance of drawing 8 black and 2 white?  Or of drawing 3 black and 7 white?  The probability can be calculated based on the hypergeometric distribution.

The “null hypothesis” is that the students attending class are a random subset of all those enrolled in the class (that is, no particular enrichment of A’s).  Among my sample of 26 votes, I counted 16 A’s.  The total contents of the “urn” were 53 A’s and 117 - 53 = 64 grades that were not an A.

I calculated this using the dhyper function in the statistical programming environment R.  Here’s the command:

> sum(dhyper(16:26,53,117-53,26))
[1] 0.04829361

The arguments are:

  • 16:26 (the number of sampled students who got an A up to the total number sampled; I use the range to find out the probability of finding 16 or more A students in a sample of 26)
  • 53 (the total number of students in the class who got an A)
  • 117 – 53 (the total number of students in the class who didn’t get an A)
  • 26 (the sample size)

The result says that, under the hypergeometric distribution, there is a 0.048 probability of drawing 16 or more A’s in a sample of 26.  So, at a threshold of p<0.05, I reject the null hypothesis and conclude that students attending class are more likely to score higher on exams.  This is not cause-and-effect proof that attending class will get you a higher grade.  But it may be worth considering!  And certainly, if you did not get an A and want to do better, it might help to attend class!

Tuxedo2 Pipeline on Perceval

The new Tuxedo RNAseq pipeline (HISAT2->StringTie->Ballgown) can be set up to take good advantage of the parallelization available on the Perceval high performance computing cluster.  Several steps in the process are run independently for each RNA sample.  This document gives an example of one efficient method for using the power of SLURM arrays to run these parallel steps with a set of simple, generalizable scripts.

Recent presentation covering HPC and basic SLURM steps.

Required Software

The Perceval cluster has a pre-built HISAT2 module that is up to date (at this writing).  Samtools is also available as a module.  However, the Picard java package and the StringTie executable must be downloaded and installed within the user’s account.  This can be done by creating a ~/bin directory, placing desired executables in that location, and adding it to your PATH.
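
Setting that up only takes a couple of shell commands (a sketch, assuming a bash login shell):

mkdir -p ~/bin                                    #create the personal bin directory
# copy picard.jar and the stringtie folder here, then add it to your PATH:
echo 'export PATH=$HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc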

For this workflow, we’ll take advantage of SLURM arrays.  A script is created containing the appropriate commands, with #SBATCH directives specifying the resources required for each job.  The array feature uses a variable to specify a range of input files.  Given these components, SLURM will create a separate cluster job for each member of the array.  Using this feature you can write a single script file to execute as many jobs as you have samples.

On the Perceval cluster, use the /scratch/user space for all data and transfers within your jobs.  This means loading your raw fastq files here, saving a genome index file here, and saving output here.  The /scratch space is much faster than /home but it is not backed up so be sure to copy your data off the cluster when you’re done (or at least to your /home space).
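
For example, when a project wraps up you might pull the results back with something like this (paths are illustrative):

rsync -av /scratch/user/data/results/ /home/user/data/results/   #or rsync/scp to your desktop machine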

One important note that is specific to the Perceval cluster:  modules must be loaded while you are logged into the head node, before submitting jobs to the cluster.  So, before you start, make sure you run “module load HISAT2” and “module load samtools”; see the Perceval Quick Start Guide for more information on modules and versions.
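
In other words, before submitting anything:

module load HISAT2
module load samtools
module load java      #needed later for the Picard step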

Steps

The trick to using the array feature of SLURM is to have a set of filenames with an easily-coded numeric value that can be referenced by the script.  My strategy is as follows:

    1.  Transfer fastq files to Perceval.  From Windows this is simply accomplished using FileZilla.  From Linux or Mac, use rsync or scp for an SSH-based file transfer.  Store all fastq files in a convenient location—I always use a directory named “fastq”.
    2. Extract the names of every fastq file into a text file. For example:
       find $(pwd) -name "*.fastq.gz" > fastq.txt

      The $(pwd) is used to store the full path to each file.  Not strictly needed, but often convenient.

    3. Transfer the contents of the fastq.txt file to Excel. This is a simple way to create a custom ln command for each file.  There are lots of easy ways to do this (shell script, python)—choose your favorite.  See the attached Excel file and the resulting shell script for examples.  The result is a shell script with a single ln command on each line, one for each fastq file.
    4. Run the ln script from the directory where you want the links to be—the fastq folder for example. This creates a soft link for each fastq file with a numbered name.
    5. Now you’re ready to run HISAT2. See the enclosed sbatch script named “arrayHisat.sh” (below).  Important parts of the script:
      -J – job name, not required (default is the name of the script)
      -c 8 – this tells SLURM how many threads your program will run.  It should match the -p value in the hisat2 command line.
      --array=1-15 – this is the key element for setting up the array of commands.  In my case I want to run 15 individual commands, replacing the “${SLURM_ARRAY_TASK_ID}” variable with each number in the range.
      --mail-type/--mail-user – if you use these two directives the system will email you when the job is done.  I generally issue the execution command, then log out and wait for the email.  If you don’t want this, just delete these two lines.

      #!/bin/bash
      
      #SBATCH -J hisat #jobname (optional) 
      #SBATCH -n 1 #nodes
      #SBATCH -c 8 #cpu threads per node
      #SBATCH -t 3:59:00 #estimated run time
      #SBATCH -p main #partition--just use main
      #SBATCH --export=ALL
      #SBATCH -o /home/user/output/hisat-%A-%a.out #where to put output if any
      #SBATCH -e /home/user/output/hisat-%A-%a.err #where to put error output
      #SBATCH --array=1-15  #range of numbers to use for the array.  In my case I have 15 samples.
      #SBATCH --mail-type=END,FAIL  #send an email on end or if failure
      #SBATCH --mail-user=user@rutgers.edu #where to send email
      
      hisat2 -t -p 8 --known-splicesite-infile /scratch/user/data/hg19-splicesites.txt -x /scratch/user/genomes/hisat2/hg19/genome -1  /scratch/user/data/lfastq/fastq"${SLURM_ARRAY_TASK_ID}"_R1.fastq.gz -2 /scratch/user/data/lfastq/fastq"${SLURM_ARRAY_TASK_ID}"_R2.fastq.gz -S /scratch/user/data/aligned"${SLURM_ARRAY_TASK_ID}".sam
    6. You may wish to edit the hisat2 command line to suit your preferences. Here’s the key features that I chose:
      -t – Report execution time in output.  Not needed for anything. I just like to know.
      -p 8 – Use 8 threads per node.  I haven’t tested more or fewer threads.  See the Perceval partitions to see how many CPUs per node you can use.
      --known-splicesite-infile – This is optional, but I think you’ll probably want to use this feature.  It allows you to load a text file containing every known splice site from the reference GTF file for your genome assembly.  Unlike TopHat, using this reference splice map is useful whether you wish to search for novel transcripts or not.  You create this file using the hisat2_extract_splice_sites.py python script that comes with HISAT2.  See the HISAT2 documentation.
      -x – location of the index files.  For most genomes there are pre-built indices on the HISAT2 website.  Download them to Perceval and store them in your scratch space.
      -1 and -2 point to the names of the paired-end fastq.gz files (R1 and R2, respectively).  Since we’ve previously created soft links varying only by the sample number, we can specify the full range of names using the “${SLURM_ARRAY_TASK_ID}” variable.
      -S – location and name for the output SAM file created by HISAT2.
    7. If you want to use HISAT2 with Cufflinks instead of StringTie, add this option to your command:
      --dta-cufflinks
    8. Store the arrayHisat.sh script someplace convenient. Since I modify these scripts for every experiment I don’t put them in my ~/bin directory.  I keep a “scripts” directory just below my working directory for the project.
    9. Before running, remember to load the HISAT2 module with the “module load HISAT2” command.  Then issue the command to send the script to SLURM:
       sbatch scripts/arrayHisat.sh
    10. You can monitor progress with:
       squeue -u <username>

      or

       squeue -j <jobnumber>
    11. When HISAT2 is complete for every sample, you should have one SAM file for each sample (or pair of fastq files).  These need to be converted to BAM files, sorted, and indexed.  Use the arraySamtools.sh script (below).  Similar in elements to the arrayHisat.sh script, the SBATCH directives differ only in the job name and the elimination of the -c flag, because these samtools commands run single-threaded.  Each of the three steps is run as a separate command inside the same script.  Again, array-numbered file names are used to keep each sample separate.  New to samtools 1.3.1 is the requirement to specify the temp file name.  This is important to keep individual sample temp files separate, so use the sample number (from the array variable) in each name.  Use sbatch to issue the script to SLURM and wait for the job to be completed.  When you’re satisfied that the sorted, indexed BAM files have been created you can delete the SAM files and the original, unsorted BAM files.  Don’t forget to “module load samtools” before submitting this.
      #!/bin/bash
      
      #SBATCH -J samtools
      #SBATCH -n 1
      #SBATCH -t 3:59:00
      #SBATCH -p main
      #SBATCH --export=ALL
      #SBATCH -o /home/user/data/output/samtools-%A-%a.out
      #SBATCH -e /home/user/data/output/samtools-%A-%a.err
      #SBATCH --array=1-15
      #SBATCH --mail-type=END,FAIL
      #SBATCH --mail-user=user@rutgers.edu
      
      samtools view -b -S /scratch/user/data/results/aligned"${SLURM_ARRAY_TASK_ID}".sam > /scratch/user/data/results/aligned"${SLURM_ARRAY_TASK_ID}".bam
      samtools sort -T /scratch/user/xxx"${SLURM_ARRAY_TASK_ID}" -o /scratch/user/data/results/aligned"${SLURM_ARRAY_TASK_ID}".sorted.bam /scratch/user/data/results/aligned"${SLURM_ARRAY_TASK_ID}".bam 
      samtools index /scratch/user/data/results/aligned"${SLURM_ARRAY_TASK_ID}".sorted.bam
      
    12. In some cases, if there is a concern that the RNAseq library has been over-amplified, you can eliminate duplicate reads using the Picard java package. See the example arrayPicard.sh script.  Script elements are similar to previous ones.  Don’t forget to “module load java” if you haven’t already before running this script.  I run this step as a separate script because it is not always part of the standard pipeline, but it could easily be added to the previous script.
      #!/bin/bash
      
      #SBATCH -J picard
      #SBATCH -n 1
      #SBATCH -t 3:59:00
      #SBATCH -p main
      #SBATCH --export=ALL
      #SBATCH -o /home/user/data/picard-%A-%a.out
      #SBATCH -e /home/user/data/picard-%A-%a.err
      #SBATCH --array=1-15
      #SBATCH --mail-type=END,FAIL
      #SBATCH --mail-user=user@rutgers.edu
      
      java -Xmx4g -jar /home/user/bin/picard.jar MarkDuplicates INPUT=/scratch/user/data/aligned"${SLURM_ARRAY_TASK_ID}".sorted.bam OUTPUT=/scratch/user/data/aligned"${SLURM_ARRAY_TASK_ID}".uniq.bam METRICS_FILE=/scratch/user/data/aligned"${SLURM_ARRAY_TASK_ID}".metrics REMOVE_DUPLICATES=true ASSUME_SORTED=true
      
    13. Now we’re ready to assemble transcripts with StringTie. In this first instance of StringTie, each of the BAM files (in this case the sorted, duplicates-removed versions) is used, along with a reference GTF file, to assemble aligned reads into transcripts.  This is done whether you intend to find novel transcripts or not.  The arrayStringtie.sh script includes the standard form of the command—you may choose other flags as needed.  In this example, I’ve specified 8 CPUs per job (and this must match the SBATCH -c 8 directive).  The output from this step is a GTF file derived from the sample-specific alignments.
      #!/bin/bash
      
      #SBATCH -J stringtie
      #SBATCH -n 1
      #SBATCH -c 8
      #SBATCH -t 3:59:00
      #SBATCH -p main
      #SBATCH --export=ALL
      #SBATCH -o /home/user/data/stringtie-%A-%a.out
      #SBATCH -e /home/user/data/stringtie-%A-%a.err
      #SBATCH --array=1-15
      #SBATCH --mail-type=END,FAIL
      #SBATCH --mail-user=user@rutgers.edu
      
      /home/user/bin/stringtie/stringtie /scratch/user/data/aligned"${SLURM_ARRAY_TASK_ID}".uniq.bam -o /scratch/user/data/aligned"${SLURM_ARRAY_TASK_ID}".uniq.gtf -p 8 -G /scratch/user/data/hg19-genes.gtf 
      
    14. Now we re-run StringTie with the --merge flag. This is only done once for each experiment, not once for each sample, so formally it doesn’t need to be run on the cluster, but it’s convenient to use the same computing power and file locations in a simple script.  My sample is named stMerge.sh.  Create a text file with the names of all the custom GTF files and use this as input (use the find command as we did for the fastq files above).  Output is a single, merged GTF file.
      #!/bin/bash
      
      #SBATCH -J stMerge
      #SBATCH -n 1
      #SBATCH -c 8
      #SBATCH -t 3:59:00
      #SBATCH -p main
      #SBATCH --export=ALL
      #SBATCH -o /home/user/data/stMerge-%A-%a.out
      #SBATCH -e /home/user/data/stMerge-%A-%a.err
      #SBATCH --mail-type=END,FAIL
      #SBATCH --mail-user=user@rutgers.edu
      
      /home/user/bin/stringtie/stringtie --merge -o /scratch/user/data/merged.uniq.gtf -p 8 -G /scratch/user/data/hg19-genes.gtf /scratch/user/data/gtfList.txt
      
    15. The third and final use of StringTie takes this merged GTF file and each BAM file to output quantity measurements. Since we plan to pick up these output files in Ballgown, we’ll send results to a directory named “ballgown” and place each sample’s output in a sub-directory of this.  To tell StringTie to do this step, use the -e and -B flags.  Again, other flags are available—see the StringTie documentation.  See my sample ebStringtie.sh script.
      #!/bin/bash
      
      #SBATCH -J ebStringtie
      #SBATCH -n 1
      #SBATCH -c 8
      #SBATCH -t 3:59:00
      #SBATCH -p main
      #SBATCH --export=ALL
      #SBATCH -o /home/user/data/ebStringtie-%A-%a.out
      #SBATCH -e /home/user/data/ebStringtie-%A-%a.err
      #SBATCH --array=1-15
      #SBATCH --mail-type=END,FAIL
      #SBATCH --mail-user=user@rutgers.edu
      
      
      /home/user/bin/stringtie/stringtie -e -B -p 8 -G /scratch/user/data/merged.uniq.gtf -o /scratch/user/data/aligned"${SLURM_ARRAY_TASK_ID}"/aligned"${SLURM_ARRAY_TASK_ID}".gtf /scratch/user/data/aligned"${SLURM_ARRAY_TASK_ID}".uniq.bam 
      
    16. When this step is complete you can use FileZilla to transfer the entire ballgown directory back to your desktop. Fire up R and load the ballgown package.
    17. Alternatively, you can extract count data and proceed directly to edgeR or DESeq2.  Use the prepDE.py script provided on the StringTie web page.  I’ve successfully loaded the output from this script into DESeq2 for differential analysis and modeling.

Bioinformatics in epigenetic studies of alcohol abuse

We reviewed studies of alcohol-regulated epigenetic changes in neural stem cells as a model of fetal alcohol syndrome.  Our focus was on the technology used to assess changes in DNA methylation, including the analysis of data.  Based on advances in methods, we recommend whole-genome or targeted (sequence-selected) bisulfite sequencing.  Data analysis methods have evolved but mainly focus on comparing differentially methylated regions.

The article appeared in the October 2016 issue of Current Pharmacology Reports.  For some reason it has not yet made it into PubMed.

Modeling nicotine addiction with stem cells

Today our new paper on nicotine addiction appeared in Scientific Reports.  We found that excitatory neurons made from subjects carrying the rs16969968 SNP variant in the CHRNA5 gene (encoding nAChR α5 subunit) exhibited an increased response to lower doses of nicotine that rapidly desensitized as the concentration of nicotine was increased.  We used induced pluripotent stem cells made from human lymphocytes stored in a genetics repository.  Our results are consistent with the idea that people carrying this risk variant of CHRNA5 (called N398) may respond more strongly to initial exposure to nicotine but then may have a harder time reaching this same response with subsequent doses.

The initial version on the Scientific Reports website used an old version of the Supplemental Materials.  Here is the updated version, including active links to supporting Supplemental Table 3 and Supplemental Table 4.

RNAseq data for this publication are available on GEO under accession number GSE56398.