A previous post provided a step-by-step example for setting up a singularity container for use on the HPC (in my case, Perceval). Here we’ll see how to build a more complex singularity recipe, create a distributable container, and use it to run a few steps of Seurat as an Rscript batch file. This approach lets you run whatever version of R and its packages you need on the HPC in a secure container, and frees you from the limitations of running R on your desktop.
A key feature of singularity is the control of user-level access and security. This means that you don’t need to have any root privileges on the HPC. However, you do need to be root to build your container on a local machine. To summarize:
- All steps to build the container using the recipe file and any manual additions need to be run as root (sudo) on your local system.
- Running the container on the HPC cannot be done as root. You will use only your assigned username permissions.
The Container Recipe
I downloaded a few sample recipe files from the Singularity Hub, including one that did a nice job of setting up for running R or Rscript. You can download my recipe file from this link. I’ll highlight some of the sections to explain how each works.
Bootstrap: docker
From: ubuntu:16.04
IncludeCmd: yes
This header tells singularity to bootstrap from the Docker repository, downloading Ubuntu Linux v. 16.04 (Xenial), which is quite stable.
Next is the %environment section, which I won’t cover here; some of the lines I just copied as-is from the downloaded sample file.
The %labels section seems to be somewhat arbitrary but indicates labels useful for tracking your versions, at least.
Next are %apprun and %runscript. After you’ve built the container, these describe the behavior you get from “singularity run” (which runs the %runscript command) or “singularity run --app Rscript” (which runs the %apprun Rscript command). Here are the lines from the file:
%apprun R
    exec R "$@"

%apprun Rscript
    exec Rscript "$@"

%runscript
    exec R "$@"
In each of these, “$@” expands to all of the arguments passed on the singularity run command line, so they are handed on to R or Rscript. More about that later.
The final section is the most important and useful one. The %post section is run after you’ve bootstrapped the container with Ubuntu loaded. Here you issue standard Ubuntu commands to load all the packages you’ll need. First the system packages:
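To see what "$@" does outside of singularity, here is a minimal shell sketch. The function run_app is hypothetical and only stands in for an %apprun line; printf stands in for Rscript so that each forwarded argument is visible on its own line.

```shell
#!/bin/sh
# Hypothetical stand-in for an %apprun section: forward every argument, unchanged.
run_app() {
    # "$@" expands to each original argument as a separate, quoted word,
    # so an argument containing spaces stays a single argument.
    printf '[%s]\n' "$@"
}

run_app /mnt/containers/autoSeurat.R "two words"
```

Each bracketed line in the output corresponds to one argument as the wrapped program would receive it.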
%post
    apt-get update
    apt-get install -y apt-transport-https apt-utils software-properties-common

    # add CRAN/Ubuntu repo, add key, then refresh
    apt-add-repository "deb https://cloud.r-project.org/bin/linux/ubuntu xenial/"
    apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 51716619E084DAB9
    apt-get update

    apt-get install -y wget nano
    apt-get install -y libblas3 libblas-dev liblapack-dev liblapack3 curl
    apt-get install -y gcc fort77 aptitude
    aptitude install -y g++
    aptitude install -y xorg-dev
    aptitude install -y libreadline-dev
    aptitude install -y gfortran
    gfortran --version
    apt-get install -y libssl-dev libxml2-dev libpcre3-dev liblzma-dev libbz2-dev libcurl4-openssl-dev
    apt-get install -y libhdf5-dev hdf5-helpers libmariadb-client-lgpl-dev
    apt-get install -y r-base r-base-dev
    R --version

    # installing packages from cran
    R --slave -e 'install.packages("devtools",repos="https://cran.rstudio.com/")'
    R --slave -e 'install.packages("dplyr",repos="https://cran.rstudio.com/")'
    R --slave -e 'install.packages("rhdr5",repos="https://cran.rstudio.com/")'
    R --slave -e 'install.packages("Seurat",repos="https://cran.rstudio.com/")'

    # installing from bioc
    R --slave -e 'source("https://bioconductor.org/biocLite.R"); biocLite("pachterlab/sleuth")'
    R --slave -e 'source("https://bioconductor.org/biocLite.R"); biocLite("cummeRbund")'

    # installing from 10xgenomics repo
    R --slave -e 'source("http://cf.10xgenomics.com/supp/cell-exp/rkit-install-2.0.0.R") '
It took me some work to figure out which steps were required and the correct order of execution.
- Start by refreshing the apt-get data from the pre-installed repositories.
- Because we will add a new repository with a secure connection (https), we need to grab the secure transport tool. We’ll also need to install some utils and common tools so that we can add repositories and keys from the command line.
- Once those are installed, we can add the CRAN/Ubuntu repository using a command (instead of editing it into the /etc/apt/sources.list system file). After that, add the public key for this repository. Then refresh the repositories.
- Next we need several packages that are required for the R installation (which requires compilation) and some additional packages required by the specific R packages that we’ll load later. The wget and nano packages are included for convenience but aren’t really needed. The gfortran --version line is just a safety check that Fortran got installed correctly; it would throw an error if the install didn’t work. We will need the hdf5 and mariadb packages for Seurat and cummeRbund, respectively. If you don’t want all those packages you can delete some of these.
- The last apt-get installs R.
- Next we’ll use several steps to install packages in R. Using --slave eliminates most unneeded output. The -e flag says to execute the string in quotes from the command line. I specified which repos to use just in case.
- After some CRAN installs, I switch to biocLite() from Bioconductor for some packages. Note that you need to source (download) biocLite() each time you run an R --slave command.
- Finally, I use the 10XGenomics repository to install their cellrangerRkit package.
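One caveat with the install lines above: install.packages() reports a failed installation as a warning rather than an error, so the build can finish even when a package is missing. In the same spirit as the gfortran --version check, here is a minimal sketch of a load test. The function name require_r_pkg and the R_CMD override are assumptions made for illustration; they are not part of the recipe file.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the named R package can be loaded.
# R_CMD is an assumed override so a different R binary can be substituted.
require_r_pkg() {
    # library() raises an error for a missing package, and a non-interactive
    # R session exits with a non-zero status when an error occurs.
    "${R_CMD:-R}" --slave -e "library($1)" || {
        echo "R package '$1' failed to load" >&2
        return 1
    }
}
```

Inside %post, a line such as `require_r_pkg Seurat || exit 1` placed after the install commands should stop the build early if the package did not install.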
With this recipe, you can first create a sandbox to test it (if you wish), using this command on a local Linux system:
sudo singularity build --sandbox biocBox/ Singularity.bioconductor-make.txt
This takes a while to run on my Linux desktop, maybe ~30 minutes. When it’s done you can sudo singularity shell into the sandbox and check it. If everything works, go ahead and convert to a non-writable image file:
sudo singularity build bioconductorBox.simg biocBox/
This produced an image file that was only 569M! Transfer this to your space on the HPC. I used a “containers” folder under my /scratch space:
scp bioconductorBox.simg firstname.lastname@example.org:/scratch/user/containers/.
We’ll need a file containing R commands that can run non-interactively. That is, if you’re used to working in RStudio (as I am) you need to consider every step to make sure it will work without you typing anything. The most important part is to capture and explicitly save any graphics output into files.
Another important, and possibly confusing, part is where I specify the working directory. Since I plan to mount my /scratch/user under the /mnt mountpoint, the location that is actually /scratch/user/sample1 needs to be called as /mnt/sample1 when running inside the container.
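The path translation can be sketched in a few lines of shell. The helper to_container_path and the two variables are hypothetical names; the values simply mirror a bind of /scratch/user onto /mnt.

```shell
#!/bin/sh
# Host directory and container mount point, mirroring a /scratch/user:/mnt bind.
HOST_ROOT=/scratch/user
MOUNT_POINT=/mnt

# Hypothetical helper: rewrite a host path into the path seen inside the container.
to_container_path() {
    case "$1" in
        "$HOST_ROOT")
            printf '%s\n' "$MOUNT_POINT" ;;
        "$HOST_ROOT"/*)
            # Strip the host prefix and put the mount point in its place.
            rest=${1#"$HOST_ROOT"}
            printf '%s%s\n' "$MOUNT_POINT" "$rest" ;;
        *)
            echo "not under $HOST_ROOT: $1" >&2
            return 1 ;;
    esac
}

to_container_path /scratch/user/sample1
to_container_path /scratch/user/containers/autoSeurat.R
```

The two calls print /mnt/sample1 and /mnt/containers/autoSeurat.R, which are the paths to use in the R script and on the singularity command line.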
I played around with using ggplot() for the ggplot2 objects but I found that the more basic pdf(), followed by the plot command, followed by dev.off() works more consistently.
Any text output will be collected into files that will be specified in my SLURM script.
At the end, I explicitly save the workspace into a standard .RData file. This allows me to download the file to my desktop, open it with RStudio, and continue data exploration. All the data are loaded, normalized and scaled. When multiple samples are combined, you should consider also deleting (rm()) all the original single-sample objects to make a smaller workspace.
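The points above can be sketched without any Seurat calls. The shell snippet below writes a deliberately tiny, hypothetical R script (base R only) that sets the in-container working directory, writes a plot with pdf() and dev.off(), and saves the workspace. The file names batch_sketch.R, example_plot.pdf and sample1.RData are placeholders chosen for illustration.

```shell
#!/bin/sh
# Write a hypothetical, minimal batch R script; nothing in it calls Seurat.
outfile="${TMPDIR:-/tmp}/batch_sketch.R"

cat > "$outfile" <<'EOF'
# Working directory as seen inside the container (/scratch/user/sample1 on the host)
setwd("/mnt/sample1")

# Open a PDF device, draw, then close the device so the file is written
pdf("example_plot.pdf")
plot(1:10, main = "placeholder plot")
dev.off()

# Save every object in the workspace for later use in RStudio
save.image(file = "sample1.RData")
EOF

echo "wrote $outfile"
```

A real analysis script would replace the plot() line with its own plotting calls, keeping each one wrapped between pdf() and dev.off().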
Here comes the best part: running the container with your data. It’s possible to use singularity shell or run from the command line if you request a compute node; this allows you to type commands as you go. But this post is about batch processing. The final SLURM script can be downloaded here.
Because we’ll be running Seurat, which is currently single-threaded, we only ask for one node, one task, but a good amount (64 GB) of RAM. Remember to give the script a reasonable execution time estimate. The STDOUT and STDERR output are redirected to files for later review. Finally, it’s not necessary but I like to get an email when it’s done.
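As a rough illustration of those requests, a SLURM header might look like the sketch below. Every value is an assumption to adapt: the job name, time limit, file names and email address are placeholders, and your cluster may also require a partition or account line.

```shell
#!/bin/bash
# Hypothetical SLURM header; adjust every value to your own cluster and job.

# Placeholder job name
#SBATCH --job-name=autoSeurat
# One node and one task, since the workload is single-threaded
#SBATCH --nodes=1
#SBATCH --ntasks=1
# Memory for the job
#SBATCH --mem=64G
# Assumed run-time estimate (HH:MM:SS)
#SBATCH --time=04:00:00
# STDOUT and STDERR files; %j expands to the job ID
#SBATCH --output=output/seurat.%j.out
#SBATCH --error=output/seurat.%j.err
# Send an email to a placeholder address when the job ends
#SBATCH --mail-type=END
#SBATCH --mail-user=user@example.org
```

The output directory named in --output and --error has to exist before the job starts, because SLURM does not create it.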
The first step is to manually load the singularity module. As before, the module is hidden from the module spider command because it’s currently considered to be “under development” on our system. This is designated by the dot in front of the version.
The srun command calls singularity with the run option and then specifies which --app to run. Remember that the recipe file gives us three behaviors: two run R (either run or run --app R) and one runs Rscript.
At this point we can also mount the filespace we plan to use with the --bind option. This allows us to use /mnt to find all the files under /scratch/user.
Next is the name of the container to run.
Last is the argument we want to pass to the exec Rscript “$@” command. By the time this command is run, we’re working inside the container so we need the address after mounting the /scratch space, where I put the R script file in my containers folder. This took me several tries to figure out.
Here’s the full srun command:
srun singularity run --app Rscript --bind /scratch/user:/mnt bioconductorBox.simg /mnt/containers/autoSeurat.R
Once the data are uploaded (in my example, under /scratch/user/sample1) and the container image, the R script file, and the SLURM script are all under /scratch/user/containers, you’re ready to go. Be sure to create an output folder under containers to catch the STDOUT and STDERR files.
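Before submitting, a quick check that everything is where the job expects it can save a failed run. The sketch below is a hypothetical helper, not part of the SLURM script; the file and folder names follow the layout described in this post, and the base directory is passed as an argument.

```shell
#!/bin/sh
# Hypothetical pre-flight check for the layout described in this post.
# $1 is the host directory that will be bound to /mnt (here /scratch/user).
preflight() {
    base=$1
    status=0
    # Container image and R script live in the containers folder
    for f in containers/bioconductorBox.simg containers/autoSeurat.R; do
        [ -f "$base/$f" ] || { echo "missing file: $base/$f" >&2; status=1; }
    done
    # Output folder for STDOUT/STDERR, and the sample data folder
    for d in containers/output sample1; do
        [ -d "$base/$d" ] || { echo "missing directory: $base/$d" >&2; status=1; }
    done
    return "$status"
}
```

Running `preflight /scratch/user` returns a non-zero status and lists whatever is missing, so it can gate the sbatch submission.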
You should get an email saying that it ran without errors. Your STDOUT and STDERR output are in the containers/output folder. The R script saved files are all in the working directory you specified in the R script file (/scratch/user/sample1/outs/filtered_gene_bc_matrices/mm10).