An environment to manage workflows for data analysis projects, based on Conda environments and easily exportable to Docker images.
Often, computational analyses are performed on powerful dedicated servers or clusters, which means that the same machine must provide the code dependencies for all of the projects. A common solution is environment modules, which allow many packages (and many versions of the same package) to be installed on the computer while only the few that are needed are loaded by a user at any given time. This setup usually requires a system administrator to install all the modules and manage them in a centralised fashion.
Bioinfoconda adopts a different philosophy, based on the doctrine of so-called 'project-centrism'. The infrastructure is decentralised so that each project is responsible for its own software dependencies, and anybody who works on the project can (and should) install the packages that are needed, either through conda or with a manual compilation. In Bioinfoconda, each project has its own directory, conda environment, local libraries and executables. As soon as you `cd` into the project directory, the environment is automatically set up with project-specific files. There is no need to employ a full-time system administrator, as users can autonomously install the packages. Another important feature of Bioinfoconda is that it provides a way to download and store third-party data sets in an organised and self-documenting fashion. Last, but not least, projects can also be easily encapsulated in Docker containers.
1. Clone this repository. If you want a system-wide installation and have root permissions, you can clone it, for instance, directly under the root directory; if you are an unprivileged user, you can clone it under your home directory.

    ```bash
    git clone https://github.com/fmarotta/bioinfoconda
    ```

2. Install Miniconda (you can install it in any path, but if you want to have everything in one place, install it under bioinfoconda).
    - Follow the instructions here.

3. Install mamba in the base conda environment.

    ```bash
    conda install -c conda-forge mamba
    ```

4. Install direnv (if you do not have rootly powers, you can install it under bioinfoconda; there is probably a packaged version for your system).
    - See here for the instructions.

5. Create a group on GitLab and obtain an API token.
    - This will allow Bioinfoconda to automatically create a remote repository for each new project.
    - Read this documentation.
    - Edit $BIOINFO_ROOT/bioinfoconda/gitlab/config and add the username of your GitLab account, the group name, and the token. Note that the first line of this file is only a header and the second line initially contains some dummy values.

6. (Optional) Install Docker.
    - Instructions here.

7. (Optional) Install Bioinfotree.
    - Bioinfotree is a set of bioinformatic tools.
    - If you know about Bioinfotree at all, you probably also know how to install it.

8. Set up the environment.
    - Set BIOINFO_ROOT to the path where you cloned the repository.
    - Add bioinfoconda's executables to your PATH.
    - Add miniconda's executables to your PATH.
    - Make sure you have LC_ALL set.
    - Configure the bashrc so that direnv works.
    - (Only if you use bioinfotree) Add bioinfotree's executables to your PATH.
    - (Only if you use bioinfotree) Add bioinfotree's python and perl libraries to the environment.

9. (Useful in a multi-user system) Create a 'bioinfoconda' group, fix the directory permissions, and add the users who are supposed to use bioinfoconda to the 'bioinfoconda' group.

    ```bash
    sudo addgroup bioinfoconda
    sudo chown -R root:bioinfoconda $BIOINFO_ROOT
    sudo chmod -R 2775 $BIOINFO_ROOT
    ```

10. Having troubles?
    - Drop an e-mail to fmarotta, he's always happy to help.
Step 8, setting up the environment, is probably the one that requires the most work; it translates into appending the following to the .bashrc of every user of Bioinfoconda:
```bash
# Bioinfoconda configuration
# ----------------------------------------------------------------------

# Export environment variables
export LC_ALL="en_US.utf8"
export BIOINFO_ROOT="/bioinfoconda"
export CONDA_ROOT="$BIOINFO_ROOT/miniconda3"    # adjust to your Miniconda installation path
export PATH="$BIOINFO_ROOT/bioinfoconda/bin:$CONDA_ROOT/bin:$PATH"

# If you have bioinfotree, add also the following three exports
# export PATH="$BIOINFO_ROOT/bioinfotree/local/bin:$PATH"
# export PYTHONPATH="$BIOINFO_ROOT/bioinfotree/local/lib/python:$PYTHONPATH"
# export PERL5LIB="$BIOINFO_ROOT/bioinfotree/local/lib/perl:$PERL5LIB"

# Configure direnv
eval "$(direnv hook bash)"

# Show the conda environment in the prompt
# NOTE: direnv is not able to properly update the prompt when the conda
# environment changes, hence we need this to recover the functionality.
show_conda_env()
{
    if [ -n "$CONDA_DEFAULT_ENV" ] && \
       [ "$CONDA_DEFAULT_ENV" != "base" ] && \
       [ "${PS1:0:1}" != "(" ]; then
        echo "($CONDA_DEFAULT_ENV) "
    fi
}
export -f show_conda_env
PS1='$(show_conda_env)'$PS1
```
If you are the system administrator (or if you have great influence over them), you may as well put these instructions into /etc/skel/.bashrc, so that the configuration will be available to all new users.
Remember to adapt the previous instructions to your requirements and taste (for instance, you may need to change the path names; also make sure not to override any environment variable that is already set on your system).
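Once the .bashrc has been configured, a quick sanity check could look like this (just a sketch; none of these commands is required):

```bash
echo "$BIOINFO_ROOT"        # should print the path where you cloned the repository
which direnv conda mamba    # all three should be found through the PATH
direnv status               # shows direnv's state and which .envrc (if any) is loaded
```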
TODO: environment variables, miniconda, bioinfotree
To create a new project, simply run

```bash
X-mkprj PROJECT_NAME
```

This will: create a directory structure for the project; add a default Dockerfile and Snakefile; initialise a git repository; create a conda environment; and set up direnv. You can use some of `X-mkprj`'s options to disable the features you do not need. Projects are created in the $BIOINFO_ROOT/prj directory.

Thanks to direnv, each time you `cd` into the project directory you automatically switch to the conda environment of the project, and all the local executables and libraries become available in the PATH.
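For instance, a hypothetical session might look like this (my_project is a made-up name):

```bash
X-mkprj my_project
cd $BIOINFO_ROOT/prj/my_project   # direnv activates the project's conda environment here
which snakemake                   # now resolves inside the project's environment
```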
In Bioinfoconda, third-party data sets which will potentially be used in many different projects are stored in a separate place, the $BIOINFO_ROOT/data directory. To make downloading the data easier, we created the `X-getdata` command, whose syntax is

```bash
X-getdata URL
```

It supports the http, ftp, rsync, and now also ssh protocols, and it is able to download a directory recursively, as well as to download only those files which match a pattern. Third-party data sets are usually well documented and structured, therefore it makes sense to maintain their original structure when downloading them; `X-getdata` automatically suggests a possible location where to save the downloads: for instance, if the URL is http://foo.com/boo/bar/baz.vcf.gz, the suggestion will be foo/boo/bar/baz.vcf.gz. If you are not satisfied with the suggestion, you can override it manually. Files are downloaded inside the $BIOINFO_ROOT/data directory.

For each downloaded file, `X-getdata` creates an entry in a log file, so that it is easy to find out who downloaded a file and from where.
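For instance, using the example URL above (the exact interaction may differ):

```bash
X-getdata http://foo.com/boo/bar/baz.vcf.gz
# accepting the suggested location saves the file as
# $BIOINFO_ROOT/data/foo/boo/bar/baz.vcf.gz and adds an entry to the log
```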
Conda is both an environment manager and a package manager: it allows you to install packages and to create environments that keep groups of packages separate from each other. Bioinfoconda creates one environment for each project so as to ensure reproducibility. The problem with conda is that it tends to become much slower as the size of an environment increases; this has led to the development of mamba, a drop-in replacement for conda which relies on a much faster library for resolving dependencies, as well as on parallelisation.
Bioinfoconda internally uses mamba to create the environments, and it provides a wrapper around it for convenience. Instead of using `conda` or `mamba` directly, we recommend using `X-conda` to install packages. This usage is consistent with the other X- commands, and it makes it more difficult to accidentally use conda instead of mamba. Thus, to install a new package, use `X-conda install -c <channel> <pkgname>` instead of `conda install -c <channel> <pkgname>`.

Furthermore, you can use `X-conda` also to export or regenerate an environment. For exporting, write `X-conda export`, and a list of all the installed packages will be dumped into a file under the condafiles directory. Regenerating an environment means creating a brand new environment with the same name as the old one and re-installing all the packages at once; this can help resolve conflicts when the situation is desperate. Check out `X-conda regenerate --help` for the details.
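For instance (bwa and the bioconda channel are simply the examples used later in this guide):

```bash
# Install a new package into the current project's environment
X-conda install -c bioconda bwa

# Dump the list of installed packages into the project's condafiles directory
X-conda export
```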
When working on a project it is common to need a specific program to perform some operations. There are two options: either an existing program can be downloaded, or a new program has to be written. In the former case, the first thing to try is to install it with conda, i.e. `X-conda install -c <channel> <package>`. (In a new project, by default the only packages installed in the conda environment are snakemake, r, and perl.) If the package is not available in conda's repositories, you can download it manually by following the maintainer's instructions. Each new project is created with a predefined set of subdirectories, and you should install your programs in one of those; see below for a description of the intended purpose of each subdirectory. Always remember to check whether you need to add the executables to the PATH or the libraries to your environment. In such cases, you have to edit the project's .envrc, located in the project's home, so that direnv is aware of your configuration, as in the sketch below. When you need to write your own program, you should also put it in one of the project's subdirectories and make sure that executables and libraries are added to the environment. An important detail is that the shebang should be '#!/usr/bin/env XXX', not simply '#!/usr/bin/XXX'.
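A minimal sketch of such an .envrc addition, assuming direnv's standard library and a made-up tool name:

```bash
# .envrc (in the project's home) -- hypothetical lines for a manually installed tool
PATH_add local/builds/mytool/bin                # direnv helper: prepend this directory to the PATH
export MYTOOL_HOME="$PWD/local/builds/mytool"   # any extra variable the tool needs
```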
Each project is created with a specific directory structure, which will now be explained. Note, however, that these are just hints, so if you think your project would benefit from another structure, feel free to adopt it.
Directly under the project's home there are two subdirectories, dataset and local. The former is the one where the actual work is done, therefore it will contain the results of the analysis; the latter contains all the scripts, configuration files and other project-related things, such as the documentation or the draft of a scientific paper. Since local has many subdirectories of its own, we provide a table to describe each of them.
| Directory | Purpose |
| --- | --- |
| snakefiles | snakefiles for snakemake; they are symlinked in dataset |
| dockerfiles | dockerfiles which can be symlinked in the project's home |
| condafiles | conda yml files with the list of programs in the environment |
| benchmark | benchmarks of rules, scripts, or whole pipelines |
| config | config files for snakemake and other config files |
| src | sources of programs written by you or downloaded |
| bin | symlinks to the executables of programs whose source is in src |
| lib | programs that are not executed directly but contain functions called by other programs |
| builds | used to build programs from source |
| data | data sets that are used only for this project |
| R | R scripts |
| python | python scripts |
| doc | documentation, draft of the paper |
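Putting it together, a freshly created project looks roughly like this (a sketch; the exact layout may differ slightly):

```
my_project/
├── dataset/          # results of the analysis; the Snakefile is symlinked here
└── local/
    ├── snakefiles/
    ├── dockerfiles/
    ├── condafiles/
    ├── benchmark/
    ├── config/
    ├── src/
    ├── bin/
    ├── lib/
    ├── builds/
    ├── data/
    ├── R/
    ├── python/
    └── doc/
```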
To make these ideas concrete, let us look at an example. Say that we want to analyse the genome of some microbes that our colleague has just sequenced: we will map the reads to a reference genome and then call the variants.
**Creating the project.** To create the project, we run `X-mkprj microbe-genome-analysis`; now we have a directory structure and a conda environment, so we enter the project directory with `cd $BIOINFO_ROOT/prj/microbe-genome-analysis`. We are faced with two directories, dataset and local, and we enter dataset, where we find a Snakefile -- we like Snakemake, so it is the default workflow manager, but any alternative will work.
**Downloading data.** First of all, we need the data. Suppose the reference genome for the organism we are interested in is at https://microbe-genomes.com/downloads/vers_0.5/genome.fa: we simply run `X-getdata https://microbe-genomes.com/downloads/vers_0.5/genome.fa` to download the sequence, which will be saved at $BIOINFO_ROOT/data/microbe-genomes/vers_0.5/genome.fa.

Then we need the sample data. While the reference genome can be useful for many different projects, the sample data is specific to our own project, so we download it into local/data/samples/A.fastq. In general, we may use wget or curl to download remote files, or simply put them in local/data if we have a local copy.
**Writing the workflow.** The snakefile is actually a symbolic link to a file in local/snakefiles. Indeed, we separate scripts and configuration files (which are in local) from results (which are in dataset). Following the snakemake tutorial, we edit the snakefile and write:

```python
import os

# Snakemake does not expand shell variables like $BIOINFO_ROOT in input paths,
# so we read the variable from the environment instead.
BIOINFO_ROOT = os.environ["BIOINFO_ROOT"]

rule bwa_map:
    input:
        BIOINFO_ROOT + "/data/microbe-genomes/vers_0.5/genome.fa",
        BIOINFO_ROOT + "/prj/microbe-genome-analysis/local/data/samples/A.fastq"
    output:
        "mapped_reads/A.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
```
**Getting software -- conda.** At this stage we still don't have the bwa and samtools programs, so we need to download them. The first thing to do is to search the anaconda repositories to see whether bwa is packaged; it turns out that it is, and we can install it with `X-conda install -c bioconda bwa`. A very useful site is anaconda cloud, which not only lets you search anaconda packages, but also tells you which command to run to install them (check, for instance, the samtools page).
**Getting software -- manual compilation.** Similarly, samtools is packaged in the anaconda repositories, so we could install it as easily as typing `X-conda install -c bioconda samtools`; nevertheless, for the sake of this example, let us install it manually. We browse the web and find the samtools page at http://www.htslib.org/download/; we want to download the source code into the project's local/src, then install samtools under local/builds, and finally link the binaries in local/bin. We run:

```bash
cd $BIOINFO_ROOT/prj/microbe-genome-analysis/local/src
wget https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2
tar -xf samtools-1.9.tar.bz2
```

Then, following the instructions at the download page, we run

```bash
cd samtools-1.9
mkdir $BIOINFO_ROOT/prj/microbe-genome-analysis/local/builds/samtools-1.9
./configure --prefix=$BIOINFO_ROOT/prj/microbe-genome-analysis/local/builds/samtools-1.9
make
make install
```

We will probably need to download additional dependencies such as ncurses and bzip2. Always try to install them with conda; install them manually only as a last resort.
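For instance, a hedged one-liner (the conda-forge channel is an assumption; check anaconda cloud for the right channel and package names):

```bash
X-conda install -c conda-forge ncurses bzip2
```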
Missing libraries are a very common source of trouble. For instance, when compiling a program from within a conda environment it may happen that a library is not found in spite of being installed. If the library is installed in the system (i.e. under /usr), you may fix the problem by setting an environment variable specifying where the library is to be found. For instance, if you are compiling a rust program and the libz library is not found, but it is installed under /usr/lib/x86_64-linux-gnu, then you may set the environment variable RUSTFLAGS=-L/usr/lib/x86_64-linux-gnu. If you alter some environment variable, you should always make the change permanent by editing the .envrc and the dockerfile of your project.
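In this example, making the fix permanent would mean adding one line to the project's .envrc:

```bash
export RUSTFLAGS="-L/usr/lib/x86_64-linux-gnu"   # let cargo/rustc find the system libz
```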
We are nearly done: we have our source code in local/src/samtools-1.9 and the built program in local/builds/samtools-1.9. It is good practice to keep each program in its own directory (at least if the program is large; 30-line scripts may not deserve their own directory). Under local/builds/samtools-1.9 there is a bin directory with all the executables, so the only thing we need now is to make them reachable through our PATH environment variable. If we installed many packages this way, we would need to add many entries to the PATH; the recommended approach, however, is not to touch the PATH at all, but to symlink the executables into the project's local/bin directory, so that they are all in one place. local/bin is already in our PATH, therefore, from the project's home, we just run

```bash
ln -s $BIOINFO_ROOT/prj/microbe-genome-analysis/local/builds/samtools-1.9/bin/* local/bin/
```
To recap, the steps of manual compilation are: download the source code into local/src/program_name; configure it with --prefix pointing to local/builds/program_name (use the absolute path); build it by running `make` and `make install`; and finally symlink the executables into local/bin.
We are now ready to run our first snakemake rule, so we go back to the dataset directory and run

```bash
snakemake mapped_reads/A.bam
```

If everything worked well, we should now have some results under dataset, while all the files we used to get those results are under local; by the way, this separation of results from the rest makes it easier to back up only the important things, since the results can be obtained again by re-running the whole workflow. Remember that if you export this project to another machine, manually installed programs like samtools will have to be rebuilt! This is why using conda is so much better.
**Getting software -- installing libraries.** Now suppose we find an Excel file with microbe data, which we download under local/data/extra, and we want to parse it to enrich our analysis. We shall first provide an example of how to install a Perl module, Spreadsheet::ParseExcel. Its location should be under local/lib/perl5 (recall that by library we mean a program which is not executed directly but is called from other perl scripts). Again, first we search anaconda cloud for a packaged version; luckily we find it, so we could run `X-conda install -c biobuilds perl-spreadsheet-parseexcel`. However, for the sake of this example, we will not do that; instead, we are going to install it manually. For perl modules there are actually three options: (a) download the source code under local/src/Spreadsheet-ParseExcel and then compile it with `perl Makefile.PL INSTALL_BASE=local/lib/perl5`; (b) alternatively, use `cpan install Spreadsheet-ParseExcel`; (c) the best option, however, is to use cpanm.

Method (a) is annoying because if the module has many dependencies we have to install each of them manually. CPAN (method (b)) is more convenient, but it has to be configured for each user individually, which is somewhat inconvenient. The third option, cpanm, is the recommended one: it is as powerful as CPAN, but there is virtually no configuration to do; moreover, bioinfoconda supports it and makes sure that when you run `cpanm MODULE` from inside the project directory, the module source is downloaded under local/src, built under local/builds/cpanm and installed under local/lib/perl5.
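For instance, from anywhere inside the project directory:

```bash
cpanm Spreadsheet::ParseExcel
# the source ends up in local/src, the build in local/builds/cpanm,
# and the module itself in local/lib/perl5
```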
From the above discussion we can extract a general recommendation: when installing the interpreter of a programming language such as Perl, we should install the associated package manager as well. In this case, since we use Perl, we install perl and cpanminus; if we programmed in Node.js, we would install node and npm. Of course, when we say install we actually mean `X-conda install`.
For python modules, it suffices to run `pip install MODULENAME`: pip is handled by conda itself. For R modules (after making sure that the package is not in conda), just run R at the console and type `install.packages("pkgname")`; if it is a bioconductor package, type

```r
source("http://bioconductor.org/biocLite.R")
biocLite()
biocLite(c("pkgname1", "pkgname2"))
```

This should cover most common cases; if you cannot install your module in any of these ways, you'd probably better write your own library :)
**Getting software -- writing it.** Now that we have our module, we can start writing our own script to parse the Excel file. We start editing the file local/src/parse_microbe_samples.pl:

```perl
#!/usr/bin/env perl

use strict;
use warnings;
use Spreadsheet::ParseExcel;

...
```

Note, first, that we used the special shebang and, secondly, that we do not need to specify the local lib directory where we installed the Spreadsheet module: bioinfoconda takes care of that. Indeed, thanks to direnv, when we work on this project the environment variable PERL5LIB is set to local/lib/perl5 (for other programming languages it works similarly). When we are done, we should make sure that this script can be found through the PATH environment variable, so we simply link it under local/bin/parse_microbe_samples.
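One hedged way to create the link, starting from the project's home:

```bash
cd local/bin
ln -s ../src/parse_microbe_samples.pl parse_microbe_samples
chmod +x ../src/parse_microbe_samples.pl   # the script must be executable
```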
We now go back to the dataset directory and write a second snakemake rule (reusing the BIOINFO_ROOT variable defined at the top of the Snakefile):

```python
rule parse_excel:
    input:
        BIOINFO_ROOT + "/prj/microbe-genome-analysis/local/data/extra/microbe_data.xls"
    output:
        "microbe_data/summary.txt"
    shell:
        "parse_microbe_samples {input} > {output}"
```

Note also that we do not need to specify the path to our perl script, since it is linked under local/bin and local/bin is in the PATH environment variable.
**Running the workflow.** TODO
Having one conda environment per project is great for small pipelines, but as soon as the number of installed packages increases, the resolution of the environment starts to become slower than time near a black hole. Furthermore, it seems unreasonable to bloat the project environment with a package that is needed for just one rule; not to mention that for Bioconductor packages the resolution is often impossible.
In order to avoid the spaghettification of the brain (a concrete risk when waiting for the resolution of an environment), it is recommended to create multiple conda environments for each project: one 'main' environment with the indispensable tools (e.g. Snakemake, R, Rstudio), and then other sub-environments, one for each part of the analysis. Users are encouraged to use their best judgement when it comes to determining how many parts a project has; for example, one part could be labelled 'alignment', and it would contain all the software required for the alignment; another part could be called 'quality control'; yet another could be 'statistical analysis', or 'plots'. Snakemake makes it easy to use the appropriate conda environment for each rule: check this out here, and see the sketch below.
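A minimal sketch of such a rule, assuming a made-up alignment.yml environment file (which could live, for example, under local/condafiles); the conda directive only takes effect when snakemake is invoked with --use-conda:

```python
rule bwa_map:
    input:
        "reads/A.fastq"
    output:
        "mapped_reads/A.bam"
    conda:
        "alignment.yml"   # hypothetical environment file; the path is relative to the Snakefile
    shell:
        "bwa mem genome.fa {input} | samtools view -Sb - > {output}"
```

Running `snakemake --use-conda mapped_reads/A.bam` would then create the sub-environment on the fly and execute the rule inside it.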
- Always use relative links to access project-related files from within the project.

1. Install the rstudio package from conda for the project where you need it.

2. From a directory within the project, run the command `rstudio`. If you are working on a remote server, make sure that you connected with `ssh -X ...` (the -X option allows the server to forward graphical applications like Rstudio to the client).

3. Once Rstudio is up, create a project from an existing directory and choose the root directory of your project. This will make sure that Rstudio knows where to find the libraries and the Rprofile.
occam, the supercomputer of the University of Turin, is the main external cluster used by the bioinfoconda community. (If you use another service, such as AWS, and you want to contribute the instructions for your favorite platform, by all means make a PR or email the maintainer.)
Using occam is surprisingly difficult and many things are not explained in the official documentation. Here we provide a step-by-step guide for using occam the Bioinfoconda way. Before starting, however, please make sure you have a basic understanding of Docker and occam by reading the respective manuals and tutorials.
It is helpful to classify the files involved in the pipeline into three categories: input, output, and intermediate. To the output category belong all those files which we need but do not have yet; ideally there should be a single Snakemake rule (or equivalent) to generate all the output files at once. The intermediate files are those that are automatically generated by Snakemake; we need not worry about them. The input files are the ancestors of all the files in the analysis; typically they have been obtained by our wet-lab colleagues, or downloaded from the internet. Note that input does not refer to the input section of a single rule: it refers to the set of all files that are not generated by any rule or script; they are the files that go in the local/data directory of the project, in /bioinfo/data, or even inside other projects.

Moreover, it is helpful to keep in mind the distinction between: your local machine, where you write and run your pipelines on an everyday basis; occam, the supercomputer; and the docker container, which can run anywhere.
1. Write and test the pipeline on your local machine. Make sure that you can obtain the result that you want by running just one command. For example, if you use Snakemake, make sure that the dependencies are resolved correctly and that there is a rule to make everything; if you don't use Snakemake or similar software, you could write a custom script that executes all the steps in the pipeline. The important thing is that the target file and all its dependencies can be created with just one command, starting from a limited set of input files.

2. Export the environment by running

    ```bash
    X-conda export
    ```

    (It is good practice to do this every time you install something with conda, but it is mandatory before building the docker image.)

3. There is a default local/dockerfiles/Dockerfile inside each project, but you will need to do some editing:
    - the ymlfile path must be changed to match the one you exported in step 2;
    - if you manually created new environment variables, you have to add them to the dockerfile;
    - you may have to incorporate in the Dockerfile other manual changes you made: always think about that.

4. `cd` into the project directory and build the docker image by running:

    ```bash
    docker build -t "gitlab.c3s.unito.it:5000/user/project_name" -f local/dockerfiles/Dockerfile .
    ```

    Note that the last dot '.' is part of the command. Replace 'user' with your username on occam (not on the local machine) and 'project_name' with an arbitrary name (it helps if it is related to your project's name, for instance 'alignment_v2'). Important: the content of the dataset directory is not part of the docker image, therefore, if some of the input files are inside dataset, you will need to mount them as volumes; the master Snakefile, in particular, should always be mounted.

5. Test the image locally. There are many things that can be tested, but a basic check is to run

    ```bash
    docker run -it "gitlab.c3s.unito.it:5000/user/project_name"
    ```

    and explore the docker container to see if everything is where it should be. You could also try to mount the volumes (see the later steps) and run the snakemake command to create your target.
6. Create a repository on occam's GitLab (see the occam HowTo). It should be called "project_name", as in the previous steps.

7. Push the image to occam's registry by running the following two commands:

    ```bash
    docker login "gitlab.c3s.unito.it:5000"
    docker push "gitlab.c3s.unito.it:5000/user/project_name"
    ```

8. Log in to the occam website and use the Resource Booking System to reserve a slot. You will need to choose which node(s) to reserve and then estimate the time it will take to compute your pipeline using the chosen resources. Tips:
    - If you book 2 light nodes, you don't have 48 cores; you will have to run 2 separate containers, each with 24 cores.
    - The reservation can be updated (or deleted) only before its official start.
    - If you reserve two consecutive time slots for the same machine, the container will not be interrupted.
9. Log in to occam (see the HowTo) and `cd` into /scratch/home/user/, then create a directory called project_name (it is not important that it be called like that, but it helps).

10. Now it's time for the hard part: docker volumes. You will have to think of all the input files that your target needs. Suppose that, on your local machine, the target is /bioinfo/prj/project_name/dataset/analysis/correlation.tsv, while the input files are located under four different directories:
    - /bioinfo/data/ucsc/chromosomes
    - /bioinfo/prj/project_name/local/data/annotation
    - /bioinfo/prj/otherproject/dataset/expression
    - /bioinfo/prj/project_name/dataset/alignments

    You may need the last directory if, for instance, you have already run the alignment rules on your local machine and now you just need to compute the correlation, without re-doing the alignment. In this situation, my recommendation is to do as follows. From occam, `cd` into /scratch/home/user/project_name and create four directories:

    ```bash
    mkdir chromosomes
    mkdir annotation
    mkdir expression
    mkdir -p dataset/alignments
    ```

    Then, use `rsync` (or your favourite equivalent tool) to copy the files from your local machine to occam:

    ```bash
    rsync -a user@localmachineIP:/bioinfo/data/ucsc/chromosomes/* chromosomes
    rsync -a user@localmachineIP:/bioinfo/prj/project_name/local/data/annotation/* annotation
    rsync -a user@localmachineIP:/bioinfo/prj/otherproject/dataset/expression/* expression
    rsync -a user@localmachineIP:/bioinfo/prj/project_name/dataset/alignments/* dataset/alignments
    ```

    Lastly, copy the master Snakefile (this has to be done even if you don't mount any other volume):

    ```bash
    rsync -a user@localmachineIP:/bioinfo/prj/project_name/dataset/Snakefile dataset/
    ```
11. Test the image on occam, mounting all the volumes. In a sense, this step is the opposite of the previous one: before, we copied the directories from the local machine to occam; now we mount these directories in the docker container, at the same paths that they have on the local machine. I suggest to `cd` into /archive/home/user and create a directory named after the title of the reservation that you created through occam's resource booking system, with the current date appended: for instance, `mkdir projectX_correlation_2020-09-13`. Then, `cd` into the new directory and run:

    ```bash
    occam-run \
        -v /scratch/home/user/project_name/chromosomes:/bioinfo/data/ucsc/chromosomes \
        -v /scratch/home/user/project_name/annotation:/bioinfo/prj/project_name/local/data/annotation \
        -v /scratch/home/user/project_name/expression:/bioinfo/prj/otherproject/dataset/expression \
        -v /scratch/home/user/project_name/dataset:/bioinfo/prj/project_name/dataset \
        -t user/project_name \
        "snakemake -np target"
    ```

    The above command will start the container and execute the command that makes the target (here assuming that you use snakemake). Please note that the quotes ("") are important. The container will run on occam's node22, which is a management node reserved for testing purposes only, so we cannot run the whole analysis on this node; that's why we use snakemake with the `-n` (dry-run) option. When the above command exits, after a bit you will find three files called something like node22-5250.log, node22-5250.err, and node22-5250.done. Inspect their content and see if everything is OK. In particular, the .log file should contain the output of `snakemake -np target`. If something is wrong, try to fix the problem and repeat.

12. When the time of your reservation comes, run the same command as before from the same directory as before, but add occam-run's `-n nodeXX` option and drop snakemake's `-n` flag, so that the pipeline is actually executed. For instance, if you reserved node 17, write:

    ```bash
    occam-run \
        -n node17 \
        -v /scratch/home/user/project_name/chromosomes:/bioinfo/data/ucsc/chromosomes \
        -v /scratch/home/user/project_name/annotation:/bioinfo/prj/project_name/local/data/annotation \
        -v /scratch/home/user/project_name/expression:/bioinfo/prj/otherproject/dataset/expression \
        -v /scratch/home/user/project_name/dataset:/bioinfo/prj/project_name/dataset \
        -t user/project_name \
        "snakemake -p target"
    ```
13. Remember that if you have booked multiple nodes, you will need to run `occam-run` multiple times, once for each node. It is up to you to split the targets and mount the volumes appropriately. GOOD LUCK.
TODO:

- Log everything to journald
- Possibly send mails to bioinfoadmin
- Make the names coherent (bioinfo vs bioinfoconda...)
- Document the templates