In Snakemake, conda environments can be easily set up by adding a directive such as conda: "envs/my_environment.yaml" to a rule. This way, YAML files specify which packages to install prior to running the pipeline.
Some software requires a path to third-party software in order to execute specific commands.
An example of this is when generating a reference index with RSEM (example from GitHub page DeweyLab – RSEM):
rsem-prepare-reference --gtf mm9.gtf --star --star-path /sw/STAR -p 8 --prep-pRSEM --bowtie-path /sw/bowtie --mappability-bigwig-file /data/mm9.bigWig /data/mm9 /ref/mouse_0
Can I locate or predefine the directory (e.g. [workdir]/.snakemake/conda/STAR) for the STAR aligner software, which is installed via conda in a prior rule?
Currently, one option may be to create a shared environment folder using the command-line option --conda-prefix (Snakemake docs – Command-line interface). However, as this is a single-case issue, I would prefer to define this information in the rules.
Answer
There are two ways that I’ve dealt with this.
1: Let Conda Handle PATH
That specific option (--star-path) only needs to be specified if STAR is not on PATH. However, if STAR is included in your YAML for this rule, then Conda will place it on PATH as part of the environment activation, and so that option won’t be needed. The same goes for --bowtie-path. Hence, for such a rule the YAML might be something like:
    name: rsem
    channels:
      - conda-forge
      - bioconda
      - defaults
    dependencies:
      - rsem
      - star
      - bowtie
As per this thread, consider fixing the versions on the packages up to a minor version (e.g., bowtie=1.3).
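Whether this approach works depends on the tools actually being on PATH once the environment is active. A minimal sketch of a check, assuming it is run with the rule's environment activated; the helper name missing_tools and the executable names are illustrative assumptions, not part of Snakemake or RSEM:

```python
import shutil


def missing_tools(tools, path=None):
    """Return the entries of `tools` that cannot be found as executables.

    `path` is passed on to shutil.which; None means "search the current PATH".
    """
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


if __name__ == "__main__":
    # Executable names are assumptions about what the conda packages install.
    absent = missing_tools(["rsem-prepare-reference", "STAR", "bowtie"])
    print("missing:", absent if absent else "none")
```

If nothing is reported missing, the --star-path and --bowtie-path options can probably be left out of the command.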
2: Use config.yaml for Pipeline Options
If for some reason you don’t want a fully self-contained pipeline, e.g., your system already has lots of standard genomics software like STAR preinstalled, then you could include entries in your config.yaml that users adjust to match their system. For example, here are the relevant parts:
config.yaml
    star_path: /sw/STAR
    bowtie_path: /sw/bowtie
Snakefile
    configfile: "config.yaml"

    ## this is not a complete rule
    rule rsem_prep_ref:
        # needs input, output...
        params:
            star=config['star_path'],
            bowtie=config['bowtie_path']
        threads: 8
        conda: "envs/myenv.yaml"
        shell:
            """
            rsem-prepare-reference --gtf mm9.gtf --star --star-path {params.star} -p {threads} --prep-pRSEM --bowtie-path {params.bowtie} --mappability-bigwig-file /data/mm9.bigWig /data/mm9 /ref/mouse_0
            """
Really, anything your pipeline assumes already exists and is not generated by the pipeline itself should go into your config.yaml (e.g., mm9.gtf or mm9.bigWig).
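The two approaches can also be combined by treating the config entries as optional: pass a path flag only when the user has set one, and otherwise rely on PATH. A rough sketch of that idea in plain Python, which could sit at the top of a Snakefile; the helper name tool_path_flags and the PATH_FLAGS mapping are hypothetical, while the keys and flags are the ones from the example above:

```python
import shlex

# Maps config.yaml keys to the rsem-prepare-reference flags used in the example.
PATH_FLAGS = {"star_path": "--star-path", "bowtie_path": "--bowtie-path"}


def tool_path_flags(config):
    """Build path flags only for tools whose location is set in `config`."""
    parts = []
    for key, flag in PATH_FLAGS.items():
        value = config.get(key)
        if value:  # missing or empty entry: rely on PATH instead
            parts.append(f"{flag} {shlex.quote(str(value))}")
    return " ".join(parts)
```

In a rule, the result could be exposed as a single param (e.g., params: tool_paths=tool_path_flags(config)) and placed in the shell command as {params.tool_paths} in place of the two explicit options.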
Note on Sharing Environments
Generally, I advise against trying to share environments. However, you can still conserve space by sharing a package cache across users and making sure environments are created on the same filesystem (this lets Conda use hardlinks instead of copying). You can use the Conda configuration option pkgs_dirs to set package cache locations. If the pipeline itself is already on the same file system as the Conda package cache, I would just let Snakemake use the default location (.snakemake/conda) and not mess with the --conda-prefix argument.
Otherwise, you can give Snakemake the --conda-prefix argument to point to a directory on the same file system in which to create Conda environments. This should be a rather generic directory in which all environments for the pipeline get located. What was proposed in OP ([workdir]/.snakemake/conda/STAR) would not make sense.
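To decide between the default location and --conda-prefix, it helps to know whether two directories are on the same file system. A minimal sketch that compares device IDs, assuming both paths already exist; the function name same_filesystem is illustrative, and a matching device ID is a necessary condition for hardlinks rather than a guarantee that Conda will use them:

```python
import os
import sys


def same_filesystem(path_a, path_b):
    """Return True if both existing paths report the same device ID (st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


if __name__ == "__main__":
    # Expects two existing directories, e.g. the package cache and the
    # directory where the environments would be created.
    if len(sys.argv) == 3:
        print(same_filesystem(sys.argv[1], sys.argv[2]))
```

If the package cache and the pipeline directory report different devices, that is the case where pointing --conda-prefix at a directory next to the cache is worth considering.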