
Snakemake – How to set conda environment path

In Snakemake, conda environments can be set up per rule with the conda directive, e.g. conda: "envs/my_environment.yaml". The YAML files specify which packages to install before the pipeline runs.

Some software requires a path to third-party software in order to execute specific commands.

An example of this is when generating a reference index with RSEM (example from GitHub page DeweyLab – RSEM):

rsem-prepare-reference --gtf mm9.gtf \
                       --star \
                       --star-path /sw/STAR \
                       -p 8 \
                       --prep-pRSEM \
                       --bowtie-path /sw/bowtie \
                       --mappability-bigwig-file /data/mm9.bigWig \
                       /data/mm9 \
                       /ref/mouse_0

Can I locate or predefine the directory (e.g. [workdir]/.snakemake/conda/STAR) for the STAR aligner software, which is installed via conda in a prior rule?

Currently, one option may be to create a shared environment folder using the command-line option --conda-prefix (Snakemake docs – Command-line interface). However, as this is a single-case issue, I would prefer to define this information in the rules.


Answer

There are two ways that I’ve dealt with this.

1: Let Conda Handle PATH

That specific option (--star-path) only needs to be specified if STAR is not on PATH. However, if STAR is included in your YAML for this rule, then Conda will place it on PATH as part of the environment activation, and so that option won’t be needed. Same goes for --bowtie-path. Hence, for such a rule the YAML might be something like:

name: rsem
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - rsem
  - star
  - bowtie

As per this thread, consider pinning the package versions down to the minor version (e.g., bowtie=1.3).
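To confirm that the activated environment really does put the tools on PATH, you can look them up from inside it. The following is a minimal sketch and not part of the original answer: the helper name missing_tools is hypothetical, and the executable names (STAR, bowtie, rsem-prepare-reference) are assumed to be what the three packages install.

```python
import shutil


def missing_tools(tools, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    # Assumed executable names; adjust if your packages install others.
    missing = missing_tools(["STAR", "bowtie", "rsem-prepare-reference"])
    print("missing from PATH:", ", ".join(missing) if missing else "none")
```

If this reports nothing missing when run inside the rule's environment, the --star-path and --bowtie-path options can be dropped from the command.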

2: Use config.yaml for Pipeline Options

If for some reason you don’t want a fully self-contained pipeline, e.g., your system already has lots of standard genomics software like STAR preinstalled, then you could include entries in your config.yaml that users adjust to match their system. For example, here are the relevant parts:

config.yaml

star_path: /sw/STAR
bowtie_path: /sw/bowtie

Snakefile

configfile: "config.yaml"

## this is not a complete rule
rule rsem_prep_ref:
    # needs input, output...
    params:
        star=config['star_path'],
        bowtie=config['bowtie_path']
    threads: 8
    conda: "envs/myenv.yaml"
    shell:
        """
        rsem-prepare-reference --gtf mm9.gtf \
          --star \
          --star-path {params.star} \
          -p {threads} \
          --prep-pRSEM \
          --bowtie-path {params.bowtie} \
          --mappability-bigwig-file /data/mm9.bigWig \
          /data/mm9 \
          /ref/mouse_0
        """

Really, anything your pipeline assumes already exists and is not generated by the pipeline itself should go into your config.yaml (e.g., mm9.gtf or mm9.bigWig).
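The two approaches can also be combined, so that a path flag is only passed when the corresponding config entry is set and the tool is otherwise expected on PATH. The following is a rough sketch under that assumption: tool_path_flags is a hypothetical helper, not something Snakemake or RSEM provides, and it reuses the star_path and bowtie_path keys from the config.yaml above.

```python
import shlex


def tool_path_flags(config):
    """Build the optional RSEM path flags from a config dictionary.

    A flag is emitted only when its key is present and non-empty, so an
    empty config yields an empty string and the tools are taken from PATH.
    """
    flags = []
    if config.get("star_path"):
        flags += ["--star-path", shlex.quote(str(config["star_path"]))]
    if config.get("bowtie_path"):
        flags += ["--bowtie-path", shlex.quote(str(config["bowtie_path"]))]
    return " ".join(flags)
```

Defined at the top of the Snakefile, this could be used as params: tool_paths=tool_path_flags(config), with {params.tool_paths} taking the place of the two hard-coded options in the shell command.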


Note on Sharing Environments

Generally, I advise against trying to share environments. However, you can still conserve space by sharing a package cache across users and making sure environments are created on the same filesystem (this lets Conda use hardlinks instead of copying). You can use the Conda configuration option pkgs_dirs to set package cache locations. If the pipeline itself is already on the same file system as the Conda package cache, I would just let Snakemake use the default location (.snakemake/conda) and not mess with the --conda-prefix argument.

Otherwise, you can give Snakemake the --conda-prefix argument to point to a directory on the same file system in which to create Conda environments. This should be a rather generic directory in which all environments for the pipeline get located. What was proposed in OP ([workdir]/.snakemake/conda/STAR) would not make sense.
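Since hardlinks only work within one file system, it can help to check whether the package cache and the directory where environments will be created actually share one. A minimal sketch, assuming both paths exist and using a hypothetical helper name; the paths in the demo are placeholders, not Conda defaults:

```python
import os


def same_filesystem(path_a, path_b):
    """Return True when both existing paths are on the same device.

    Compares st_dev from os.stat, which identifies the device that
    holds each path.
    """
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


if __name__ == "__main__":
    # Placeholder locations; substitute your package cache (pkgs_dirs)
    # and the directory you would pass to --conda-prefix.
    pkgs_dir = os.path.expanduser("~")
    env_dir = os.getcwd()
    print("same file system:", same_filesystem(pkgs_dir, env_dir))
```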

User contributions licensed under: CC BY-SA