I’m trying to use porechop on several data with a Snakemake workflow.
In my Snakefile, there are three rules, a fastqc rule and a porechop rule, in addition to the all rule. The fastqc rule works very well, I have all three out for my three fastq. But for porechop, instead of running the command three times, it runs the command once with the -i flag for all three files at the same time:
Error in rule porechop: jobid: 2 output: /ngs/prod/nanocea_project/test/prod/porechop/25022021_2_pore.fastq.gz, /ngs/prod/nanocea_project/test/prod/porechop/02062021_1_pore.fastq.gz, /ngs/prod/nanocea_project/test/prod/porechop/02062021_2_pore.fastq.gz conda-env: /ngs/prod/nanocea_project/test/.snakemake/conda/a72fb141b37718b7c37d9f32d597faeb shell: porechop -i /ngs/prod/nanocea_project/test/reads/25022021_2.fastq.gz /ngs/prod/nanocea_project/test/reads/02062021_1.fastq.gz /ngs/prod/nanocea_project/test/reads/02062021_2.fastq.gz -o /ngs/prod/nanocea_project/test/prod/porechop/25022021_2_pore.fastq.gz /ngs/prod/nanocea_project/test/prod/porechop/02062021_1_pore.fastq.gz /ngs/prod/nanocea_project/test/prod/porechop/02062021_2_pore.fastq.gz -t 40 --discard_middle (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
However, when I use it with a single sample, the program works.
Here my code:
import glob import os ###Global Variables### FORMATS=["zip", "html"] DIR_FASTQ="/ngs/prod/nanocea_project/test/reads" ###FASTQ Files### def list_samples(DIR_FASTQ): SAMPLES=[] for file in glob.glob(DIR_FASTQ+"/*.fastq.gz"): base=os.path.basename(file) sample=(base.replace('.fastq.gz', '')) SAMPLES.append(sample) return(SAMPLES) SAMPLES=list_samples(DIR_FASTQ) ###Rules### rule all: input: expand("/ngs/prod/nanocea_project/test/stats/fastqc/{sample}_fastqc.{ext}", sample=SAMPLES, ext=FORMATS), expand("/ngs/prod/nanocea_project/test/prod/porechop/{sample}_pore.fastq.gz", sample=SAMPLES) rule fastqc: input: expand(DIR_FASTQ+"/{sample}.fastq.gz", sample=SAMPLES) output: expand("/ngs/prod/nanocea_project/test/stats/fastqc/{sample}_fastqc.{ext}", sample=SAMPLES, ext=FORMATS) threads: 16 conda: "envs/fastqc.yaml" shell: "fastqc {input} -o /ngs/prod/nanocea_project/test/stats/fastqc/ -t {threads}" rule porechop: input: expand(DIR_FASTQ+"/{sample}.fastq.gz", sample=SAMPLES) output: expand("/ngs/prod/nanocea_project/test/prod/porechop/{sample}_pore.fastq.gz", sample=SAMPLES) threads: 40 conda: "envs/porechop.yaml" shell: "porechop -i {input} -o {output} -t {threads} --discard_middle"
Do you have any idea what’s wrong?
Thanks !
Advertisement
Answer
This question comes up often… If you use expand()
in input:
or output:
then you are feeding the rule with a list of all the files. That is the same as writing:
input: ['sample1.fastq', 'sample2.fastq', ..., 'sampleN.fastq'], output: ['sample1.pore.fastq', 'sample2.pore.fastq', ..., 'sampleN.pore.fastq'],
To run the rule on each input/output just remove the expand:
rule porechop: input: DIR_FASTQ+"/{sample}.fastq.gz" output: "/ngs/prod/nanocea_project/test/prod/porechop/{sample}_pore.fastq.gz",