I’m trying to use porechop on several data with a Snakemake workflow.
In my Snakefile, there are three rules, a fastqc rule and a porechop rule, in addition to the all rule. The fastqc rule works very well, I have all three out for my three fastq. But for porechop, instead of running the command three times, it runs the command once with the -i flag for all three files at the same time:
JavaScript
x
8
1
Error in rule porechop:
2
jobid: 2
3
output: /ngs/prod/nanocea_project/test/prod/porechop/25022021_2_pore.fastq.gz, /ngs/prod/nanocea_project/test/prod/porechop/02062021_1_pore.fastq.gz, /ngs/prod/nanocea_project/test/prod/porechop/02062021_2_pore.fastq.gz
4
conda-env: /ngs/prod/nanocea_project/test/.snakemake/conda/a72fb141b37718b7c37d9f32d597faeb
5
shell:
6
porechop -i /ngs/prod/nanocea_project/test/reads/25022021_2.fastq.gz /ngs/prod/nanocea_project/test/reads/02062021_1.fastq.gz /ngs/prod/nanocea_project/test/reads/02062021_2.fastq.gz -o /ngs/prod/nanocea_project/test/prod/porechop/25022021_2_pore.fastq.gz /ngs/prod/nanocea_project/test/prod/porechop/02062021_1_pore.fastq.gz /ngs/prod/nanocea_project/test/prod/porechop/02062021_2_pore.fastq.gz -t 40 --discard_middle
7
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
8
However, when I use it with a single sample, the program works.
Here my code:
JavaScript
1
50
50
1
import glob
2
import os
3
4
###Global Variables###
5
6
FORMATS=["zip", "html"]
7
DIR_FASTQ="/ngs/prod/nanocea_project/test/reads"
8
9
###FASTQ Files###
10
11
def list_samples(DIR_FASTQ):
12
SAMPLES=[]
13
for file in glob.glob(DIR_FASTQ+"/*.fastq.gz"):
14
base=os.path.basename(file)
15
sample=(base.replace('.fastq.gz', ''))
16
SAMPLES.append(sample)
17
return(SAMPLES)
18
19
SAMPLES=list_samples(DIR_FASTQ)
20
21
###Rules###
22
rule all:
23
input:
24
expand("/ngs/prod/nanocea_project/test/stats/fastqc/{sample}_fastqc.{ext}", sample=SAMPLES, ext=FORMATS),
25
expand("/ngs/prod/nanocea_project/test/prod/porechop/{sample}_pore.fastq.gz", sample=SAMPLES)
26
rule fastqc:
27
input:
28
expand(DIR_FASTQ+"/{sample}.fastq.gz", sample=SAMPLES)
29
output:
30
expand("/ngs/prod/nanocea_project/test/stats/fastqc/{sample}_fastqc.{ext}", sample=SAMPLES, ext=FORMATS)
31
threads:
32
16
33
conda:
34
"envs/fastqc.yaml"
35
shell:
36
"fastqc {input} -o /ngs/prod/nanocea_project/test/stats/fastqc/ -t {threads}"
37
38
rule porechop:
39
input:
40
expand(DIR_FASTQ+"/{sample}.fastq.gz", sample=SAMPLES)
41
output:
42
expand("/ngs/prod/nanocea_project/test/prod/porechop/{sample}_pore.fastq.gz", sample=SAMPLES)
43
threads:
44
40
45
conda:
46
"envs/porechop.yaml"
47
shell:
48
"porechop -i {input} -o {output} -t {threads} --discard_middle"
49
50
Do you have any idea what’s wrong?
Thanks !
Advertisement
Answer
This question comes up often… If you use expand()
in input:
or output:
then you are feeding the rule with a list of all the files. That is the same as writing:
JavaScript
1
5
1
input:
2
['sample1.fastq', 'sample2.fastq', , 'sampleN.fastq'],
3
output:
4
['sample1.pore.fastq', 'sample2.pore.fastq', , 'sampleN.pore.fastq'],
5
To run the rule on each input/output just remove the expand:
JavaScript
1
7
1
rule porechop:
2
input:
3
DIR_FASTQ+"/{sample}.fastq.gz"
4
output:
5
"/ngs/prod/nanocea_project/test/prod/porechop/{sample}_pore.fastq.gz",
6
7