I have a workaround based upon this discussion, so I don’t think this problem is especially urgent.
However, before applying code to a larger number of folders, I would like to see if I can better understand what went wrong in an earlier version of the code.
Here is the Snakemake code:
    import pandas as pd
    import os

    data = pd.read_csv("mapping_list.csv").set_index('Subfolder', drop=False)
    SAMPLES = data["Subfolder"].tolist()
    OUTPREFIXES = data["Output"].tolist()

    def get_input_folder(wildcards):
        return data.loc[wildcards.sample]["Input"]

    def get_output_folder(wildcards):
        return data.loc[wildcards.sample]["Output"]

    rule all:
        input:
            expand(os.path.join("{outf}","{sample}"), zip, outf=OUTPREFIXES, sample=SAMPLES)

    rule copy_folders:
        input:
            infolder = get_input_folder,
            outfolder = get_output_folder
        output:
            subfolder = directory(os.path.join("{outf}","{sample}"))
        resources:
            mem_mb=1000,
            cpus=1
        shell:
            "cp -R {input.infolder} {input.outfolder}"
I think that the problem is that the {outf} and {sample} variables are not being defined correctly.
For example, let’s say {outf} can be further divided into {outf-PREFIX} and {outf-SUBFOLDER}, so {outf} is {outf-PREFIX}/{outf-SUBFOLDER}.
Here is the error message that I am seeing, with those placeholders instead of the observed values:
    Building DAG of jobs...
    InputFunctionException in line 22 of /path/to/Snakefile:
    KeyError: '{outf-SUBFOLDER}'
    Wildcards:
    outf={outf-PREFIX}
    sample={outf-SUBFOLDER}
In other words, the value of {sample} is not being used. I am assuming that the problem relates to the expand command.
Instead, {outf} and {sample} are being defined from the components that would make up the full {outf} ({outf-PREFIX} and {outf-SUBFOLDER}). So, I think the problem could be solved if Snakemake instead created the following mapping:

    outf={outf}
    sample={sample}
I also encounter a similar problem with the following code:
    import pandas as pd
    import os

    data = pd.read_csv("mapping_list.csv").set_index('FullOutSubfolder', drop=False)
    FULLOUTS = data["FullOutSubfolder"].tolist()

    def get_input_folder(wildcards):
        return data.loc[wildcards.sample]["Input"]

    def get_output_folder(wildcards):
        return data.loc[wildcards.sample]["Output"]

    rule all:
        input:
            expand("{sample}", sample=FULLOUTS)

    rule copy_folders:
        input:
            infolder = get_input_folder,
            outfolder = get_output_folder
        output:
            subfolder = directory("{sample}")
        resources:
            mem_mb=1000,
            cpus=1
        shell:
            "cp -R {input.infolder} {input.outfolder}"
In that situation, the output folder path is being truncated as a wildcard (losing the equivalent of the original {sample}), similar to the truncated {outf} above.
Can anybody please explain the problem or provide any suggestions?
Thank you very much!
Sincerely,
Charles
Update (7/7/2022): I believe that there was some confusion, so I hope that the additional information helps.
Here is an example with placeholder information for 2 lines similar to what would be seen in mapping_list.csv:
    FPID,Input,Output,Subfolder,FullOutSubfolder
    fp1,/path/to/InputFolderA/SampleA,/path/to/OutputPrefixA/OutputFolderA,SampleA,/path/to/OutputPrefixA/OutputFolderA/SampleA
    fp2,/path/to/InputFolderB/SampleB,/path/to/OutputPrefixB/OutputFolderB,SampleB,/path/to/OutputPrefixB/OutputFolderB/SampleB
To use that example, there are no variables called {outf-PREFIX} and {outf-SUBFOLDER}.
Instead, these are the intended values for the 1st row:
    {outf}=/path/to/OutputPrefixA/OutputFolderA
    {sample}=SampleA
and these are the values incorrectly defined by Snakemake:
    {outf}=/path/to/OutputPrefixA
    {sample}=OutputFolderA
So, my understanding is that the intended value of {sample} is not being used, and both variables are being defined from splitting the path from {outf}.
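One possible explanation, which the post itself does not confirm, is that Snakemake also tries to match input paths against rule outputs, and an unconstrained wildcard behaves like the regular expression `.+`. Because `outfolder` is declared as an input of `copy_folders`, the path /path/to/OutputPrefixA/OutputFolderA could itself be matched against the output pattern of the same rule. The sketch below imitates that matching in plain Python; `match_pattern`, `rows_by_subfolder`, and `lookup_input` are hypothetical stand-ins for illustration, not Snakemake functions.

```python
import re

def match_pattern(pattern, path):
    """Imitate wildcard matching: each {name} becomes a greedy (?P<name>.+) group.

    Assumption: no wildcard constraints are set, so every wildcard may span "/".
    """
    regex = re.sub(r"\{(\w+)\}", r"(?P<\1>.+)", pattern)
    m = re.fullmatch(regex, path)
    return m.groupdict() if m else None

# Requested target from "rule all": yields the intended wildcard values.
intended = match_pattern("{outf}/{sample}",
                         "/path/to/OutputPrefixA/OutputFolderA/SampleA")

# The value returned by get_output_folder, treated as a file some rule might produce.
# The same pattern matches it too, but splits the path one level higher.
observed = match_pattern("{outf}/{sample}",
                         "/path/to/OutputPrefixA/OutputFolderA")

# Second variant of the Snakefile: a lone {sample} matches any path whatsoever.
single = match_pattern("{sample}", "/path/to/OutputPrefixA/OutputFolderA")

# Stand-in for data.loc[...] with 'Subfolder' as the index (values from row fp1).
rows_by_subfolder = {
    "SampleA": {"Input": "/path/to/InputFolderA/SampleA",
                "Output": "/path/to/OutputPrefixA/OutputFolderA"},
}

def lookup_input(wildcards):
    """Mimic get_input_folder; raises KeyError for an unknown sample."""
    return rows_by_subfolder[wildcards["sample"]]["Input"]

print(intended)
print(observed)
print(single)
```

If this reading is right, the KeyError would come from calling the input function with sample=OutputFolderA, which is not a value in the Subfolder column. Under that assumption, removing `outfolder` from `input:` (and referring to it through `params:` instead) or constraining the wildcards would be options worth testing.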
Answer
I am still curious what caused the earlier problem.
However, if somebody else encounters the same problem, then I do have a workaround.
The details are in the following post:
Snakemake: Mismatched Wildcards Variable Values for “output” Rule
Basically, I added extra shell commands and copied a small file into the output directory. I then used the small file as the endpoint, instead of the copied output directory.
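As a rough illustration of that idea only (the linked post has the actual commands), the copy step can be followed by creating a small marker file, and the marker's path is then used as the target. The function name `copy_with_marker` and the file name `copy_done.txt` are hypothetical, and writing the marker here stands in for copying a small file.

```python
import shutil
from pathlib import Path

def copy_with_marker(infolder, outfolder, marker_name="copy_done.txt"):
    """Copy infolder into outfolder (similar to `cp -R`), then add a small marker file.

    Returns the marker path, which can serve as the endpoint in place of
    the copied directory itself.
    """
    src = Path(infolder)
    dest = Path(outfolder) / src.name
    # dirs_exist_ok requires Python 3.8 or newer.
    shutil.copytree(src, dest, dirs_exist_ok=True)
    marker = dest / marker_name
    marker.write_text("done\n")
    return marker
```

With a file as the endpoint, the rule output is a concrete file path rather than a directory pattern.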