Within my snakemake pipeline I’m trying to retrieve the correct wildcards. I’ve looked into wildcard_constraints and this post and this post, however I can’t figure out the exact solution.
Here’s an example of file names within 2 datasets. 1 dataset contains paired mouse RNAseq read files and another dataset contains human paired RNAseq read files.
“Mus_musculus” dataset is “PRJNA362883_GSE93946_SRP097621” with file names:
“SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_1.fastq.gz” “SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_2.fastq.gz”
“Homo_sapiens” dataset is “PRJNA362883_GSE93946_SRP097621” with file names:
“SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_1.fastq.gz” “SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_2.fastq.gz”
I would like glob_wildcards to spit out the following wildcards
> ['PRJNA362883_GSE93946_SRP097621', 'PRJNA362883_GSE93946_SRP097621', > 'PRJNA493818_GSE120639_SRP162872', 'PRJNA493818_GSE120639_SRP162872'] > ['SRR5195524_GSM2465521_KrasT_45649_NoDox', > 'SRR5195524_GSM2465521_KrasT_45649_NoDox', > 'SRR7942395_GSM3406786_sAML_Control_1', > 'SRR7942395_GSM3406786_sAML_Control_1'] ['Mus_musculus', 'Mus_musculus', > 'Homo_sapiens', 'Homo_sapiens'] ['1', '2', '1', '2']
I’ve tried the following code:
> import glob import os > > DATASET,SAMPLE,SPECIES,FRR, > =glob_wildcards(config["project_path"]+"resources/raw_datasets/{dataset}/{sample}_{species}_RNA-Seq_{frr}.fastq.gz") > print(DATASET,SAMPLE,SPECIES,FRR)
However, I get this as output. The many underscores messes up the glob_wildcards
> ['PRJNA362883_GSE93946_SRP097621', 'PRJNA362883_GSE93946_SRP097621', > 'PRJNA493818_GSE120639_SRP162872', 'PRJNA493818_GSE120639_SRP162872'] > ['SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus', > 'SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus', > 'SRR7942395_GSM3406786_sAML_Control_1_Homo', > 'SRR7942395_GSM3406786_sAML_Control_1_Homo'] ['musculus', 'musculus', > 'sapiens', 'sapiens'] ['1', '2', '1', '2']
I for example tried this, but the output stays the same: wildcard_constraints: species = ‘!(sapiens)’
Could anyone suggest the correct code to get the wanted wildcards? Thanks in advance!
Advertisement
Answer
If suitable for your case, it would be better to prepare a sample sheet or yaml file that indicates the characteristics of each file (e.g. the species, dataset, etc.). Then use python code (e.g. using pandas) to extract the wildcard values and direct the pipeline.
In my opinion, extracting information from filenames is brittle unless the names have been created within the pipeline and you have full control over them.