I tried to write a script that loops through a parent folder and its subfolders and merges all of the PDFs into one file. Below is the code I have written so far, but I don’t know how to combine the two parts into one script.
Reference: Merge PDF files
The first function loops through all of the subfolders under the parent folder and gets the path of each PDF.
import os
from PyPDF2 import PdfFileMerger

root = r"folder path"
path = os.path.join(root, "folder path")

def list_dir():
    for path,subdirs,files in os.walk(root):
        for name in files:
            if name.endswith(".pdf") or name.endswith(".ipynb"):
                print (os.path.join(path,name))
Second, I created a list to collect the paths of all PDF files in the subfolders and merge them into one combined file. At this step, I got this error:
TypeError: listdir: path should be string, bytes, os.PathLike or None, not list
root_folder = []
root_folder.append(list_dir())

def pdf_merge():
    merger = PdfFileMerger()
    allpdfs = [a for a in os.listdir(root_folder)]
    for pdf in allpdfs:
        merger.append(open(pdf,'rb'))
    with open("Combined.pdf","wb") as new_file:
        merger.write(new_file)

pdf_merge()
What should I change in the code, and where, to avoid the error and combine the two functions?
Answer
First you have to write a function that builds a list of all the files and returns it.
import os

def list_dir(root):
    result = []
    # os.walk() visits root and every subfolder below it
    for path, dirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith( (".pdf", ".ipynb") ):
                result.append(os.path.join(path, name))
    return result
I also use .lower() to catch extensions like .PDF. endswith() accepts a tuple with all the extensions.
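As a quick illustration of those two points, here is a minimal sketch. The file names are made up for the example and are not from the question.

```python
# Hypothetical file names, used only to illustrate the matching rule.
names = ["report.pdf", "SCAN.PDF", "notes.ipynb", "photo.png"]

# str.endswith() accepts a tuple and is true if any one suffix matches.
wanted = (".pdf", ".ipynb")

# Lower-casing the name first makes the check case-insensitive.
matches = [name for name in names if name.lower().endswith(wanted)]

print(matches)
```

Without .lower(), "SCAN.PDF" would be skipped, because endswith() compares characters exactly.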
It is good practice to pass external values in as arguments: list_dir(root) instead of list_dir().
Later you can use it, for example as allpdfs = list_dir("folder path"), inside the merge function:
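# PyPDF2 3.0 and later removed PdfFileMerger; PdfMerger is its replacement there.
from PyPDF2 import PdfFileMerger

def pdf_merge(root):
    merger = PdfFileMerger()
    # list_dir() above also returns .ipynb files, which cannot be merged as PDFs
    allpdfs = [p for p in list_dir(root) if p.lower().endswith(".pdf")]
    for pdf in allpdfs:
        merger.append(open(pdf, 'rb'))
    with open("Combined.pdf", 'wb') as new_file:
        merger.write(new_file)

pdf_merge("folder path")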
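For context, the TypeError in the question most likely comes from two things together: the original list_dir() only prints and therefore returns None, and os.listdir() was then given a list instead of a path. The sketch below reproduces this with a stand-in function; list_dir_printing is a hypothetical name used only here.

```python
import os

def list_dir_printing():
    # Mimics the question's function: it prints but has no return statement.
    print("some/file.pdf")

value = list_dir_printing()   # a function without return gives None
root_folder = [value]         # same result as root_folder.append(list_dir())

try:
    os.listdir(root_folder)   # a list is not a valid path argument
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

Returning the list from the function and passing a plain string path, as in the code above, avoids both problems.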
EDIT:
The first function could be even more general if it also took the extensions as an argument:
import os

def list_dir(root, exts=None):
    result = []
    for path, dirs, files in os.walk(root):
        for name in files:
            if exts and not name.lower().endswith(exts):
                continue
            result.append(os.path.join(path, name))
    return result

all_files = list_dir('folder_path')
all_pdfs = list_dir('folder_path', '.pdf')
all_images = list_dir('folder_path', ('.png', '.jpg', '.gif'))

print(all_files)
print(all_pdfs)
print(all_images)
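One thing to keep in mind when the result feeds a merge: the order in which os.walk() reports files depends on how the operating system lists directories, so the page order of the combined file may vary. A minimal sketch of one way to make it predictable is to sort the list first. The folder tree and file names below are throwaway placeholders created only for the demonstration.

```python
import os
import tempfile

def list_dir(root, exts=None):
    # Same idea as the function above, repeated so the sketch is self-contained.
    result = []
    for path, dirs, files in os.walk(root):
        for name in files:
            if exts and not name.lower().endswith(exts):
                continue
            result.append(os.path.join(path, name))
    return result

# Build a temporary folder tree with empty placeholder files.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("b.pdf", "a.PDF", os.path.join("sub", "c.pdf"), "notes.txt"):
        open(os.path.join(root, rel), "w").close()

    # Sort case-insensitively so the order does not depend on the OS listing.
    all_pdfs = sorted(list_dir(root, ".pdf"), key=str.lower)
    relative = [os.path.relpath(p, root) for p in all_pdfs]

print(relative)
```

Sorting by the full path is only one possible choice; sorting by file name or modification time would work the same way with a different key.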
EDIT:
For a single extension you can also use glob:
import glob

all_pdfs = glob.glob('folder_path/**/*.pdf', recursive=True)
It needs ** together with recursive=True to search in subfolders.
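To see what recursive=True changes, the sketch below runs the same pattern with and without it on a temporary folder tree. The file names are placeholders invented for the demonstration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Placeholder tree: one PDF at the top, one in a subfolder, one deeper.
    os.makedirs(os.path.join(root, "sub", "deep"))
    for rel in ("top.pdf",
                os.path.join("sub", "mid.pdf"),
                os.path.join("sub", "deep", "low.pdf")):
        open(os.path.join(root, rel), "w").close()

    # glob.escape() protects against special characters in the temp path.
    pattern = os.path.join(glob.escape(root), "**", "*.pdf")

    # With recursive=True, ** matches zero or more nested folders.
    recursive_hits = sorted(
        os.path.relpath(p, root) for p in glob.glob(pattern, recursive=True)
    )

    # Without it, ** behaves like a single *, so only one folder level matches.
    flat_hits = sorted(os.path.relpath(p, root) for p in glob.glob(pattern))

print(recursive_hits)
print(flat_hits)
```

Note that glob matching follows the file system's case rules, so on a case-sensitive system this pattern would not find files ending in .PDF; the list_dir() approach with .lower() handles that case.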