
Loop through folder and subfolders and merge pdf

I am trying to write a script that loops through a parent folder and its subfolders and merges all of the PDFs into one file. Below is the code I have written so far, but I don't know how to combine the two parts into one script.

Reference: Merge PDF files

The first function loops through all of the subfolders under the parent folder and prints the path of each PDF.

import os
from PyPDF2 import PdfFileMerger

root = r"folder path"
path = os.path.join(root, "folder path")

def list_dir():
    for path,subdirs,files in os.walk(root):
        for name in files:
            if name.endswith(".pdf") or name.endswith(".ipynb"):
                print (os.path.join(path,name))

            
            

Second, I created a list to hold the paths of the PDF files found in the subfolders, and tried to merge them into one combined file. At this step I got the following error:

TypeError: listdir: path should be string, bytes, os.PathLike or None, not list

root_folder = []
root_folder.append(list_dir())
    
def pdf_merge():
    
    merger = PdfFileMerger()    
    allpdfs = [a for a in os.listdir(root_folder)]

    
    for pdf in allpdfs:
        merger.append(open(pdf,'rb'))
        
    with open("Combined.pdf","wb") as new_file:
        merger.write(new_file)

pdf_merge()

What should I change in the code, and where, to avoid the error and combine the two functions?

Answer

First, you need a function that builds a list of all matching files and returns it (instead of only printing the paths).

def list_dir(root):
    result = []
    
    for path, dirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith( (".pdf", ".ipynb") ):
                result.append(os.path.join(path, name))
                
    return result

I also use .lower() so that extensions like .PDF are matched.

endswith() accepts a tuple, so all extensions can be checked in one call.

It is better to pass external values in as arguments: list_dir(root) instead of list_dir() reading a global variable.
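The TypeError in the question seems to come from two things: the original list_dir() only prints, so it returns None, and os.listdir() expects a single path rather than a list. A minimal sketch of both effects, where print_paths and collect_paths are hypothetical stand-ins for the two versions of list_dir:

```python
import os

def print_paths(names):
    # Like the question's list_dir(): prints each item but returns nothing
    for name in names:
        print(name)

def collect_paths(names):
    # Like the answer's list_dir(): builds a list and returns it
    result = []
    for name in names:
        result.append(name)
    return result

names = ["a.pdf", "b.pdf"]

printed = print_paths(names)      # a function without `return` gives None
collected = collect_paths(names)  # the list can be reused by other code

root_folder = []
root_folder.append(printed)       # the list now holds a single None

try:
    os.listdir(root_folder)       # a list is not a valid path argument
except TypeError as exc:
    error = exc
    print("TypeError:", exc)
```

Because list_dir(root) already returns full paths, os.listdir() is not needed in pdf_merge at all.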


Then you can call it like this

allpdfs = list_dir("folder path")

inside the merge function:

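def pdf_merge(root):

    merger = PdfFileMerger()  # this class is named PdfMerger in PyPDF2 3.0 and later
    # list_dir() also returns .ipynb files; only PDFs can be merged
    allpdfs = [p for p in list_dir(root) if p.lower().endswith(".pdf")]

    for pdf in allpdfs:
        merger.append(pdf)  # append() accepts a path as well as a file object

    with open("Combined.pdf", 'wb') as new_file:
        merger.write(new_file)

    merger.close()  # closes the input files opened by the merger

pdf_merge("folder path")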
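Two details may matter in practice. os.walk() does not promise alphabetical order, so the page order of Combined.pdf can vary between machines. Also, pdf_merge writes Combined.pdf into the current working directory, so if that directory is inside root, a second run would pick up the previous output as an input. A minimal sketch that handles both, where list_pdfs and its exclude parameter are hypothetical names and not part of the code above:

```python
import os

def list_pdfs(root, exclude=None):
    """Return PDF paths under root in sorted order, skipping `exclude`."""
    excluded = os.path.abspath(exclude) if exclude else None
    result = []

    for path, dirs, files in os.walk(root):
        dirs.sort()  # sorting in place makes os.walk visit subfolders alphabetically
        for name in sorted(files):
            if not name.lower().endswith(".pdf"):
                continue
            full = os.path.join(path, name)
            if excluded and os.path.abspath(full) == excluded:
                continue  # do not merge a previous output file into itself
            result.append(full)

    return result
```

It could be used as allpdfs = list_pdfs(root, exclude="Combined.pdf") in place of the list_dir(root) call.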

EDIT:

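The first function can be made more general by also taking the extensions as an argument. exts should be a lowercase string or a tuple of lowercase strings, because it is compared against the lowercased file name.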

import os

def list_dir(root, exts=None):
    result = []
    
    for path, dirs, files in os.walk(root):
        for name in files:
            if exts and not name.lower().endswith(exts):
                continue

            result.append(os.path.join(path, name))
                
    return result

all_files  = list_dir('folder_path')
all_pdfs   = list_dir('folder_path', '.pdf')
all_images = list_dir('folder_path', ('.png', '.jpg', '.gif'))

print(all_files)
print(all_pdfs)
print(all_images)
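Note that list_dir passes exts straight to endswith(), which accepts a string or a tuple but raises TypeError for a list, and that an uppercase extension such as '.PDF' never matches the lowercased name. A small helper could normalize the argument first; normalize_exts is a hypothetical name used only for this sketch:

```python
def normalize_exts(exts):
    """Return exts as a lowercase tuple usable with str.endswith(), or None."""
    if not exts:
        return None
    if isinstance(exts, str):
        exts = (exts,)  # wrap a single extension so the result is always a tuple
    return tuple(ext.lower() for ext in exts)

name = "Report.PDF"
exts = normalize_exts([".PDF", ".Jpg"])  # a list with mixed case is accepted
matches = name.lower().endswith(exts)
print(exts, matches)
```

Calling exts = normalize_exts(exts) at the top of list_dir would let callers pass a list or mixed-case extensions.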

EDIT:

For a single extension you can also use glob:

import glob

all_pdfs = glob.glob('folder_path/**/*.pdf', recursive=True)

The ** in the pattern, together with recursive=True, is what makes glob search in subfolders.
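One caveat: on POSIX systems glob patterns are matched case-sensitively, so '*.pdf' would not find 'report.PDF'. If that matters, or if several extensions are needed, a pathlib-based variant is one option. This is a sketch only, and find_by_suffix is a hypothetical name:

```python
from pathlib import Path

def find_by_suffix(root, suffixes):
    """Return sorted paths of files under root whose suffix is in `suffixes`."""
    wanted = {s.lower() for s in suffixes}
    return sorted(
        str(p)
        for p in Path(root).rglob("*")  # rglob("*") walks all subfolders
        if p.is_file() and p.suffix.lower() in wanted
    )
```

For example, find_by_suffix('folder_path', ['.pdf']) would return both report.pdf and report.PDF.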

User contributions licensed under: CC BY-SA