How to rename a PDF file using text extracted from the PDF file?

I am trying to use Python to rename a PDF file using part of its content. Here is the situation.

The PDF file is a commercial invoice that contains the labels “Commercial Invoice” and “Department”. I want to rename the file using the values that appear after these two labels, for example “353624 HR”.

Here is what I have so far:

from StringIO import StringIO
import pyPdf
import os

# a function here
def getPDFContent(path):
    content = ""
    num_pages = 10
    p = file(path, "rb")
    pdf = pyPdf.PdfFileReader(p)
    for i in range(0, num_pages):
        content += pdf.getPage(i).extractText() + "\n"
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        return content 

# name of the source PDF file
PDF_name = '222'

# picking texts from the PDF file
pdfContent = StringIO(getPDFContent("C:\\" + PDF_name + ".pdf").encode("ascii", "ignore"))
for line in pdfContent:
    aaa = line.find(' Commercial Invoice ')
    CIN = line[aaa + 28: aaa + 38]
    bbb = line.find('Department')
    Dpt = line [bbb+20 : bbb+26]

    final_name = str(CIN + " " + Dpt)
    
print final_name

f = open("C:\\" + PDF_name + ".pdf")
f.close()

os.rename("C:\\" + PDF_name + ".pdf", "C:\\" + final_name + ".pdf")

It works up to the point where the extracted text is printed (print final_name), but the last step, renaming the file, fails with “WindowsError: [Error 32] The process cannot access the file because it is being used by another process”.

What went wrong here? Was the file not closed properly at some point?

Answer

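In getPDFContent(path), the file opened with p = file(path, "rb") is never closed, so Windows still treats it as in use when os.rename runs. The open/close pair just before os.rename creates and closes a second handle; it does not close p. Close p once the content has been copied:

p.close()

Put this inside the function, after the text has been extracted and before the function returns.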
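
An alternative to calling close() by hand is a with block, which closes the file when the block exits, even if an exception is raised. The sketch below is only an illustration of that pattern, written for Python 3 with the standard library: it uses a throwaway temporary file in place of the real PDF and plain read() in place of pyPdf, and the names read_and_close, work_dir, old_path and new_path are made up for the example.

```python
import os
import tempfile


def read_and_close(path):
    """Read a file's bytes; the with block closes the handle on exit."""
    with open(path, "rb") as handle:
        data = handle.read()
    # The handle is already closed here, so nothing keeps the file locked.
    return data, handle.closed


# Throwaway file standing in for the PDF (hypothetical names and content).
work_dir = tempfile.mkdtemp()
old_path = os.path.join(work_dir, "222.pdf")
new_path = os.path.join(work_dir, "353624 HR.pdf")

with open(old_path, "wb") as out:
    out.write(b"placeholder content")

data, was_closed = read_and_close(old_path)

# No open handle remains on old_path, so the rename can proceed.
os.rename(old_path, new_path)
renamed_ok = os.path.exists(new_path) and not os.path.exists(old_path)
print(was_closed, renamed_ok)
```

In the question's function, the same idea would mean opening the PDF inside a with block and doing the page extraction within it, so the handle is released before os.rename is called.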

User contributions licensed under: CC BY-SA