I have a folder full of PDFs which I have parsed using Apache Tika, and I have a template excel file which I use to gather specific information from those PDFs and store using openpyxl.
The issue I am having is looping through using openpyxl rows.
For example, if there is just one PDF in folder, the values go in:
#C3, C4, F3, C13, C15, C17
but if there are more than one PDF’s, the index is just incremented by 20 for each additional PDF and stored in the same excel file so with 2 PDFs it will store the info in: C23, C24, F24 and so on.
import glob, os, fnmatch
from tika import parser
import openpyxl
input_path = "../Input"
template = "../Project/template.xlsm"
output_dir = "../Output"
pdfCounter = len(fnmatch.filter(os.listdir(input_path),'*.pdf'))
print(pdfCounter)
def extract_text(file):
parsed = parser.from_file(file)
parsed_text = parsed['content']
# print(parsed_text)
return parsed_text
wb = openpyxl.load_workbook(template, read_only=False, keep_vba=True)
sheet = wb.active
input_file = glob.glob(os.path.join(input_path, '*.pdf'))
for file in input_file:
do_stuff():
sheet['C3'] = value
sheet['C4'] = value2
sheet['F3'] = value3
sheet['C13'] = value4
sheet['C15'] = value5
sheet['C17'] = value6
output_file = os.path.join(output_dir, "report.xlsm")
wb.save(output_file)
I have a pdfCounter that counts the # of PDF’s in the folder and I am trying to figure out a way to increment the index based on that or if there is a better way to do this.
I just don’t understand how to loop based on # of pdf’s in file and increment the index by 20 so it doesn’t overwrite the same stuff as it is doing right now.
Advertisement
Answer
Edit: I can’t test this, but maybe it will work. Instead of looping through a range equal to pdfCounter, I’m looping over the files in input_file so I have access to the current file in the current iteration. The ‘value’ variables should be updated with info based on the current iteration’s file and then written to worksheet.
input_file = glob.glob(os.path.join(input_path, '*.pdf'))
def get_values(file):
do_stuff()
return value, value2, value3, value4, value5, value6
for x,file in enumerate(input_file):
C = UPPER[2]
D = UPPER[3]
F = UPPER[5]
first_cell = f'{C}{3 + (20 * x)}'
second_cell = f'{C}{4 + (20 * x)}'
third_cell = f'{F}{3 + (20 * x)}'
fourth_cell = f'{C}{13 + (20 * x)}'
fifth_cell = f'{C}{15 + (20 * x)}'
sixth_cell = f'{C}{17 + (20 * x)}'
value, value2, value3, value4, value5, value6 = get_values(file)
ws[first_cell] = value
ws[second_cell] = value2
ws[third_cell] = value3
ws[fourth_cell] = value4
ws[fifth_cell] = value5
ws[sixth_cell] = value6
output_file = os.path.join(output_dir, "report.xlsm")
wb.save(output_file)
Original:
from openpyxl import load_workbook
import openpyxl
from string import ascii_uppercase as UPPER
# file paths
# results from pdfCounter
pdfCounter = 5
# extract your text
# load your workbook
wb = load_workbook()
ws = wb['Sheet1']
# do stuff
for x in range(pdfCounter):
C = UPPER[2]
D = UPPER[3]
F = UPPER[5]
first_cell = f'{C}{3 + (20 * x)}'
second_cell = f'{C}{4 + (20 * x)}'
third_cell = f'{F}{3 + (20 * x)}'
fourth_cell = f'{C}{13 + (20 * x)}'
fifth_cell = f'{C}{15 + (20 * x)}'
sixth_cell = f'{C}{17 + (20 * x)}'
ws[first_cell] = value
ws[second_cell] = value2
ws[third_cell] = value3
ws[fourth_cell] = value4
ws[fifth_cell] = value5
ws[sixth_cell] = value6
You can use f-strings to make the excel cell identifier easily: make a variable for the column letter component, a variable for the row number component, then combine in the f-string as shown below. For the row number component, you can add the product of the for-loop index and 20 to the starting row number to achieve a +20 increment per iteration.
In order for this to work, you would need to do the value, value2, etc. calculations in each iteration of the for-loop. Something like this:
def get_values():
return value, value2, value3, value4, value5, value6
value, value2, value3, value4, value5, value6 = get_values()
Code output with pdfCounter = 5 for the target cells is:
C3 C4 F3 C13 C15 C17
C23 C24 F23 C33 C35 C37
C43 C44 F43 C53 C55 C57
C63 C64 F63 C73 C75 C77
C83 C84 F83 C93 C95 C97