I have a folder full of PDFs which I have parsed using Apache Tika, and I have a template excel file which I use to gather specific information from those PDFs and store using openpyxl.
The issue I am having is looping through using openpyxl rows.
For example, if there is just one PDF in folder, the values go in:
#C3, C4, F3, C13, C15, C17
but if there are more than one PDF’s, the index is just incremented by 20 for each additional PDF and stored in the same excel file so with 2 PDFs it will store the info in: C23, C24, F24 and so on.
import glob, os, fnmatch from tika import parser import openpyxl input_path = "../Input" template = "../Project/template.xlsm" output_dir = "../Output" pdfCounter = len(fnmatch.filter(os.listdir(input_path),'*.pdf')) print(pdfCounter) def extract_text(file): parsed = parser.from_file(file) parsed_text = parsed['content'] # print(parsed_text) return parsed_text wb = openpyxl.load_workbook(template, read_only=False, keep_vba=True) sheet = wb.active input_file = glob.glob(os.path.join(input_path, '*.pdf')) for file in input_file: do_stuff(): sheet['C3'] = value sheet['C4'] = value2 sheet['F3'] = value3 sheet['C13'] = value4 sheet['C15'] = value5 sheet['C17'] = value6 output_file = os.path.join(output_dir, "report.xlsm") wb.save(output_file)
I have a pdfCounter that counts the # of PDF’s in the folder and I am trying to figure out a way to increment the index based on that or if there is a better way to do this.
I just don’t understand how to loop based on # of pdf’s in file and increment the index by 20 so it doesn’t overwrite the same stuff as it is doing right now.
Advertisement
Answer
Edit: I can’t test this, but maybe it will work. Instead of looping through a range equal to pdfCounter, I’m looping over the files in input_file so I have access to the current file in the current iteration. The ‘value’ variables should be updated with info based on the current iteration’s file and then written to worksheet.
input_file = glob.glob(os.path.join(input_path, '*.pdf')) def get_values(file): do_stuff() return value, value2, value3, value4, value5, value6 for x,file in enumerate(input_file): C = UPPER[2] D = UPPER[3] F = UPPER[5] first_cell = f'{C}{3 + (20 * x)}' second_cell = f'{C}{4 + (20 * x)}' third_cell = f'{F}{3 + (20 * x)}' fourth_cell = f'{C}{13 + (20 * x)}' fifth_cell = f'{C}{15 + (20 * x)}' sixth_cell = f'{C}{17 + (20 * x)}' value, value2, value3, value4, value5, value6 = get_values(file) ws[first_cell] = value ws[second_cell] = value2 ws[third_cell] = value3 ws[fourth_cell] = value4 ws[fifth_cell] = value5 ws[sixth_cell] = value6 output_file = os.path.join(output_dir, "report.xlsm") wb.save(output_file)
Original:
from openpyxl import load_workbook import openpyxl from string import ascii_uppercase as UPPER # file paths # results from pdfCounter pdfCounter = 5 # extract your text # load your workbook wb = load_workbook() ws = wb['Sheet1'] # do stuff for x in range(pdfCounter): C = UPPER[2] D = UPPER[3] F = UPPER[5] first_cell = f'{C}{3 + (20 * x)}' second_cell = f'{C}{4 + (20 * x)}' third_cell = f'{F}{3 + (20 * x)}' fourth_cell = f'{C}{13 + (20 * x)}' fifth_cell = f'{C}{15 + (20 * x)}' sixth_cell = f'{C}{17 + (20 * x)}' ws[first_cell] = value ws[second_cell] = value2 ws[third_cell] = value3 ws[fourth_cell] = value4 ws[fifth_cell] = value5 ws[sixth_cell] = value6
You can use f-strings to make the excel cell identifier easily: make a variable for the column letter component, a variable for the row number component, then combine in the f-string as shown below. For the row number component, you can add the product of the for-loop index and 20 to the starting row number to achieve a +20 increment per iteration.
In order for this to work, you would need to do the value, value2, etc. calculations in each iteration of the for-loop. Something like this:
def get_values(): return value, value2, value3, value4, value5, value6 value, value2, value3, value4, value5, value6 = get_values()
Code output with pdfCounter = 5 for the target cells is:
C3 C4 F3 C13 C15 C17 C23 C24 F23 C33 C35 C37 C43 C44 F43 C53 C55 C57 C63 C64 F63 C73 C75 C77 C83 C84 F83 C93 C95 C97