Skip to content
Advertisement

How do I go about incrementally adding data using openpyxl?

I have a folder full of PDFs which I have parsed using Apache Tika, and I have a template excel file which I use to gather specific information from those PDFs and store using openpyxl.

The issue I am having is looping through using openpyxl rows.

For example, if there is just one PDF in folder, the values go in:

#C3, C4, F3, C13, C15, C17

but if there are more than one PDF’s, the index is just incremented by 20 for each additional PDF and stored in the same excel file so with 2 PDFs it will store the info in: C23, C24, F24 and so on.

import glob, os, fnmatch
from tika import parser
import openpyxl

input_path = "../Input"
template = "../Project/template.xlsm"
output_dir = "../Output"

pdfCounter = len(fnmatch.filter(os.listdir(input_path),'*.pdf'))
print(pdfCounter)

def extract_text(file):
    parsed = parser.from_file(file)
    parsed_text = parsed['content']
    # print(parsed_text)
    return parsed_text

wb = openpyxl.load_workbook(template, read_only=False, keep_vba=True)
sheet = wb.active

input_file = glob.glob(os.path.join(input_path, '*.pdf'))

for file in input_file:
    do_stuff():


sheet['C3'] = value
sheet['C4'] = value2
sheet['F3'] = value3
sheet['C13'] = value4
sheet['C15'] = value5
sheet['C17'] = value6


output_file = os.path.join(output_dir, "report.xlsm")
wb.save(output_file)

I have a pdfCounter that counts the # of PDF’s in the folder and I am trying to figure out a way to increment the index based on that or if there is a better way to do this.

I just don’t understand how to loop based on # of pdf’s in file and increment the index by 20 so it doesn’t overwrite the same stuff as it is doing right now.

Advertisement

Answer

Edit: I can’t test this, but maybe it will work. Instead of looping through a range equal to pdfCounter, I’m looping over the files in input_file so I have access to the current file in the current iteration. The ‘value’ variables should be updated with info based on the current iteration’s file and then written to worksheet.

input_file = glob.glob(os.path.join(input_path, '*.pdf'))

def get_values(file):
    do_stuff()
    return value, value2, value3, value4, value5, value6

for x,file in enumerate(input_file):
    C = UPPER[2]
    D = UPPER[3]
    F = UPPER[5]
    first_cell = f'{C}{3 + (20 * x)}'
    second_cell = f'{C}{4 + (20 * x)}'
    third_cell = f'{F}{3 + (20 * x)}'
    fourth_cell = f'{C}{13 + (20 * x)}'
    fifth_cell = f'{C}{15 + (20 * x)}'
    sixth_cell = f'{C}{17 + (20 * x)}'

    value, value2, value3, value4, value5, value6 = get_values(file)

    ws[first_cell] = value
    ws[second_cell] = value2
    ws[third_cell] = value3
    ws[fourth_cell] = value4
    ws[fifth_cell] = value5
    ws[sixth_cell] = value6


output_file = os.path.join(output_dir, "report.xlsm")
wb.save(output_file)

Original:

from openpyxl import load_workbook
import openpyxl
from string import ascii_uppercase as UPPER

# file paths

# results from pdfCounter
pdfCounter = 5

# extract your text

# load your workbook
wb = load_workbook()
ws = wb['Sheet1']

# do stuff

for x in range(pdfCounter):
    C = UPPER[2]
    D = UPPER[3]
    F = UPPER[5]
    first_cell = f'{C}{3 + (20 * x)}'
    second_cell = f'{C}{4 + (20 * x)}'
    third_cell = f'{F}{3 + (20 * x)}'
    fourth_cell = f'{C}{13 + (20 * x)}'
    fifth_cell = f'{C}{15 + (20 * x)}'
    sixth_cell = f'{C}{17 + (20 * x)}'

    ws[first_cell] = value
    ws[second_cell] = value2
    ws[third_cell] = value3
    ws[fourth_cell] = value4
    ws[fifth_cell] = value5
    ws[sixth_cell] = value6

You can use f-strings to make the excel cell identifier easily: make a variable for the column letter component, a variable for the row number component, then combine in the f-string as shown below. For the row number component, you can add the product of the for-loop index and 20 to the starting row number to achieve a +20 increment per iteration.

In order for this to work, you would need to do the value, value2, etc. calculations in each iteration of the for-loop. Something like this:

def get_values():
    return value, value2, value3, value4, value5, value6

value, value2, value3, value4, value5, value6 = get_values()

Code output with pdfCounter = 5 for the target cells is:

C3 C4 F3 C13 C15 C17
C23 C24 F23 C33 C35 C37
C43 C44 F43 C53 C55 C57
C63 C64 F63 C73 C75 C77
C83 C84 F83 C93 C95 C97
User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement