Extract strings from a Dataframe looping over a single row

I’m reading multiple PDFs (using tabula) into data frames like this:

dataframe figure

My intention is to store the value ‘330736 1’ in a variable “number” and ‘30/09/2015’ in a variable “date”.

The issue is that, although these values will always be located in row 1, the columns vary in an unpredictable way across the multiple PDFs.

Therefore, I tried to loop over the columns of row 1 in order to extract these values regardless of which columns they are in:

However, without success… Any thoughts?

Answer

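In your original code, if isinstance(df.iloc[1:2,i], str) will never evaluate to True. Two things are at play:

  1. Columns that hold strings usually have dtype object, so the column dtype alone cannot tell you which cells are strings.
  2. df.iloc[1:2,i] selects the row with a slice, so it always returns a pandas Series (here with a single element), never a str.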
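
The difference between slicing and positional indexing can be checked directly. This is a minimal sketch on a hypothetical two-row frame (the column names and values are made up to resemble the question, not taken from the real PDFs):

```python
import pandas as pd

# Hypothetical sample frame; real tabula output will look different.
df = pd.DataFrame({
    "a": ["header", "330736 1"],
    "b": ["header", "30/09/2015"],
})

sliced = df.iloc[1:2, 0]   # row given as a slice -> one-element Series
scalar = df.iloc[1, 0]     # row given as a position -> the cell value itself

print(type(sliced).__name__)     # Series
print(isinstance(sliced, str))   # False
print(isinstance(scalar, str))   # True
```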

Since object is such a flexible type, it’s not as useful as str for identifying the data you want. In the code below, I simply used a space character to differentiate the data you want for n_nota. If this doesn’t work with your data, a regex pattern may be a good approach.
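
If the space test turns out to be too loose, the regex idea might look roughly like the sketch below. The two patterns, the helper name extract_number_and_date, and the sample frame are all assumptions based on the two example values in the question; they would need adjusting to the real data.

```python
import re
import pandas as pd

# Assumed formats, inferred only from the question's two example values.
DATE_RE = re.compile(r"\d{2}/\d{2}/\d{4}")   # e.g. 30/09/2015
NUMBER_RE = re.compile(r"\d+ \d+")           # digits, one space, digits

def extract_number_and_date(row):
    """Scan the cells of one row; return (number, date), None where not found."""
    number = date = None
    for cell in row:
        if not isinstance(cell, str):
            continue  # skip NaN/None and numeric cells
        text = cell.strip()
        if date is None and DATE_RE.fullmatch(text):
            date = text
        elif number is None and NUMBER_RE.fullmatch(text):
            number = text
    return number, date

# Hypothetical frame: the wanted values sit in row 1 at arbitrary columns.
df = pd.DataFrame([
    ["x", None, "y", "z"],
    [None, "330736 1", None, "30/09/2015"],
])

number, date = extract_number_and_date(df.iloc[1])
print(number, date)
```

Matching on the full cell text keeps a stray cell that merely contains a space from being picked up as the number.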

Edit: As noted below, the original code in the question text would have worked if each df.iloc[1:2,i] were replaced with df.iloc[1,i] as in:
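
A minimal sketch of such a loop, assuming a hypothetical sample frame. The variable names number and date follow the question's description, and using "/" to recognise the date is an assumption added for illustration:

```python
import pandas as pd

# Hypothetical frame: the wanted values sit in row 1 at unpredictable columns.
df = pd.DataFrame([
    ["x", None, "y", "z"],
    [None, "330736 1", None, "30/09/2015"],
])

number = date = None
for i in range(df.shape[1]):
    value = df.iloc[1, i]        # a single cell, not a Series
    if isinstance(value, str):   # skips missing and numeric cells
        if "/" in value:
            date = value
        elif " " in value:
            number = value

print(number)  # 330736 1
print(date)    # 30/09/2015
```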

User contributions licensed under: CC BY-SA