Skip to content
Advertisement

Extract strings from a Dataframe looping over a single row

I’m reading multiple PDFs (using tabula) into data frames like this:

nan = float('nan')
DataFrame_as_dict = {'Unnamed: 0': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan},
 'Unnamed: 1': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan},
 'NOTA DE NEGOCIAÇÃO': {0: nan,
  1: nan,
  2: 'Rico Investimentos - Grupo XP',
  3: 'Av. Presidente Juscelino Kubitschek - Torre Sul, 1909 - 25o ANDAR VILA OLIMPIA 4543-907',
  4: 'Tel. 3003-5465Fax: (55 11) 4007-2465',
  5: 'Internet: www.rico.com.vc SAC: 0800-774-0402 e-mail: atendimento@rico.com.vc'},
 'Unnamed: 3': {0: 'Nr. nota Folha',
  1: '330736 1',
  2: nan,
  3: 'SÃO PAULO - SP',
  4: nan,
  5: nan},
 'Unnamed: 4': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan},
 'Unnamed: 5': {0: 'Data pregão',
  1: '30/09/2015',
  2: nan,
  3: nan,
  4: nan,
  5: nan}}

df = pd.DataFrame(DataFrame_as_dict)

dataframe figure

My intention is to use that value ‘330736 1′ into the variable “number” and ’30/09/2015’ into a variable “date”.

The issue is that, although these values will always be located in row 1, the columns vary in an unpredictable way across the multiple PDFs.

Therefore, I tried to loop over the different columns of row 1, in order to extract these data regardless the columns they are:

list_columns = df.columns
for i in range(len(list_columns)):
    if isinstance(df.iloc[1:2,i], str):
        if df.iloc[1:2,i].str.contains("/",na=False,regex=False).any():
            date = str(df.iloc[1:2,i]).strip()
        else:
            n_nota = str(df.iloc[1:2,i]).strip()

However, without success… Any thoughts?

Advertisement

Answer

In your original code, if isinstance(df.iloc[1:2,i], str) will never evaluate to True for two reasons:

  1. Strings inside DataFrames are of type object
  2. df.iloc[1:2,i] will always be a pandas Series.

Since object is such a flexible type, it’s not as useful as str for identifying the data you want. In the code below, I simply used a space character to differentiate the data you want for n_nota. If this doesn’t work with your data, a regex pattern may be a good approach.

list_columns = df.columns
for i in range(len(list_columns)):
    if isinstance(df.iloc[1:2,i].values, object):
        (df.iloc[1:2,i].values)
        if "/" in str(df.iloc[1:2,i].values):
            date = str(df.iloc[1:2,i].values[0]).strip()
        elif " " in str(df.iloc[1:2,i].values):
            n_nota = str(df.iloc[1:2,i].values[0]).strip()

Edit: As noted below, the original code in the question text would have worked if each df.iloc[1:2,i] were replaced with df.iloc[1,i] as in:

list_columns = df.columns
for i in range(len(list_columns)):
    if isinstance(df.iloc[1,i], str):
        if df.iloc[1:2,i].str.contains("/",na=False,regex=False).any():
            date = str(df.iloc[1,i]).strip()
        else:
            n_nota = str(df.iloc[1,i]).strip()  
User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement