Extract strings from a DataFrame by looping over a single row

Question

I'm reading multiple PDFs (using tabula) into data frames like this: [figure: dataframe]. I want to store the value '330736 1' in the variable "number" and '30/09/2015' in the variable "date". The issue is that, although these values are always located in row 1, their columns vary unpredictably across the PDFs. Therefore, I tried

Accepted Answer

In your original code, if isinstance(df.iloc[1:2,i], str) will never evaluate to True for two reasons: string columns in a DataFrame are stored with dtype object, and df.iloc[1:2,i] is always a pandas Series, never a plain str.

Since object is such a flexible type, it's not as useful as str for identifying the data you want. In the code below, I simply used a space character to differentiate the data you want for n_nota. If this doesn't work with your data, a regex pattern may be a good approach.

    list_columns = df.columns
    for i in range(len(list_columns)):
        # Always True: every Python value is an object, so this check filters nothing.
        if isinstance(df.iloc[1:2,i].values, object):
            # str() of the one-element array looks like "['30/09/2015']".
            if "/" in str(df.iloc[1:2,i].values):
                date = str(df.iloc[1:2,i].values[0]).strip()
            elif " " in str(df.iloc[1:2,i].values):
                n_nota = str(df.iloc[1:2,i].values[0]).strip()

Edit: As noted below, the original code in the question text would have worked if each df.iloc[1:2,i] in the isinstance check and the assignments were replaced with df.iloc[1,i], which returns the scalar cell value instead of a Series:

    list_columns = df.columns
    for i in range(len(list_columns)):
        # df.iloc[1,i] is a single cell, so isinstance can see a str here.
        if isinstance(df.iloc[1,i], str):
            # The .str accessor needs a Series, so the slice is kept on this line.
            if df.iloc[1:2,i].str.contains("/",na=False,regex=False).any():
                date = str(df.iloc[1,i]).strip()
            else:
                n_nota = str(df.iloc[1,i]).strip()
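The answer mentions a regex pattern as a possible alternative to the space and slash checks. The following is a minimal sketch of that idea, not the answerer's code. The sample DataFrame is a hypothetical stand-in for one tabula result, and the two patterns assume that the date is always dd/mm/yyyy and that the number is always digits, a single space, then digits. Both assumptions would need adjusting to the real PDFs.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical stand-in for one tabula result; the real column layout varies per PDF.
df = pd.DataFrame(
    [
        ["header a", "header b", "header c", "header d"],
        [np.nan, "330736 1", np.nan, " 30/09/2015 "],
    ]
)

# Assumed formats (not confirmed by the question):
# a date is dd/mm/yyyy, a number is digits, one space, digits.
DATE_RE = re.compile(r"\d{2}/\d{2}/\d{4}")
NUMBER_RE = re.compile(r"\d+ \d+")

number = None
date = None

# df.iloc[1] is row 1 as a Series; iterating over it yields one scalar per column.
for cell in df.iloc[1]:
    if not isinstance(cell, str):
        continue  # skip NaN and any other non-string cells
    text = cell.strip()
    if DATE_RE.fullmatch(text):
        date = text
    elif NUMBER_RE.fullmatch(text):
        number = text

print(number, date)
```

Matching the whole cell with fullmatch keeps a stray string that merely contains a slash or a space from being picked up, which the simpler substring checks cannot rule out.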

