I'm trying to build one large DataFrame from a bunch of smaller DataFrames. I've looked at multiple sites, and they all say to create an empty DataFrame and use the pd.concat() method. I did that, but when I print inside my for loop the data still comes out sectioned by individual DataFrame, and when I print my (now supposedly filled) DataFrame I get an empty DF.
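The pattern those sites show is roughly this (a simplified sketch, not my real code; small_dfs just stands in for a list of DataFrames):

import pandas as pd

small_dfs = [pd.DataFrame({'x': [1]}), pd.DataFrame({'x': [2]})]  # stand-in for my tables

all_data = pd.DataFrame()
for df in small_dfs:
    # concat returns a new DataFrame, so the result is assigned back to the accumulator
    all_data = pd.concat([all_data, df], ignore_index=True)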
Note: All tables have the same structure
My actual code:

import camelot
import pandas as pd

# GETS 73 tables
pdf_data = camelot.read_pdf('GMCC services 8517404.pdf', flavor='lattice', pages='1', encoding='utf-8')

all_data = pd.DataFrame()

for info in pdf_data:
    # GET TABLE AS DATAFRAME
    df = info.df

    # concatenate all tables in one dataframe
    # all_data = pd.concat([all_data, df.reset_index(drop=True)])
    master = pd.concat([all_data, df.reset_index(drop=True)])

    # CHANGE COLUMN NAMES FOR REAL DATA
    master = master.rename({0: 'product_no', 1: 'description', 2: 'quantity', 3: 'net_price', 4: 'total'}, axis='columns')

    # GET SITE CODE
    site = master.iloc[-1, 0]

    # DROP FIRST ROW OF EVERY TABLE
    master = master.reindex(master.index.drop(0)).reset_index(drop=True)

    # CREATE COLUMN SITE WITH SITE CODE
    master['site'] = site

    # DROP LAST ROW OF EVERY TABLE
    master.drop(index=master.index[-1], axis=0, inplace=True)

    # concatenate all tables in one dataframe
    # all_data = pd.concat([all_data, df.reset_index(drop=True)])

    print(master)

# services = all_data.groupby('description')
# unique_services = services.apply(lambda x: x['net_price'].unique())
# print(unique_services)
What I'm currently getting:
  product_no | description | quantity | net_price | site |
0 11223      | Gen desc. 1 | 5        | 19        | USCA |
1 22334      | Gen desc. 2 | 11       | 15        | USCA |

  product_no | description | quantity | net_price | site |
0 55667      | Gen desc. 5 | 10       | 15        | USNY |
1 22334      | Gen desc. 2 | 3        | 11        | USNY |
What I want:
  product_no | description | quantity | net_price | site |
0 11223      | Gen desc. 1 | 5        | 19        | USCA |
1 22334      | Gen desc. 2 | 11       | 15        | USCA |
2 55667      | Gen desc. 5 | 10       | 15        | USNY |
3 22334      | Gen desc. 2 | 3        | 11        | USNY |
Answer
Try:
all_data = pd.concat([t.df for t in pdf_data], ignore_index=True)
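The ignore_index=True is what replaces the repeated 0, 1 indices of the individual tables with the continuous 0..3 index shown in your desired output. A tiny standalone illustration (toy frames, not the PDF data):

import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

print(pd.concat([a, b]))                     # keeps each frame's own index: 0, 1, 0, 1
print(pd.concat([a, b], ignore_index=True))  # renumbers the rows: 0, 1, 2, 3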
In your loop, all_data is never reassigned (pd.concat returns a new DataFrame; it does not modify its inputs), so each master only contains the current table and all_data stays empty. If you want to loop and modify your data, collect the per-table frames in a list and call concat once after the loop:
list_of_df = list()

for table in pdf_data:
    # rename the numeric columns to the real column names
    df = table.df.rename({0: 'product_no', 1: 'description', 2: 'quantity', 3: 'net_price', 4: 'total'}, axis=1)
    # the site code sits in the first cell of the last row
    df["site"] = df.iat[-1, 0]
    # drop the header row and the site-code row
    df = df[1:-1]
    list_of_df.append(df)

master = pd.concat(list_of_df, ignore_index=True)
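With everything in one master you can then run the aggregation that is commented out at the end of your code, for example:

# unique net prices per service description
services = master.groupby('description')
unique_services = services.apply(lambda x: x['net_price'].unique())
print(unique_services)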