Skip to content
Advertisement

Concat multiple small DataFrames as one big DataFrame

I’m trying to make a large DataFrame from a bunch of smaller DF. so I’ve look at multiple sites a nd they all mention to use the pd.concat() method and create an empty DataFrame. I did, however when I print inside my for loop I still get data as if it was still sectioned by individual DataFrame, and when i print my (now supposedly filled DataFrame ) I get an empty DF.

Note: All tables have the same structure

import camelot
import pandas as pd


# GETS  73 tables 
pdf_data = camelot.read_pdf('GMCC services 8517404.pdf', flavor='lattice', pages='1', encoding='utf-8' )
all_data = pd.DataFrame()


for info  in pdf_data:

    df= info.df
    # GETTABLE IN  DATAFRAME
    # df = info.df

  # concatenate alltables in one  dataframe 
    # all_data = pd.concat([all_data, df.reset_index(drop=True)])
    master = pd.concat([all_data, df.reset_index(drop=True)])
       
    # CHANGE  COLUMN NAMES  FOR   REAL DATA 
    master = master.rename({0:'product_no',1:'description', 2:'quantity', 3:'net_price', 4:'total'}, axis='columns')

    # GET  SITECODE 
    site = master.iloc[-1, 0]

    # GET FIRST ROW && DROP FIRST ROW  OF EVERY TABLE
    master = master.reindex(master.index.drop(0)).reset_index(drop=True)

    # # CREATE  COLUMN  SITE  WITH  SITE CODE
    master['site'] = site

    # Drop LAst ROW on EVERY TABLE
    master.drop(index=master.index[-1], 
        axis=0, 
        inplace=True)

    #    # concatenate alltables in one  dataframe 
    #     all_data = pd.concat([all_data, df.reset_index(drop=True)]) 

    print(master)
# services = all_data.groupby('description')
# unique_sevices = services.apply(lambda x: x['net_price'].unique())


# print(unique_sevices) 

what I’m currently getting:

 product_no    |    description     |    quantity    |  net_price    |    site    | 
0 11223        |    Gen desc. 1     |        5       |      19       |    USCA    |
1 22334        |    Gen desc. 2     |        11      |      15       |    USCA    |

 product_no    |    description     |    quantity    |  net_price    |    site    |
0 55667        |    Gen desc. 5     |        10      |      15       |    USNY    |
1 22334        |    Gen desc. 2     |        3       |      11       |    USNY    |

What I want:

 product_no    |    description     |    quantity    |  net_price    |    site    | 
0 11223        |    Gen desc. 1     |        5       |      19       |    USCA    |
1 22334        |    Gen desc. 2     |        11      |      15       |    USCA    |
2 55667        |    Gen desc. 5     |        10      |      15       |    USNY    |
3 22334        |    Gen desc. 2     |        3       |      11       |    USNY    |

Advertisement

Answer

Try:

all_data = pd.concat([t.df for t in pdf_data], ignore_index=True)

If you want to loop and modify your data, you can collect them all in a list and use concat after the loop:

list_of_df = list()
for table in pdf_data:
    df = table.df.rename({0:'product_no',1:'description', 2:'quantity', 3:'net_price', 4:'total'}, axis=1)
    df["site"] = df.iat[-1, 0]
    df = df[1:-1]
    list_of_df.append(df)

master = pd.concat(list_of_df, ignore_index=True)
User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement