I’m trying to make a large DataFrame from a bunch of smaller DF. so I’ve look at multiple sites a nd they all mention to use the pd.concat()
method and create an empty DataFrame. I did, however when I print inside my for loop I still get data as if it was still sectioned by individual DataFrame, and when i print my (now supposedly filled DataFrame ) I get an empty DF.
Note: All tables have the same structure
JavaScript
x
46
46
1
import camelot
2
import pandas as pd
3
4
5
# GETS 73 tables
6
pdf_data = camelot.read_pdf('GMCC services 8517404.pdf', flavor='lattice', pages='1', encoding='utf-8' )
7
all_data = pd.DataFrame()
8
9
10
for info in pdf_data:
11
12
df= info.df
13
# GETTABLE IN DATAFRAME
14
# df = info.df
15
16
# concatenate alltables in one dataframe
17
# all_data = pd.concat([all_data, df.reset_index(drop=True)])
18
master = pd.concat([all_data, df.reset_index(drop=True)])
19
20
# CHANGE COLUMN NAMES FOR REAL DATA
21
master = master.rename({0:'product_no',1:'description', 2:'quantity', 3:'net_price', 4:'total'}, axis='columns')
22
23
# GET SITECODE
24
site = master.iloc[-1, 0]
25
26
# GET FIRST ROW && DROP FIRST ROW OF EVERY TABLE
27
master = master.reindex(master.index.drop(0)).reset_index(drop=True)
28
29
# # CREATE COLUMN SITE WITH SITE CODE
30
master['site'] = site
31
32
# Drop LAst ROW on EVERY TABLE
33
master.drop(index=master.index[-1],
34
axis=0,
35
inplace=True)
36
37
# # concatenate alltables in one dataframe
38
# all_data = pd.concat([all_data, df.reset_index(drop=True)])
39
40
print(master)
41
# services = all_data.groupby('description')
42
# unique_sevices = services.apply(lambda x: x['net_price'].unique())
43
44
45
# print(unique_sevices)
46
what I’m currently getting:
JavaScript
1
8
1
product_no | description | quantity | net_price | site |
2
0 11223 | Gen desc. 1 | 5 | 19 | USCA |
3
1 22334 | Gen desc. 2 | 11 | 15 | USCA |
4
5
product_no | description | quantity | net_price | site |
6
0 55667 | Gen desc. 5 | 10 | 15 | USNY |
7
1 22334 | Gen desc. 2 | 3 | 11 | USNY |
8
What I want:
JavaScript
1
6
1
product_no | description | quantity | net_price | site |
2
0 11223 | Gen desc. 1 | 5 | 19 | USCA |
3
1 22334 | Gen desc. 2 | 11 | 15 | USCA |
4
2 55667 | Gen desc. 5 | 10 | 15 | USNY |
5
3 22334 | Gen desc. 2 | 3 | 11 | USNY |
6
Advertisement
Answer
Try:
JavaScript
1
2
1
all_data = pd.concat([t.df for t in pdf_data], ignore_index=True)
2
If you want to loop and modify your data, you can collect them all in a list and use concat
after the loop:
JavaScript
1
9
1
list_of_df = list()
2
for table in pdf_data:
3
df = table.df.rename({0:'product_no',1:'description', 2:'quantity', 3:'net_price', 4:'total'}, axis=1)
4
df["site"] = df.iat[-1, 0]
5
df = df[1:-1]
6
list_of_df.append(df)
7
8
master = pd.concat(list_of_df, ignore_index=True)
9