I have a dataset where each row contains information that needs to be separated and printed in different rows, but I need to keep the name of the company on each newly printed row:
example dataset These are the headers:
JavaScript
x
2
1
company | marketing_budget | marketing_remaining | finance_budget | finance_remaining | sales_budget | sales_remaining
2
These are 2 rows of data:
JavaScript
1
3
1
Law Office | 450,000 | 150,000 | 300,000 | 100,000 | 200,000 | 50,000
2
Restaurant | 30,000 | 7,000 | null | null | 25,000 | 10,000
3
I need to separate one line into as many as I need. Some companies might have a marketing budget but don’t have a finance budget or any other possible combination… So the output should look like this (also I need to add the department, which is not included as a column, it is only the title of the column where the info is taken)
JavaScript
1
7
1
Company | Department | Budget | Amount Remaining
2
Law Office | Marketing | 450,000 | 150,000
3
Law Office | Finace | 300,000 | 100,000
4
Law Office | Sales | 200,00 | 50,000
5
Restaurant | Marketing | 30,000 | 7,000
6
Restaurant | Sales | 25,000 | 10,000
7
Advertisement
Answer
Given a text file that looks like:
JavaScript
1
3
1
Law Office | 450,000 | 150,000 | 300,000 | 100,000 | 200,000 | 50,000
2
Restaurant | 30,000 | 7,000 | null | null | 25,000 | 10,000
3
We can do:
JavaScript
1
10
10
1
df = pd.read_csv('file.txt', sep=' | ', engine='python')
2
3
# Reverse the column names on '_'.
4
df.columns = ['_'.join(reversed(x.split('_'))) for x in df.columns]
5
6
# Use pd.wide_to_long
7
df = pd.wide_to_long(df, ['budget', 'remaining'], i='company', j='department', sep='_', suffix=r'w+').sort_index()
8
df = df.reset_index().dropna()
9
print(df)
10
Output:
JavaScript
1
7
1
company department budget remaining
2
0 Law Office finance 300,000 100,000
3
1 Law Office marketing 450,000 150,000
4
2 Law Office sales 200,000 50,000
5
4 Restaurant marketing 30,000 7,000
6
5 Restaurant sales 25,000 10,000
7
Testing, and how I’d make the values numeric for future calculations:
JavaScript
1
27
27
1
import pandas as pd
2
from io import StringIO
3
4
d='''company | marketing_budget | marketing_remaining | finance_budget | finance_remaining | sales_budget | sales_remaining
5
Law Office | 450,000 | 150,000 | 300,000 | 100,000 | 200,000 | 50,000
6
Restaurant | 30,000 | 7,000 | null | null | 25,000 | 10,000'''
7
8
df = pd.read_csv(StringIO(d), sep=' | ', engine='python')
9
df = df.fillna('').applymap(lambda x: x.replace(',', ''))
10
for col in df.columns:
11
df[col] = pd.to_numeric(df[col], errors='ignore')
12
13
df.columns = ['_'.join(reversed(x.split('_'))) for x in df.columns]
14
15
df = pd.wide_to_long(df, ['budget', 'remaining'], i='company', j='department', sep='_', suffix=r'w+').sort_index()
16
df = df.reset_index().dropna()
17
print(df)
18
19
.
20
21
company department budget remaining
22
0 Law Office finance 300000.0 100000.0
23
1 Law Office marketing 450000.0 150000.0
24
2 Law Office sales 200000.0 50000.0
25
4 Restaurant marketing 30000.0 7000.0
26
5 Restaurant sales 25000.0 10000.0
27