I have two excel files, I want to perform vlookup and find difference of costs using python or even excel.
My files look like this
source_data.xlsx contains contains distance covered and their price, example distance range from 1 to 100 should be charged 4800 and distance range from 101 to 120 should be charged 5100.
DISTANCE COST 1-100 4800 101-120 5100 121-140 5500 141-160 5900 161-180 6200 181-200 6600 210-220 6900 221-240 7200
Analysis.xlsx
loading_station distance_travel total_cost status PUGU 40 4000 PAID PUGU 80 3200 PAID MOROGORO 50 5000 PAID MOROGORO 220 30400 PAID DODOMA 150 5100 PAID KIGOMA 90 2345 PAID DODOMA 230 6000 PAID DODOMA 180 16500 PAID KIGOMA 32 3000 PAID DODOMA 45 6000 PAID DODOMA 65 5000 PAID KIGOMA 77 1000 PAID KIGOMA 90 4000 PAID
Actual Cost for distance is given in source_data.xlsx
, I want to check cost in Analysis.xlsx
if it correspond to Actual value, I want to detect underpayment and overpayment.
Desired Output should be like this, with two column added, source_cost
which is taken from source_xlsx
by using vlookup
and difference which is difference between total_cost
and source_cost
loading_station distance_travel total_cost status source_cost Difference PUGU 40 4000 PAID 4800 -800 PUGU 80 3200 PAID 4800 -1600 MOROGORO 50 5000 PAID 4800 200 MOROGORO 220 30400 PAID 6900 23500 DODOMA 150 5100 PAID 5900 -800 KIGOMA 90 2345 PAID 4800 -2455 DODOMA 230 6000 PAID 7200 -1200 DODOMA 180 16500 PAID 6200 10300 KIGOMA 32 3000 PAID 4800 -1800 DODOMA 45 6000 PAID 4800 1200 DODOMA 65 5000 PAID 4800 200 KIGOMA 77 1000 PAID 4800 -3800 KIGOMA 90 4000 PAID 4800 -800
My code so far
# import pandas import pandas as pd # read excel data source_data = pd.read_excel('source_data.xlsx') analysis_file = pd.read_excel('analysis.xlsx') source_data.head(5) analysis_file.head(5)
Advertisement
Answer
Since it is a categorical bins problem, I suggest utilizing cut()
and find the corresponding value.
import pandas as pd # create bins bh = df_source['DISTANCE'].apply(lambda x: x.split('-')).apply(pd.Series).astype(int).values[:,0] bt = df_source['DISTANCE'].apply(lambda x: x.split('-')).apply(pd.Series).astype(int).values[:,1] bins = pd.IntervalIndex.from_arrays(bh, bt, closed='both') print(bins) ### IntervalIndex([[1, 100], [101, 120], [121, 140], [141, 160], [161, 180], [181, 200], [210, 220], [221, 240]], dtype='interval[int64, both]')
As it shown, IntervalIndex
, dtype='interval[int64, both]'
# find corresponding values df_analysis['source_cost'] = pd.cut(df_analysis['distance_travel'], bins=bins).map(dict(zip(bins, df_source['COST']))).astype(int) # calculation df_analysis['Difference'] = df_analysis['total_cost'] - df_analysis['source_cost'] print(df_analysis) ###
loading_station | distance_travel | total_cost | status | source_cost | Difference |
---|---|---|---|---|---|
PUGU | 40 | 4000 | PAID | 4800 | -800 |
PUGU | 80 | 3200 | PAID | 4800 | -1600 |
MOROGORO | 50 | 5000 | PAID | 4800 | 200 |
MOROGORO | 220 | 30400 | PAID | 6900 | 23500 |
DODOMA | 150 | 5100 | PAID | 5900 | -800 |
KIGOMA | 90 | 2345 | PAID | 4800 | -2455 |
DODOMA | 230 | 6000 | PAID | 7200 | -1200 |
DODOMA | 180 | 16500 | PAID | 6200 | 10300 |
KIGOMA | 32 | 3000 | PAID | 4800 | -1800 |
DODOMA | 45 | 6000 | PAID | 4800 | 1200 |
DODOMA | 65 | 5000 | PAID | 4800 | 200 |
KIGOMA | 77 | 1000 | PAID | 4800 | -3800 |
KIGOMA | 90 | 4000 | PAID | 4800 | -800 |