I have a very large pandas dataframe that contains two columns, column A and column B. For each value in column A, I would like to find the largest value in column B that is less than the corresponding value in column A. Note that each value in column B can be mapped to many values in column A.
Here’s an example with a smaller dataset. Let’s say I have the following dataframe:
df = pd.DataFrame({'a' : [1, 5, 7, 2, 3, 4], 'b' : [5, 2, 7, 5, 1, 9]})
I would like to find some third column — say c — so that
c = [nil, 2, 5, 1, 2, 2].
Note that each entry in c
is strictly less than the corresponding value in c
.
Upon researching, I think that I want something similar to pandas.merge_asof
, but I can’t quite get the query correct. In particular, I’m struggling because I only have one dataframe and not two. Perhaps I can form a second dataframe from my current one to get what I need, but I can’t quite get it right. Any help is appreciated.
Advertisement
Answer
Yes, it is doable using pandas.merge_asof
. Explanation as comments in the code –
import pandas as pd df = pd.DataFrame({'a' : [1, 5, 7, 2, 3, 4], 'b' : [5, 2, 7, 5, 1, 9]}) # merge_asof requires the keys to be sorted adf = df[['a']].sort_values(by='a') bdf = df[['b']].sort_values(by='b') # your example wants 'strictly less' so we also add 'allow_exact_matches=False' cdf_ordered = pd.merge_asof(adf, bdf, left_on='a', right_on='b', allow_exact_matches=False, direction='backward') # rename the dataframe |a|b| -> |a|c| cdf_ordered = cdf_ordered.rename(columns={'b': 'c'}) # since c is based on sorted a, we merge with original dataframe column a new_df = pd.merge(df, cdf_ordered, on='a') print(new_df) """ a b c 0 1 5 NaN 1 5 2 2.0 2 7 7 5.0 3 2 5 1.0 4 3 1 2.0 5 4 9 2.0 """