I have a dataframe with three columns lets say
Name Address Date faraz xyz 2022-01-01 Abdul abc 2022-06-06 Zara qrs 2021-02-25
I want to compare each date in Date column with all the other dates in the Date column and only keep those rows which lie within 6 months of atleast one of all the dates.
for example: (2022-01-01 - 2022-06-06) = 5 months so we keep both these dates but, (2022-06-06 - 2021-02-25) and (2022-01-01 - 2021-02-25) exceed the 6 month limit so we will drop that row.
Desired Output:
Name Address Date faraz xyz 2022-01-01 Abdul abc 2022-06-06
I have tried a couple of approches such a nested loops, but I got 1 million+ entries and it takes forever to run that loop. Some of the dates repeat too. Not all are unique.
for index, row in dupes_df.iterrows(): for date in uniq_dates_list: format_date = datetime.strptime(date,'%d/%m/%y') if (( format_date.year - row['JournalDate'].year ) * 12 + ( format_date.month - row['JournalDate'].month ) <= 6): print("here here") break else: dupes_df.drop(index, inplace=True)
I need a much more omptimal solution for it. Studied about lamba functions, but couldn’t get to the depths of it.
Advertisement
Answer
IIUC, this should work for you:
import pandas as pd import itertools from io import StringIO data = StringIO("""Name;Address;Date faraz;xyz;2022-01-01 Abdul;abc;2022-06-06 Zara;qrs;2021-02-25 """) df = pd.read_csv(data, sep=';', parse_dates=['Date']) df_date = pd.DataFrame([sorted(l, reverse=True) for l in itertools.combinations(df['Date'], 2)], columns=['Date1', 'Date2']) df_date['diff'] = (df_date['Date1'] - df_date['Date2']).dt.days df[df.Date.isin(df_date[df_date['diff'] <= 180].iloc[:, :-1].T[0])]
Output:
Name Address Date 0 faraz xyz 2022-01-01 1 Abdul abc 2022-06-06