I have a dataframe with three columns lets say
JavaScript
x
5
1
Name Address Date
2
faraz xyz 2022-01-01
3
Abdul abc 2022-06-06
4
Zara qrs 2021-02-25
5
I want to compare each date in Date column with all the other dates in the Date column and only keep those rows which lie within 6 months of atleast one of all the dates.
JavaScript
1
5
1
for example: (2022-01-01 - 2022-06-06) = 5 months so we keep both these dates
2
but,
3
(2022-06-06 - 2021-02-25) and (2022-01-01 - 2021-02-25) exceed the 6 month limit
4
so we will drop that row.
5
Desired Output:
JavaScript
1
4
1
Name Address Date
2
faraz xyz 2022-01-01
3
Abdul abc 2022-06-06
4
I have tried a couple of approches such a nested loops, but I got 1 million+ entries and it takes forever to run that loop. Some of the dates repeat too. Not all are unique.
JavaScript
1
13
13
1
for index, row in dupes_df.iterrows():
2
3
for date in uniq_dates_list:
4
5
format_date = datetime.strptime(date,'%d/%m/%y')
6
7
if (( format_date.year - row['JournalDate'].year ) * 12 + ( format_date.month - row['JournalDate'].month ) <= 6):
8
9
print("here here")
10
break
11
else:
12
dupes_df.drop(index, inplace=True)
13
I need a much more omptimal solution for it. Studied about lamba functions, but couldn’t get to the depths of it.
Advertisement
Answer
IIUC, this should work for you:
JavaScript
1
15
15
1
import pandas as pd
2
import itertools
3
from io import StringIO
4
5
data = StringIO("""Name;Address;Date
6
faraz;xyz;2022-01-01
7
Abdul;abc;2022-06-06
8
Zara;qrs;2021-02-25
9
""")
10
df = pd.read_csv(data, sep=';', parse_dates=['Date'])
11
12
df_date = pd.DataFrame([sorted(l, reverse=True) for l in itertools.combinations(df['Date'], 2)], columns=['Date1', 'Date2'])
13
df_date['diff'] = (df_date['Date1'] - df_date['Date2']).dt.days
14
df[df.Date.isin(df_date[df_date['diff'] <= 180].iloc[:, :-1].T[0])]
15
Output:
JavaScript
1
4
1
Name Address Date
2
0 faraz xyz 2022-01-01
3
1 Abdul abc 2022-06-06
4