How to compare each date in a cell with all the dates in a column

I have a dataframe with three columns lets say

Name  Address Date
faraz  xyz    2022-01-01
Abdul  abc    2022-06-06
Zara   qrs    2021-02-25

JavaScript
​x
 
Name  Address Date
faraz  xyz    2022-01-01
Abdul  abc    2022-06-06
Zara   qrs    2021-02-25
​

I want to compare each date in Date column with all the other dates in the Date column and only keep those rows which lie within 6 months of atleast one of all the dates.

for example: (2022-01-01 - 2022-06-06) = 5 months so we keep both these dates
but, 
(2022-06-06 - 2021-02-25) and (2022-01-01 - 2021-02-25) exceed the 6 month limit 
so we will drop that row.

JavaScript
 
for example: (2022-01-01 - 2022-06-06) = 5 months so we keep both these dates
but, 
(2022-06-06 - 2021-02-25) and (2022-01-01 - 2021-02-25) exceed the 6 month limit 
so we will drop that row.
​

Desired Output:

Name  Address Date
faraz  xyz    2022-01-01
Abdul  abc    2022-06-06

JavaScript
 
Name  Address Date
faraz  xyz    2022-01-01
Abdul  abc    2022-06-06
​

I have tried a couple of approches such a nested loops, but I got 1 million+ entries and it takes forever to run that loop. Some of the dates repeat too. Not all are unique.

for index, row in dupes_df.iterrows():

    for date in uniq_dates_list:

        format_date = datetime.strptime(date,'%d/%m/%y')

        if (( format_date.year - row['JournalDate'].year ) * 12 + ( format_date.month - row['JournalDate'].month ) <= 6):

            print("here here")
            break
        else:
            dupes_df.drop(index, inplace=True)

JavaScript
 
for index, row in dupes_df.iterrows():
​
    for date in uniq_dates_list:
​
        format_date = datetime.strptime(date,'%d/%m/%y')
​
        if (( format_date.year - row['JournalDate'].year ) * 12 + ( format_date.month - row['JournalDate'].month ) <= 6):
​
            print("here here")
            break
        else:
            dupes_df.drop(index, inplace=True)
​

I need a much more omptimal solution for it. Studied about lamba functions, but couldn’t get to the depths of it.

Answer

IIUC, this should work for you:

import pandas as pd
import itertools
from io import StringIO

data = StringIO("""Name;Address;Date
faraz;xyz;2022-01-01
Abdul;abc;2022-06-06
Zara;qrs;2021-02-25
""")
df = pd.read_csv(data, sep=';', parse_dates=['Date'])

df_date = pd.DataFrame([sorted(l, reverse=True) for l in itertools.combinations(df['Date'], 2)], columns=['Date1', 'Date2'])
df_date['diff'] = (df_date['Date1'] - df_date['Date2']).dt.days
df[df.Date.isin(df_date[df_date['diff'] <= 180].iloc[:, :-1].T[0])]

JavaScript
 
import pandas as pd
import itertools
from io import StringIO
​
data = StringIO("""Name;Address;Date
faraz;xyz;2022-01-01
Abdul;abc;2022-06-06
Zara;qrs;2021-02-25
""")
df = pd.read_csv(data, sep=';', parse_dates=['Date'])
​
df_date = pd.DataFrame([sorted(l, reverse=True) for l in itertools.combinations(df['Date'], 2)], columns=['Date1', 'Date2'])
df_date['diff'] = (df_date['Date1'] - df_date['Date2']).dt.days
df[df.Date.isin(df_date[df_date['diff'] <= 180].iloc[:, :-1].T[0])]
​

Output:

    Name Address       Date
0  faraz     xyz 2022-01-01
1  Abdul     abc 2022-06-06

JavaScript
 
    Name Address       Date
0  faraz     xyz 2022-01-01
1  Abdul     abc 2022-06-06
​

Advertisement

Answer