Pandas group by unique ID and Distinct date per unique ID

Tags: ,



Title may be confusing: I have a dataframe that displays user_id sign in’s during the week. My goal is to display the de-duped ID along with the de-duped dates per employee, in order to get a count of # days the user uniquely signed in for the week. So I’ve been trying to enforce a rule to make sure I’m only getting the distinct sign ins per day, per employee.

print(tempdf)
ID      date       
10301  1/4/2021   
10301  1/4/2021   
10301  1/5/2021    
10301  1/6/2021   
10301  1/7/2021    
10301  1/8/2021    
10302  1/4/2021    
10302  1/5/2021    
10302  1/6/2021    
10302  1/7/2021    
10302  1/8/2021    

The only thing I can think of is trying to pass some form of .nunique(), but this instance would apply that count to my entire dataframe, as well as filter out all of my ID’s before I can de-dupe them.

tempdf1=tempdf.groupby(['date'], as_index = False).nunique()
tempdf1['# days signed in for week'] = tempdf1.groupby('ID')['ID'].transform('count')
df=tempdf1.drop_duplicates(['ID']).copy()

This is the result I’m going for:

print(df)
ID      date    # days signed in for week     
10301  1/4/2021    5
10302  1/4/2021    5

Answer

  • calculate start of week
  • then it’s a simple use of count()
df = pd.read_csv(io.StringIO("""ID      date    # days signed in for week     
10301  1/4/2021    6
10301  1/4/2021    6
10301  1/5/2021    6
10301  1/6/2021    6
10301  1/7/2021    6
10301  1/8/2021    6
10302  1/4/2021    5
10302  1/5/2021    5
10302  1/6/2021    5
10302  1/7/2021    5
10302  1/8/2021    5"""), sep="ss+", engine="python")

df.date = pd.to_datetime(df.date)
df["weekStart"] = df['date'] - pd.to_timedelta(df['date'].dt.dayofweek, unit='d')
df.groupby(["ID","weekStart"])["date"].count().reset_index()
.rename(columns={"weekStart":"date","date":"# days signed in for week"})


Source: stackoverflow