Pandas group by unique ID and Distinct date per unique ID

Title may be confusing: I have a dataframe that displays user_id sign in’s during the week. My goal is to display the de-duped ID along with the de-duped dates per employee, in order to get a count of # days the user uniquely signed in for the week. So I’ve been trying to enforce a rule to make sure I’m only getting the distinct sign ins per day, per employee.

print(tempdf)
ID      date       
10301  1/4/2021   
10301  1/4/2021   
10301  1/5/2021    
10301  1/6/2021   
10301  1/7/2021    
10301  1/8/2021    
10302  1/4/2021    
10302  1/5/2021    
10302  1/6/2021    
10302  1/7/2021    
10302  1/8/2021

JavaScript
​x
 
print(tempdf)
ID      date       
10301  1/4/2021   
10301  1/4/2021   
10301  1/5/2021    
10301  1/6/2021   
10301  1/7/2021    
10301  1/8/2021    
10302  1/4/2021    
10302  1/5/2021    
10302  1/6/2021    
10302  1/7/2021    
10302  1/8/2021    
​

The only thing I can think of is trying to pass some form of .nunique(), but this instance would apply that count to my entire dataframe, as well as filter out all of my ID’s before I can de-dupe them.

tempdf1=tempdf.groupby(['date'], as_index = False).nunique()
tempdf1['# days signed in for week'] = tempdf1.groupby('ID')['ID'].transform('count')
df=tempdf1.drop_duplicates(['ID']).copy()

JavaScript
 
tempdf1=tempdf.groupby(['date'], as_index = False).nunique()
tempdf1['# days signed in for week'] = tempdf1.groupby('ID')['ID'].transform('count')
df=tempdf1.drop_duplicates(['ID']).copy()
​
​

This is the result I’m going for:

print(df)
ID      date    # days signed in for week     
10301  1/4/2021    5
10302  1/4/2021    5

JavaScript
 
print(df)
ID      date    # days signed in for week     
10301  1/4/2021    5
10302  1/4/2021    5
​

Answer

calculate start of week
then it’s a simple use of count()

df = pd.read_csv(io.StringIO("""ID      date    # days signed in for week     
10301  1/4/2021    6
10301  1/4/2021    6
10301  1/5/2021    6
10301  1/6/2021    6
10301  1/7/2021    6
10301  1/8/2021    6
10302  1/4/2021    5
10302  1/5/2021    5
10302  1/6/2021    5
10302  1/7/2021    5
10302  1/8/2021    5"""), sep="ss+", engine="python")

df.date = pd.to_datetime(df.date)
df["weekStart"] = df['date'] - pd.to_timedelta(df['date'].dt.dayofweek, unit='d')
df.groupby(["ID","weekStart"])["date"].count().reset_index()
.rename(columns={"weekStart":"date","date":"# days signed in for week"})

JavaScript
 
df = pd.read_csv(io.StringIO("""ID      date    # days signed in for week     
10301  1/4/2021    6
10301  1/4/2021    6
10301  1/5/2021    6
10301  1/6/2021    6
10301  1/7/2021    6
10301  1/8/2021    6
10302  1/4/2021    5
10302  1/5/2021    5
10302  1/6/2021    5
10302  1/7/2021    5
10302  1/8/2021    5"""), sep="ss+", engine="python")
​
df.date = pd.to_datetime(df.date)
df["weekStart"] = df['date'] - pd.to_timedelta(df['date'].dt.dayofweek, unit='d')
df.groupby(["ID","weekStart"])["date"].count().reset_index()
.rename(columns={"weekStart":"date","date":"# days signed in for week"})
​

Advertisement

Answer