Title may be confusing: I have a dataframe that displays user_id sign in’s during the week. My goal is to display the de-duped ID along with the de-duped dates per employee, in order to get a count of # days the user uniquely signed in for the week. So I’ve been trying to enforce a rule to make sure I’m only getting the distinct sign ins per day, per employee.
print(tempdf) ID date 10301 1/4/2021 10301 1/4/2021 10301 1/5/2021 10301 1/6/2021 10301 1/7/2021 10301 1/8/2021 10302 1/4/2021 10302 1/5/2021 10302 1/6/2021 10302 1/7/2021 10302 1/8/2021
The only thing I can think of is trying to pass some form of .nunique(), but this instance would apply that count to my entire dataframe, as well as filter out all of my ID’s before I can de-dupe them.
tempdf1=tempdf.groupby(['date'], as_index = False).nunique() tempdf1['# days signed in for week'] = tempdf1.groupby('ID')['ID'].transform('count') df=tempdf1.drop_duplicates(['ID']).copy()
This is the result I’m going for:
print(df) ID date # days signed in for week 10301 1/4/2021 5 10302 1/4/2021 5
Advertisement
Answer
- calculate start of week
- then it’s a simple use of
count()
df = pd.read_csv(io.StringIO("""ID date # days signed in for week 10301 1/4/2021 6 10301 1/4/2021 6 10301 1/5/2021 6 10301 1/6/2021 6 10301 1/7/2021 6 10301 1/8/2021 6 10302 1/4/2021 5 10302 1/5/2021 5 10302 1/6/2021 5 10302 1/7/2021 5 10302 1/8/2021 5"""), sep="ss+", engine="python") df.date = pd.to_datetime(df.date) df["weekStart"] = df['date'] - pd.to_timedelta(df['date'].dt.dayofweek, unit='d') df.groupby(["ID","weekStart"])["date"].count().reset_index() .rename(columns={"weekStart":"date","date":"# days signed in for week"})