Skip to content
Advertisement

In python pandas, How do you eliminate rows of data that fail to meet a condition of grouped data?

I have a data set that contains hourly data of marketing campaigns. There are several campaigns and not all of them are active during the 24 hours of the day. My goal is to eliminate all rows of active hour campaigns where I don’t have the 24 data rows of a single day.

The raw data contains a lot of information like this:

Original Data Set

I created a dummy variable with ones to be able to count single instance of rows. This is the code I applied to be able to see the results I want to get.

tmp = df.groupby(['id','date']).count()
tmp.query('Hour' > 23)

I get the following results:

Results of two lines of code

These results illustrate exactly the data that I want to keep in my data frame.

How can I eliminate the data per campaign per day that does not reach 24? The objective is not the count but the real data. Therefore ungrouped from what I present in the second picture.

I appreciate the guidance.

Advertisement

Answer

Use transform to broadcast the count over all rows of your dataframe the use loc as replacement of query:

out = df.loc[df.groupby(['id', 'date'])['Hour'].transform('count')
               .loc[lambda x: x > 23].index]
User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement