The table looks like the following:
text | dummy1 | days |
---|---|---|
op123ac | 1 | 2000-01-01 |
op123ac | 0 | 2000-01-04 |
op123ac | 0 | 2000-01-07 |
op123ac | 0 | 2000-01-10 |
op1248ab | 0 | 2000-01-17 |
op1248ab | 1 | 2000-01-20 |
op1248ab | 1 | 2000-01-23 |
op1248ab | 1 | 2000-01-26 |
Each unique “text” value appears four times, once for each of four unique “days” values, and the “days” are consecutive for each “text”. The problem is that each “text” must have a single “dummy1” value, either 1 or 0 on all four rows, so that:
df.groupby("text")["dummy1"].sum()  # each "text" should sum to either 0 or 4; no other value is acceptable
I figured that, within each “text”, the erroneous “dummy1” values are fewer than the correct ones. For example, “op123ac” has both 1 and 0 in the “dummy1” column; there are three 0s and only one 1, so the correct “dummy1” is 0. “op1248ab” has more 1s than 0s, so its correct value is 1. The correct table should look like the following:
text | dummy1 | days |
---|---|---|
op123ac | 0 | 2000-01-01 |
op123ac | 0 | 2000-01-04 |
op123ac | 0 | 2000-01-07 |
op123ac | 0 | 2000-01-10 |
op1248ab | 1 | 2000-01-17 |
op1248ab | 1 | 2000-01-20 |
op1248ab | 1 | 2000-01-23 |
op1248ab | 1 | 2000-01-26 |
The number of 0s can never equal the number of 1s for any “text” value.
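The no-tie assumption and the 0-or-4 rule can be checked before correcting anything. This is only a minimal sketch: it rebuilds the sample table above as a DataFrame named df, and the names stats, inconsistent, and tied are illustrative, not part of the original question.

```python
import pandas as pd

# Sample data copied from the first table above.
df = pd.DataFrame({
    "text": ["op123ac"] * 4 + ["op1248ab"] * 4,
    "dummy1": [1, 0, 0, 0, 0, 1, 1, 1],
    "days": pd.to_datetime([
        "2000-01-01", "2000-01-04", "2000-01-07", "2000-01-10",
        "2000-01-17", "2000-01-20", "2000-01-23", "2000-01-26",
    ]),
})

# Per "text": how many 1s there are and how many rows there are in total.
stats = df.groupby("text")["dummy1"].agg(ones="sum", size="count")

# Groups that mix 0s and 1s (the sum is neither 0 nor the group size).
inconsistent = stats[(stats["ones"] != 0) & (stats["ones"] != stats["size"])]

# Groups where 0s and 1s are equally frequent, so no majority exists.
tied = stats[stats["ones"] * 2 == stats["size"]]

print(inconsistent)
print("ties:", list(tied.index))
```

On the sample data both “text” values show up as inconsistent and there are no ties, which matches the description above.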
Answer
df['dummy1'] = df.groupby('text')['dummy1'].transform(lambda x: x.mode().iat[0])
Output
index | text | dummy1 | days |
---|---|---|---|
0 | op123ac | 0 | 2000-01-01 |
1 | op123ac | 0 | 2000-01-04 |
2 | op123ac | 0 | 2000-01-07 |
3 | op123ac | 0 | 2000-01-10 |
4 | op1248ab | 1 | 2000-01-17 |
5 | op1248ab | 1 | 2000-01-20 |
6 | op1248ab | 1 | 2000-01-23 |
7 | op1248ab | 1 | 2000-01-26 |
The mode of a set of values is the value that appears most often.
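For anyone who wants to try this end to end, here is a self-contained sketch of the same majority-vote idea. It assumes the sample data from the question; the helper name majority_value and the variable sums are illustrative only. Note that Series.mode() returns its values sorted, so in a tie .iat[0] would pick the smaller value (0); the question states that ties cannot occur.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "text": ["op123ac"] * 4 + ["op1248ab"] * 4,
    "dummy1": [1, 0, 0, 0, 0, 1, 1, 1],
    "days": pd.to_datetime([
        "2000-01-01", "2000-01-04", "2000-01-07", "2000-01-10",
        "2000-01-17", "2000-01-20", "2000-01-23", "2000-01-26",
    ]),
})


def majority_value(values: pd.Series):
    """Return the most frequent value in the group (first one if tied)."""
    return values.mode().iat[0]


# transform broadcasts the per-group result back onto every row of the group.
df["dummy1"] = df.groupby("text")["dummy1"].transform(majority_value)

# Sanity check from the question: every group should now sum to 0 or 4.
sums = df.groupby("text")["dummy1"].sum()
print(df)
print(sums)
```

Using transform rather than agg keeps the result aligned with the original rows, so it can be assigned straight back to the column.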