I have a pandas dataframe like this.
                          Time Source Level County   Town
    0  2021-12-01 10:01:41.443    NaN   NaN    NaN    NaN
    1                      NaN   Test     3     C1  C1-T1
    2                      NaN   Test    5-     C2  C2-T0
    3                      NaN   Test    5-     C2  C2-T1
    4  2021-12-01 10:01:46.452    NaN   NaN    NaN    NaN
I want to combine the Town values of rows that have the same Source, Level and County values.
I have tried isin, groupby, and diff (but my values are strings), and still can't figure it out.
Below is what I want to get.
                          Time Source Level County          Town
    0  2021-12-01 10:01:41.443    NaN   NaN    NaN           NaN
    1                      NaN   Test     3     C1         C1-T1
    2                      NaN   Test    5-     C2  C2-T0, C2-T1
    3  2021-12-01 10:01:46.452    NaN   NaN    NaN           NaN
Really appreciate your help!
Answer
We can make this work by first collecting the Town values of each group into a list with groupby() and apply(list), and then turning each list into a comma-separated string. Let's split it into two steps for clarity.
Personally, I would keep this data as a list within a pandas Series and skip step 2. A comma-separated string is usually harder to work with later.
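To illustrate why the list form can be easier to work with, here is a minimal sketch. The Series `towns` and its index labels are made up for the example and only mimic the shape of the grouped Town column.

```python
import pandas as pd

# Hypothetical Series of lists, shaped like the grouped Town column.
towns = pd.Series([["C1-T1"], ["C2-T0", "C2-T1"]], index=["C1", "C2"])

# Number of towns per group; .str.len() also works on list elements.
counts = towns.str.len()

# One row per town again, with the group label repeated in the index.
flat = towns.explode()

# Membership test per group, without any string parsing.
has_t1 = towns.apply(lambda ts: "C2-T1" in ts)
```

With a comma-separated string, each of these would first need a split on ", ".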
Step 1:
    # dropna=False (pandas >= 1.1) keeps groups whose keys contain NaN; by default they are dropped
    output = df.groupby(['Time', 'Source', 'Level', 'County'], dropna=False)['Town'].apply(list).reset_index()
Returns:
                          Time Source Level County            Town
    0  2021-12-01 10:01:41.443    NaN   NaN    NaN           [nan]
    1  2021-12-01 10:01:46.452    NaN   NaN    NaN           [nan]
    2                      NaN   Test     3     C1         [C1-T1]
    3                      NaN   Test    5-     C2  [C2-T0, C2-T1]
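For anyone who wants to reproduce step 1, here is a self-contained sketch. The DataFrame is rebuilt by hand from the question, and I am assuming that Time and Level are stored as strings. It also assumes pandas 1.1 or newer, because it passes dropna=False: groupby drops rows whose keys contain NaN by default, and every row here has at least one NaN key.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (Time and Level kept as strings, an assumption).
df = pd.DataFrame({
    "Time": ["2021-12-01 10:01:41.443", np.nan, np.nan, np.nan,
             "2021-12-01 10:01:46.452"],
    "Source": [np.nan, "Test", "Test", "Test", np.nan],
    "Level": [np.nan, "3", "5-", "5-", np.nan],
    "County": [np.nan, "C1", "C2", "C2", np.nan],
    "Town": [np.nan, "C1-T1", "C2-T0", "C2-T1", np.nan],
})

keys = ["Time", "Source", "Level", "County"]

# dropna=False keeps the groups whose keys contain NaN.
output = df.groupby(keys, dropna=False)["Town"].apply(list).reset_index()
print(output)
```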
Now, we can format them correctly into strings (step 2):
    # needs: import numpy as np; import pandas as pd
    output['Town'] = pd.Series([', '.join(y for y in x if isinstance(y, str)) for x in output['Town']]).replace('', np.nan)
Which outputs our desired result:
                          Time Source Level County          Town
    0  2021-12-01 10:01:41.443    NaN   NaN    NaN           NaN
    1  2021-12-01 10:01:46.452    NaN   NaN    NaN           NaN
    2                      NaN   Test     3     C1         C1-T1
    3                      NaN   Test    5-     C2  C2-T0, C2-T1
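If you do want the string form, both steps can also be folded into a single aggregation. This is only a sketch under a few assumptions: the sample data is rebuilt by hand from the question, `join_towns` is a helper name I made up, and pandas 1.1 or newer is needed for dropna=False. Passing sort=False asks groupby not to sort the group keys, which in recent pandas versions should keep the groups in order of first appearance, as in the layout the question asks for. I have not checked that ordering on older releases.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (Time and Level kept as strings, an assumption).
df = pd.DataFrame({
    "Time": ["2021-12-01 10:01:41.443", np.nan, np.nan, np.nan,
             "2021-12-01 10:01:46.452"],
    "Source": [np.nan, "Test", "Test", "Test", np.nan],
    "Level": [np.nan, "3", "5-", "5-", np.nan],
    "County": [np.nan, "C1", "C2", "C2", np.nan],
    "Town": [np.nan, "C1-T1", "C2-T0", "C2-T1", np.nan],
})


def join_towns(values):
    """Join the string entries with ', '; return NaN when there are none."""
    towns = [v for v in values if isinstance(v, str)]
    return ", ".join(towns) if towns else np.nan


keys = ["Time", "Source", "Level", "County"]

# dropna=False keeps NaN keys; sort=False avoids reordering the groups by key.
combined = (
    df.groupby(keys, dropna=False, sort=False)["Town"]
      .agg(join_towns)
      .reset_index()
)
print(combined)
```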