I have a DataFrame like this one:
JavaScript
x
15
15
1
customer_type age satisfaction design food wifi service distance
2
Loyal 28 Not Satisfied 0 1 2 2 13.5
3
Loyal 55 Satisfied 5 3 5 4 34.2
4
Disloyal 36 Not Satisfied 2 0 2 4 55.8
5
Disloyal 28 Not Satisfied 3 1 2 2 13.5
6
Disloyal 33 Not Satisfied 2 1 2 2 13.5
7
Disloyal 35 Not Satisfied 2 1 2 2 13.5
8
Disloyal 39 Not Satisfied 1 1 2 2 13.5
9
Disloyal 31 Not Satisfied 2 1 2 2 13.5
10
Loyal 28 Not Satisfied 0 1 2 2 13.5
11
Disloyal 31 Not Satisfied 2 1 2 2 13.5
12
Disloyal 40 Not Satisfied 2 1 2 2 13.5
13
Disloyal 35 Not Satisfied 2 1 2 2 13.5
14
Disloyal 35 Not Satisfied 2 2 2 2 13.5
15
I want to find out the characteristics of the Disloyal
and Not Satisfied
customers that are between 30 and 40 years old, grouping them by the service they have rated:
JavaScript
1
8
1
service ratings_count age age_count population_pct
2
design 8 40 1 7.69
3
36 1 7.69
4
35 3 23.07
5
33 1 7.69
6
31 2 15.38
7
food 1 35 1 7.69
8
I suspect I have to use melt
but I can’t figure out how to groupby
from there.
Advertisement
Answer
With the following toy dataframe, inspired by yours but a bit more heterogeneous:
JavaScript
1
57
57
1
import pandas as pd
2
3
df = pd.DataFrame(
4
{
5
"customer_type": [
6
"Loyal",
7
"Loyal",
8
"Disloyal",
9
"Disloyal",
10
"Loyal",
11
"Disloyal",
12
"Disloyal",
13
"Disloyal",
14
"Loyal",
15
"Disloyal",
16
"Disloyal",
17
"Loyal",
18
"Disloyal",
19
],
20
"age": [28, 55, 27, 31, 42, 35, 39, 31, 28, 51, 40, 35, 35],
21
"satisfaction": [
22
"Not Satisfied",
23
"Satisfied",
24
"Not Satisfied",
25
"Not Satisfied",
26
"Not Satisfied",
27
"Not Satisfied",
28
"Not Satisfied",
29
"Not Satisfied",
30
"Not Satisfied",
31
"Satisfied",
32
"Not Satisfied",
33
"Not Satisfied",
34
"Satisfied",
35
],
36
"design": [0, 5, 2, 4, 2, 2, 1, 0, 1, 2, 0, 2, 1],
37
"food": [1, 3, 0, 0, 1, 0, 1, 0, 1, 0, 1, 3, 2],
38
"wifi": [2, 5, 2, 1, 2, 2, 0, 0, 1, 2, 2, 0, 2],
39
"service": [2, 4, 1, 2, 2, 2, 0, 2, 3, 0, 1, 2, 2],
40
"distance": [
41
13.5,
42
34.2,
43
55.8,
44
21.2,
45
23.5,
46
13.5,
47
9.0,
48
13.5,
49
13.5,
50
22.7,
51
13.5,
52
13.5,
53
18.4,
54
],
55
}
56
)
57
Here is one less subtel, although easier, way to do it:
JavaScript
1
61
61
1
# Setup
2
amenities = ["design", "food", "wifi", "service"]
3
dfs = []
4
new_df = df[
5
df["age"].between(30, 40)
6
& (df["customer_type"] == "Disloyal")
7
& (df["satisfaction"] == "Not Satisfied")
8
]
9
10
# Iterate on each amenity, groupby age and add other columns/values
11
for amenity in amenities:
12
temp_df = (
13
new_df[new_df[amenity] > 0]
14
.groupby("age")
15
.agg({amenity: "count", "distance": "mean"})
16
.reset_index(drop=False)
17
.rename(columns={amenity: "age_count", "distance": "distance_mean"})
18
.pipe(
19
lambda df_: df_.assign(
20
population_pct=(100 * df_["age_count"] / df_["age_count"].sum()).round(
21
2
22
)
23
)
24
)
25
)
26
temp_df.loc[0, "amenities"] = amenity
27
temp_df.loc[0, "ratings_count"] = temp_df["age_count"].sum()
28
temp_df = pd.concat(
29
[
30
temp_df,
31
new_df[(new_df[amenity] != 0)]
32
.sort_values(by=["age"])
33
.groupby("age")
34
.agg({amenity: list})
35
.reset_index(drop=True),
36
],
37
axis=1,
38
)
39
temp_df = temp_df.rename(columns={amenity: "ratings"})
40
temp_df["ratings"] = temp_df["ratings"].apply(lambda x: x[0] if len(x) == 1 else x)
41
dfs.append(temp_df)
42
43
# Get final dataframe and cleanup
44
new_df = (
45
pd.concat(dfs)
46
.fillna(method="ffill")
47
.reindex(
48
columns=[
49
"amenities",
50
"ratings_count",
51
"age",
52
"age_count",
53
"ratings",
54
"distance_mean",
55
"population_pct",
56
]
57
)
58
.astype({"ratings_count": "int"})
59
.set_index(["amenities", "ratings_count"])
60
)
61
So that:
JavaScript
1
16
16
1
print(new_df)
2
# Output
3
age age_count ratings distance_mean population_pct
4
amenities ratings_count
5
design 3 31 1 4 21.20 33.33
6
3 35 1 2 13.50 33.33
7
3 39 1 1 9.00 33.33
8
food 2 39 1 1 9.00 50.00
9
2 40 1 1 13.50 50.00
10
wifi 3 31 1 1 21.20 33.33
11
3 35 1 2 13.50 33.33
12
3 40 1 2 13.50 33.33
13
service 4 31 2 [2, 2] 17.35 50.00
14
4 35 1 2 13.50 25.00
15
4 40 1 1 13.50 25.00
16
From here, you can filter ratings_count <=2
like this:
JavaScript
1
7
1
print(new_df[new_df.index.get_level_values(1) <= 2])
2
# Output
3
age age_count ratings distance_mean population_pct
4
amenities ratings_count
5
food 2 39 1 1 9.0 50.0
6
2 40 1 1 13.5 50.0
7