I have a dataframe which results from:

    df_grouped = df.groupby(['A', 'B'])['A'].count().sort_values(ascending=False)
    df_grouped = pd.DataFrame(df_grouped)
    new_index = pd.MultiIndex.from_tuples(df_grouped.index)
    df_grouped.index = new_index
    df_grouped.reset_index(inplace=True)
    df_grouped.columns = ['A', 'B', 'count']

Then, `df_grouped` is something like:
A | B | count |
---|---|---|
A_1 | B_1 | 10 |
A_1 | B_2 | 51 |
A_1 | B_3 | 25 |
A_1 | B_4 | 12 |
A_1 | B_5 | 2 |
A_2 | B_1 | 19 |
A_2 | B_3 | 5 |
A_3 | B_5 | 18 |
A_3 | B_4 | 33 |
A_3 | B_5 | 44 |
A_4 | B_1 | 29 |
A_5 | B_2 | 32 |
I have plotted a `seaborn.histplot` using the following code:

    fig, ax = plt.subplots(1, 1, figsize=(10,5))
    sns.histplot(x='A', hue='B', data=df_grouped, ax=ax, multiple='stack', weights='count')

This results in the following image:
What I would like is to order the plot based on the total counts of each value of A. I have tried different methods, but I am not able to get a successful result.
Edit
I found a way to do what I wanted.
What I did was calculate the total count for each value of `A` in `df_grouped` and sort by it:

    df_grouped['total_count'] = df_grouped.groupby(by='A')['count'].transform('sum')
    df_grouped = df_grouped.sort_values(by=['total_count'], ascending=False)

Then, by using the same plot code from above, I got the desired result.
The answer is similar to what Redox proposed.
In any case, I will try the other options proposed.
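
As a minimal sketch of the approach described in the edit, the snippet below rebuilds the table from the question by hand and applies the same `transform('sum')` and sort. The names `df_sorted` and `order` are illustrative and not from the original code, and `kind='mergesort'` is an added assumption to keep the rows of each `A` in their original relative order. Plotting is left out so the example depends only on pandas.

```python
import pandas as pd

# counts copied from the table shown in the question
df_grouped = pd.DataFrame(
    [
        ('A_1', 'B_1', 10), ('A_1', 'B_2', 51), ('A_1', 'B_3', 25),
        ('A_1', 'B_4', 12), ('A_1', 'B_5', 2),
        ('A_2', 'B_1', 19), ('A_2', 'B_3', 5),
        ('A_3', 'B_5', 18), ('A_3', 'B_4', 33), ('A_3', 'B_5', 44),
        ('A_4', 'B_1', 29), ('A_5', 'B_2', 32),
    ],
    columns=['A', 'B', 'count'],
)

# total count per value of A, broadcast back onto every row of that group
df_grouped['total_count'] = df_grouped.groupby('A')['count'].transform('sum')

# sort rows so that the A values with the largest totals come first
# (mergesort is stable, so rows within one A keep their relative order)
df_sorted = df_grouped.sort_values('total_count', ascending=False, kind='mergesort')

# order in which the A values now appear in the data
order = df_sorted['A'].unique().tolist()
print(order)  # ['A_1', 'A_3', 'A_5', 'A_4', 'A_2']
```

Passing `df_sorted` as `data` to the plotting call from the question should then draw the bars in this order, on the assumption that seaborn places string categories in order of first appearance.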
Answer
- To be clear, the visualization is a stacked bar chart, not a histogram: a histogram represents the distribution of continuous values, while this shows the counts of discrete categorical values.
- This answer starts with the raw dataframe, not the dataframe created with `.groupby`.
- The easiest way to do this is to create a frequency table of the raw dataframe using `pd.crosstab`, not with `.groupby`.
- Add a column with the `sum` along `axis=1`.
- Use the new column to sort the dataframe.
- Plot directly with `pandas.DataFrame.plot` using `kind='bar'` and `stacked=True`. `seaborn.histplot` is not needed, and `seaborn` is just a high-level API for `matplotlib`. `pandas` uses `matplotlib` by default for plotting.
- This reduces the code to 4 lines.
- Tested in `python 3.10`, `pandas 1.4.2`, `matplotlib 3.5.1`, `seaborn 0.11.2`

    import numpy as np  # used for creating sample data
    import pandas as pd

    # sample dataframe representing raw data
    np.random.seed(365)
    rows = 1100
    data = {'A': np.random.choice([f'A_{v}' for v in range(1, 6)], size=rows, p=[.35, .05, .25, .15, .2]),
            'B': np.random.choice([f'B_{v}' for v in range(1, 6)], size=rows, p=[.2, .35, .05, .15, .25])}
    df = pd.DataFrame(data)

    # 1. frequency counts
    dfc = pd.crosstab(df.A, df.B)

    # 2. add total column
    dfc['tot_A'] = dfc.sum(axis=1)

    # 3. sort
    dfc = dfc.sort_values('tot_A', axis=0, ascending=False)

    # 4. plot the columns except `tot_A`
    dfc.iloc[:, :-1].plot(kind='bar', stacked=True, figsize=(10, 5), rot=0, width=1, ec='k')

Data Views
`df` (first rows)

         A    B
    0  A_5  B_5
    1  A_3  B_1
    2  A_4  B_5
    3  A_3  B_4
    4  A_3  B_5

`dfc`

    B    B_1  B_2  B_3  B_4  B_5  tot_A
    A
    A_1   86  131   15   55   90    377
    A_3   47   90    9   33   61    240
    A_5   37   83   13   33   56    222
    A_4   43   65    9   27   50    194
    A_2   16   21    1    5   24     67
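
The answer starts from the raw dataframe. If only an already-aggregated table like `df_grouped` from the question is available, a comparable wide table could presumably be built with `pivot_table`. The sketch below is not part of the tested answer: the sample values are made up for illustration, and the names `wide` and `order` are hypothetical.

```python
import pandas as pd

# illustrative aggregated data in the same long format as df_grouped
# (columns A, B, count); the values are invented for this sketch
df_grouped = pd.DataFrame({
    'A': ['A_1', 'A_1', 'A_2', 'A_2', 'A_3'],
    'B': ['B_1', 'B_2', 'B_1', 'B_3', 'B_2'],
    'count': [10, 51, 19, 5, 33],
})

# wide table: one row per A, one column per B; counts of repeated (A, B)
# pairs are summed, and missing pairs are filled with 0
wide = df_grouped.pivot_table(index='A', columns='B', values='count',
                              aggfunc='sum', fill_value=0)

# order the rows by their total, largest first, without adding a helper column
order = wide.sum(axis=1).sort_values(ascending=False).index
wide = wide.loc[order]

print(wide)

# the stacked bar chart would then be drawn as in the answer (needs matplotlib):
# wide.plot(kind='bar', stacked=True, figsize=(10, 5), rot=0, width=1, ec='k')
```

Sorting via the index of the row sums avoids the extra `tot_A` column, so no column has to be excluded when plotting. Whether that is clearer than the answer's explicit total column is a matter of taste.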