Connecting means in seaborn box plot

I want to connect the means of the box plots with a line. I can draw the basic plot, but the line does not follow the means, and the box plots are offset from their ticks on the x axis. There is a similar post, but it does not connect the means: Python: seaborn pointplot and boxplot in one plot but shifted on the x-axis

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy','Jason', 'Molly', 'Tina', 'Jake', 'Amy','Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'pre_score': [4, 24, 31, 2, 3,25, 94, 57, 62, 70,5, 43, 23, 23, 51]
        }

data = pd.DataFrame(raw_data, columns = ['first_name', 'pre_score'])


   first_name  pre_score
0       Jason          4
1       Molly         24
2        Tina         31
3        Jake          2
4         Amy          3
5       Jason         25
6       Molly         94
7        Tina         57
8        Jake         62
9         Amy         70
10      Jason          5
11      Molly         43
12       Tina         23
13       Jake         23
14        Amy         51

sns.set_style("ticks")
ax = sns.stripplot(x='first_name', y='pre_score', hue='first_name', jitter=True, dodge=True, size=6, zorder=0, alpha=0.5, linewidth =1, data=data)
ax = sns.boxplot(x='first_name', y='pre_score', hue='first_name', dodge=True, showfliers=True, linewidth=0.8, showmeans=True, data=data)
ax = sns.lineplot(x='first_name', y='pre_score', color='k', data=data.groupby(['first_name'], as_index=False).mean())
fig_size = [18.0, 10.0]
plt.rcParams["figure.figsize"] = fig_size 
handles, labels = ax.get_legend_handles_labels()
legend_len = labels.__len__()
ax.legend(handles[int(legend_len/2):legend_len], labels[int(legend_len/2):legend_len], bbox_to_anchor=(1.01, 1), loc=2, borderaxespad=0.1); 

In the resulting plot, the sns.lineplot line does not follow the means, and the box plots are offset from their names on the x axis.

How can I fix this?

Answer

When dealing with seaborn plots, I would strongly recommend that you always provide an order= (and hue_order= if applicable), to avoid nasty surprises with the categories not showing up in a consistent order between calls.
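
To see how the order can drift between calls, here is a small pandas-only sketch using the data from the question. It assumes that seaborn falls back to order of first appearance for a plain string column when no order= is given, which is what unique() returns, while groupby() sorts its keys by default. The variable names are mine, for illustration only.

```python
import pandas as pd

# Same sample data as in the question
data = pd.DataFrame({
    'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'] * 3,
    'pre_score': [4, 24, 31, 2, 3, 25, 94, 57, 62, 70, 5, 43, 23, 23, 51],
})

# Order of first appearance in the column
# (assumed here to be what the categorical plots use without order=)
appearance_order = list(data['first_name'].unique())

# groupby() sorts its keys by default, so the aggregated frame
# passed to lineplot() comes out in alphabetical order instead
groupby_order = list(data.groupby('first_name')['pre_score'].mean().index)

# One explicit order, to be passed to every plotting call
order = sorted(appearance_order)

print(appearance_order)
print(groupby_order)
```

The two printed lists differ, so a line built from the groupby() result does not line up with boxes laid out in order of appearance unless both are pinned to the same order.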

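For the purpose of your question, you can replace the lineplot with a pointplot, which automatically aggregates the values by category and connects them with a line. The calls below also leave out hue= and dodge=True, because dodging by a hue that is the same as the x variable is what shifts each box away from its tick: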

import pandas as pd
import numpy as np
import seaborn as sns

raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy','Jason', 'Molly', 'Tina', 'Jake', 'Amy','Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'pre_score': [4, 24, 31, 2, 3,25, 94, 57, 62, 70,5, 43, 23, 23, 51]
        }


data = pd.DataFrame(raw_data, columns = ['first_name', 'pre_score'])
# define the order in which the categories will be plotted on the x-axis
order = np.sort(data['first_name'].unique()) # you could also create a list by hand if you want a specific order

sns.set_style("ticks")
# all three calls share the same order=, so points, boxes and means line up
ax = sns.stripplot(x='first_name', y='pre_score', order=order, jitter=True, size=6, zorder=0, alpha=0.5, linewidth=1, data=data)
ax = sns.boxplot(x='first_name', y='pre_score', order=order, showfliers=True, linewidth=0.8, showmeans=True, data=data)
# ci=None turns off the error bars; seaborn 0.12+ spells this errorbar=None
ax = sns.pointplot(x='first_name', y='pre_score', order=order, data=data, ci=None, color='black')

If for some reason you don't want to, or cannot, use a seaborn function that takes an order argument, then aggregate by hand in pandas and reindex() with your order, so that the values are in the right order before you plot them with the tool of your choice.

For instance, you could replace the call to pointplot() above with:

means = data.groupby('first_name')['pre_score'].mean().reindex(order) # calculate the means and ensure they are 
                                                                      # displayed in the same order as the boxplots
ax.plot(means.index, means.values, 'ko-', lw=3)

This gives exactly the same result.
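
If you would rather not rely on matplotlib mapping the name strings onto the existing ticks, a variant is to plot against integer positions. This is only a sketch: it assumes the categorical plots put the i-th entry of order at x = i (their default layout), and x_positions is a name I made up for the example.

```python
import pandas as pd

# Same sample data as in the question
data = pd.DataFrame({
    'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'] * 3,
    'pre_score': [4, 24, 31, 2, 3, 25, 94, 57, 62, 70, 5, 43, 23, 23, 51],
})

# The same explicit order that is passed to stripplot() and boxplot()
order = sorted(data['first_name'].unique())

# Aggregate by hand, then force the rows into the plotting order
means = data.groupby('first_name')['pre_score'].mean().reindex(order)

# Assumption: the i-th category in `order` is drawn at x = i
x_positions = list(range(len(order)))

for x, name, value in zip(x_positions, means.index, means.to_numpy()):
    print(x, name, round(value, 2))

# On the axes returned by boxplot(), the line would then be drawn with:
# ax.plot(x_positions, means.to_numpy(), 'ko-', lw=3)
```

Printing the positions next to the names is a quick way to check that each mean ends up over the right box before you draw anything.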

User contributions licensed under: CC BY-SA