I am trying to find a way to calculate the overlapping (intersection) and non-overlapping areas of two distributions. Based on some posts (like this: FINDING AREA) I could figure out how to calculate the intersection of the two plots. But I had no luck finding the non-overlapping part.
Given this example distributions:
import pandas as pd
import numpy as np
import seaborn as sns
data1=np.random.normal(0, 0.1, 100)
data2=np.random.normal(0, 0.3, 100)
x0=pd.Series(data1)
x1=pd.Series(data2)
kwargs = dict(hist_kws={'alpha':.01}, kde_kws={'linewidth':2})
sns.distplot(x0,bins=3, color="dodgerblue", **kwargs)
sns.distplot(x1,bins=3, color="dodgerblue", **kwargs)
I wonder how I can calculate the areas between the two distributions at the edges?
I should mention that I have the data as well (if that is something that I could do with the data itself).
Advertisement
Answer
Taking inspiration from the answer you linked, from data you should compute kde curves through scipy.stats.gaussian_kde
:
data1 = np.random.normal(0, 0.1, 100)
data2 = np.random.normal(0, 0.3, 100)
kde1 = gaussian_kde(data1, bw_method = 0.5)
kde2 = gaussian_kde(data2, bw_method = 0.5)
xmin = min(data1.min(), data2.min())
xmax = max(data1.max(), data2.max())
dx = 0.2*(xmax - xmin)
xmin -= dx
xmax += dx
x = np.linspace(xmin, xmax, 1000)
kde1_x = kde1(x)
kde2_x = kde2(x)
Then you need to find intersection points between the kde curves:
idx = np.argwhere(np.diff(np.sign(kde1_x - kde2_x))).flatten()
where idx
is the list of index of intersection points.
Finally, you can compute area through numpy.trapz
, using index previously computed to slice x
, kde1_x
and kde2_x
:
area1 = np.trapz(kde2_x[:idx[0]] - kde1_x[:idx[0]], x[:idx[0]]) # area under the lower kde, from the first leftmost point to the first intersection point
area2 = np.trapz(kde2_x[idx[1]:] - kde1_x[idx[1]:], x[idx[1]:]) # area under the lower kde, from the second intersection point to the last rightmost point
area3 = np.trapz(np.minimum(kde1_x, kde2_x), x) # intersection area between of the kde curves, between the two intersection points
area4 = np.trapz(kde1_x[idx[0]:idx[1]] - kde2_x[idx[0]:idx[1]], x[idx[0]:idx[1]]) # area under the highest kde, excluding the lower one, between the two intersection points
Complete Code
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
data1 = np.random.normal(0, 0.1, 100)
data2 = np.random.normal(0, 0.3, 100)
kde1 = gaussian_kde(data1, bw_method = 0.5)
kde2 = gaussian_kde(data2, bw_method = 0.5)
xmin = min(data1.min(), data2.min())
xmax = max(data1.max(), data2.max())
dx = 0.2*(xmax - xmin)
xmin -= dx
xmax += dx
x = np.linspace(xmin, xmax, 1000)
kde1_x = kde1(x)
kde2_x = kde2(x)
idx = np.argwhere(np.diff(np.sign(kde1_x - kde2_x))).flatten()
area1 = np.trapz(kde2_x[:idx[0]] - kde1_x[:idx[0]], x[:idx[0]]) # area under the lower kde, from the first leftmost point to the first intersection point
area2 = np.trapz(kde2_x[idx[1]:] - kde1_x[idx[1]:], x[idx[1]:]) # area under the lower kde, from the second intersection point to the last rightmost point
area3 = np.trapz(np.minimum(kde1_x, kde2_x), x) # intersection area between of the kde curves, between the two intersection points
area4 = np.trapz(kde1_x[idx[0]:idx[1]] - kde2_x[idx[0]:idx[1]], x[idx[0]:idx[1]]) # area under the highest kde, excluding the lower one, between the two intersection points
fig, ax = plt.subplots()
ax.plot(x, kde1_x, color = 'dodgerblue', linewidth = 2)
ax.plot(x, kde2_x, color = 'orangered', linewidth = 2)
ax.fill_between(x[:idx[0]], kde2_x[:idx[0]], kde1_x[:idx[0]], color = 'dodgerblue', alpha = 0.3, label = 'area1')
ax.fill_between(x[idx[1]:], kde2_x[idx[1]:], kde1_x[idx[1]:], color = 'orangered', alpha = 0.3, label = 'area2')
ax.fill_between(x, np.minimum(kde1_x, kde2_x), 0, color = 'lime', alpha = 0.3, label = 'area3')
ax.fill_between(x[idx[0]:idx[1]], kde1_x[idx[0]:idx[1]], kde2_x[idx[0]:idx[1]], color = 'gold', alpha = 0.3, label = 'area4')
ax.plot(x[idx], kde2_x[idx], 'ko')
handles, labels = ax.get_legend_handles_labels()
labels[0] += f': {area1 * 100:.1f}%'
labels[1] += f': {area2 * 100:.1f}%'
labels[2] += f': {area3 * 100:.1f}%'
labels[3] += f': {area4 * 100:.1f}%'
ax.legend(handles, labels)
plt.show()