
How to visualize categorical frequency difference

Data: Diabetes dataset found here: https://raw.githubusercontent.com/LahiruTjay/Machine-Learning-With-Python/master/datasets/diabetes.csv

Objective: I want to examine how many people under the age of 30 have diabetes, which is indicated by a 1 or 0 in the “Outcome” column of the dataset, and plot it to see if there is a class imbalance (more 0s, more 1s, or roughly equal?).

Method:

  1. Filter my dataset as follows:
ages_under30 = data.loc[data.Age < 30].loc[:,["Age"]]
outcome_under30 = data.loc[data.Age < 30].loc[:,["Outcome"]]

This successfully returns all of the ages under 30 and what the person’s outcome is (0 or 1).

  2. I want to plot the points to see what the class representation looks like. Are certain ages more diabetic than others? The x-axis would be “ages_under30” and the y-axis would be “outcome_under30”.
plt.grid()
plt.xlabel("Age")
plt.ylabel("Diabetic?")
plt.plot(ages_under30, outcome_under30, "o")

[Figure: scatter plot of "Diabetic?" (0 or 1) against "Age" for people under 30]

See the figure above. This is where I need help. You can't really make heads or tails of this. There is a class imbalance in this age group: in fact, 312 samples are non-diabetic while only 84 are diabetic. How can I adjust the plot to better depict this class imbalance?


Answer

  • The difference in 'Outcome' for each 'Age' is most easily seen with a bar plot of the counts, which can be drawn directly with seaborn.countplot, or by calculating the counts in pandas and plotting with pandas.DataFrame.plot.
  • Tested in Python 3.8.12, pandas 1.3.3, matplotlib 3.4.3, seaborn 0.11.2

Data and Imports

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# data
df = pd.read_csv('https://raw.githubusercontent.com/LahiruTjay/Machine-Learning-With-Python/master/datasets/diabetes.csv')

# filter for less than 30
u30 = df[df.Age.lt(30)]
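
Before plotting, the overall balance of 'Outcome' can be checked numerically with value_counts. This is a minimal sketch, assuming a small stand-in frame (u30_sample is a hypothetical name) built from the first ten rows of the Data listing at the end of this answer. With the full u30 the same calls should reproduce the 312 / 84 split mentioned in the question.

```python
import pandas as pd

# stand-in for u30: the first ten rows of the Data listing (assumption for illustration)
u30_sample = pd.DataFrame({
    'Age': [21, 26, 29, 27, 29, 22, 28, 22, 28, 27],
    'Outcome': [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
})

# absolute number of rows per class
counts = u30_sample['Outcome'].value_counts()

# the same counts expressed as a share of all rows
shares = u30_sample['Outcome'].value_counts(normalize=True)

print(counts)
print(shares)
```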

Use seaborn.countplot

  • Directly show the counts of observations in each categorical bin using bars.
  • This can also be done with seaborn.catplot and kind='count', which creates a figure-level plot.
sns.countplot(data=u30, x='Age', hue='Outcome')

[Figure: grouped bar chart of the count per 'Age', colored by 'Outcome']
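
For the figure-level variant, a sketch along these lines should work. It assumes a small stand-in frame (u30_sample, a hypothetical name) in place of u30, and the height and aspect values are arbitrary choices for illustration.

```python
import pandas as pd
import seaborn as sns

# stand-in for u30: the first ten rows of the Data listing (assumption for illustration)
u30_sample = pd.DataFrame({
    'Age': [21, 26, 29, 27, 29, 22, 28, 22, 28, 27],
    'Outcome': [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
})

# kind='count' draws the same bars as countplot, but returns a FacetGrid
g = sns.catplot(data=u30_sample, x='Age', hue='Outcome', kind='count',
                height=4, aspect=1.5)

# label the axes through the FacetGrid rather than through plt
g.set_axis_labels('Age', 'Count')
```

Because catplot returns a FacetGrid, the figure size is controlled with height and aspect rather than with figsize.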

Use pandas.crosstab and pandas.DataFrame.plot

  • Use .crosstab to compute a frequency table between 'Age' and 'Outcome'.
    • This can also be done with groupby, but then the dataframe requires further manipulation for plotting.
# reshape the dataframe
ct = pd.crosstab(u30.Age, u30.Outcome)

# plot
ct.plot(kind='bar', rot=0)

[Figure: grouped bar chart of the crosstab counts per 'Age', one bar per 'Outcome']
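
The groupby route mentioned above could look roughly like this. It is a sketch, assuming a small stand-in frame (u30_sample, a hypothetical name) in place of u30. The extra manipulation is the .unstack call, which moves 'Outcome' from the index into the columns so that each class gets its own bar.

```python
import pandas as pd

# stand-in for u30: the first ten rows of the Data listing (assumption for illustration)
u30_sample = pd.DataFrame({
    'Age': [21, 26, 29, 27, 29, 22, 28, 22, 28, 27],
    'Outcome': [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
})

# count rows per (Age, Outcome) pair, then pivot Outcome into columns;
# fill_value=0 covers ages where one of the classes never occurs
gb = u30_sample.groupby(['Age', 'Outcome']).size().unstack(fill_value=0)

# the crosstab from above, for comparison
ct = pd.crosstab(u30_sample.Age, u30_sample.Outcome)

# plot the groupby result the same way as the crosstab
ax = gb.plot(kind='bar', rot=0)
```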

Data

  • In case the data at the GitHub link is no longer available
Age,Outcome
21,0
26,1
29,0
27,0
29,1
22,0
28,1
22,0
28,0
27,1
26,0
25,1
29,0
22,0
24,0
22,0
26,0
21,0
22,0
21,0
24,0
25,0
27,0
28,1
26,0
23,0
22,0
22,0
27,0
26,1
24,0
22,0
22,0
22,0
27,0
26,0
24,0
21,0
21,0
24,0
22,0
23,0
22,0
21,0
24,0
27,0
21,0
27,0
25,0
24,1
24,1
23,0
25,0
25,0
22,0
21,0
25,1
24,0
23,0
23,1
26,1
23,0
26,0
21,0
22,0
29,0
28,0
22,0
23,0
21,0
22,0
24,0
23,0
21,0
23,0
22,0
27,0
21,0
22,0
29,0
29,0
29,1
25,0
23,0
26,1
23,0
21,0
27,0
25,1
21,0
29,1
21,0
23,1
26,1
29,1
21,0
28,0
27,0
27,0
21,0
25,0
24,0
24,1
25,1
21,1
26,0
22,0
26,0
24,1
24,0
22,1
22,0
29,0
23,0
26,1
23,1
27,0
21,0
22,0
22,1
29,0
23,0
23,0
27,0
24,0
25,0
21,1
25,0
24,0
27,1
24,0
25,1
24,0
21,0
28,1
21,0
21,0
25,0
29,1
23,0
22,0
28,1
29,1
26,0
21,0
25,1
24,1
28,0
29,1
24,0
25,1
28,1
29,0
21,0
25,1
22,0
27,1
25,0
26,0
29,1
28,0
25,1
21,0
24,0
23,1
25,0
22,0
26,0
22,0
22,0
22,0
23,0
26,0
29,0
24,0
21,0
28,1
29,1
29,1
29,1
21,0
22,0
25,1
21,0
21,0
25,0
28,0
22,0
22,0
24,0
22,0
21,0
25,0
25,0
24,0
28,0
27,1
21,0
25,0
22,1
25,0
25,1
26,0
25,0
28,1
28,0
25,0
22,0
21,0
21,1
22,1
22,0
27,0
28,1
26,0
21,0
21,0
21,0
25,0
26,0
23,0
22,0
29,0
29,1
28,0
21,0
22,0
24,0
25,1
28,0
26,0
22,1
26,0
23,0
23,1
25,0
24,0
24,0
26,0
21,0
22,0
25,0
27,0
28,0
22,0
22,0
24,0
29,1
29,0
28,0
23,0
24,1
21,0
28,0
24,0
22,0
25,0
21,0
28,0
21,0
21,0
21,0
22,0
24,0
28,1
25,0
26,0
26,0
24,0
21,0
21,0
24,0
22,0
22,0
24,0
29,0
24,0
23,1
23,0
27,1
25,0
29,0
28,0
21,0
25,0
23,0
28,0
28,1
24,0
27,0
22,0
21,0
21,0
22,0
22,0
23,0
25,0
21,1
21,1
27,0
22,0
29,0
25,0
24,0
25,0
22,1
21,0
26,0
24,0
28,0
21,0
22,1
25,0
27,0
23,0
24,0
26,0
27,0
23,0
24,1
28,0
28,0
21,0
21,0
29,0
21,0
21,0
21,0
24,0
23,0
22,0
23,0
28,0
27,0
24,0
27,0
22,1
23,0
23,0
27,0
28,0
27,0
22,0
25,1
22,0
27,1
22,1
24,0
21,0
22,0
25,0
25,1
23,0
22,0
26,1
22,0
27,1
25,0
22,0
29,0
23,0
23,0
25,0
22,0
28,0
26,0
26,0
27,0
28,0
22,0
23,1
24,0
21,0
24,0
21,0
25,0
22,0
22,0
22,0
22,1
24,1
22,0
28,0
21,0
21,0
26,0
22,0
27,1
22,1
28,0
25,0
26,1
26,0
22,0
27,0
23,0
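
One way to load the listing above without the GitHub link is to paste it into a string and read it with io.StringIO. This is a sketch only: csv_text is a hypothetical name, and just the first three rows are shown here, so the remaining rows need to be pasted in. The listing already contains only ages under 30, so no further filtering is needed.

```python
import io
import pandas as pd

# header plus the first three rows of the listing; paste the remaining rows below them
csv_text = """Age,Outcome
21,0
26,1
29,0
"""

# read_csv accepts any file-like object, so wrap the string in StringIO
u30 = pd.read_csv(io.StringIO(csv_text))

print(u30.head())
```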