Data: Diabetes dataset found here: https://raw.githubusercontent.com/LahiruTjay/Machine-Learning-With-Python/master/datasets/diabetes.csv
Objective: I want to examine how many people under the age of 30 have diabetes, which is indicated by a 1 or 0 in the “Outcome” column of the dataset, and plot it to see if there is a class imbalance (more of 0, more of 1, or roughly equal?).
Method:
- Filter my dataset as such:
ages_under30 = data.loc[data.Age < 30].loc[:,["Age"]]
outcome_under30 = data.loc[data.Age < 30].loc[:,["Outcome"]]
This successfully returns all of the ages under 30 and what the person’s outcome is (0 or 1).
- I want to plot the points to see what the class representation looks like. Are certain ages more often diabetic? The x-axis would be “ages_under30” and the y-axis would be “outcome_under30”.
plt.grid()
plt.xlabel("Age")
plt.ylabel("Diabetic?")
plt.plot(ages_under30, outcome_under30, "o")
See figure above. This is where I need help. You can't really make heads or tails of this. There is a class imbalance in this age group: in fact, 312 samples are non-diabetic while only 84 are diabetic. How can I adjust the plot to better depict this class imbalance?
Answer
- The difference in 'Outcome' for each 'Age' can most easily be seen with a bar plot showing the count, which can be done directly with seaborn.countplot, or by calculating the counts in pandas and plotting with pandas.DataFrame.plot.
- Tested in python 3.8.12, pandas 1.3.3, matplotlib 3.4.3, seaborn 0.11.2
Data and Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# data
df = pd.read_csv('https://raw.githubusercontent.com/LahiruTjay/Machine-Learning-With-Python/master/datasets/diabetes.csv')

# filter for less than 30
u30 = df[df.Age.lt(30)]
Use seaborn.countplot
- Directly show the counts of observations in each categorical bin using bars.
- This can also be done with seaborn.catplot and kind='count', which creates a figure-level plot.
sns.countplot(data=u30, x='Age', hue='Outcome')
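For comparison, the figure-level alternative mentioned above could look roughly like the sketch below. The small `u30` frame here is only a ten-row excerpt of the under-30 records listed in the Data section, built inline so the snippet runs without the GitHub link; with the real data, the `u30` from the filter step would be used instead.

```python
import pandas as pd
import seaborn as sns

# ten-row excerpt of the under-30 records (illustrative only, not the full data)
u30 = pd.DataFrame({
    "Age":     [21, 26, 29, 27, 29, 22, 28, 22, 28, 27],
    "Outcome": [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
})

# kind='count' makes catplot draw the same bars as countplot,
# but it returns a FacetGrid (figure-level) instead of an Axes
g = sns.catplot(data=u30, x="Age", hue="Outcome", kind="count")

# call matplotlib.pyplot.show() afterwards when running this as a script
```

Because the result is a FacetGrid, figure-wide settings go through `g` (for example `g.set_axis_labels(...)`), rather than through a single Axes object.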
Use pandas.crosstab and pandas.DataFrame.plot
- Use .crosstab to compute a frequency table between 'Age' and 'Outcome'.
- This can also be done with groupby, but then the dataframe requires further manipulation for plotting.
# reshape the dataframe
ct = pd.crosstab(u30.Age, u30.Outcome)

# plot
ct.plot(kind='bar', rot=0)
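To illustrate the "further manipulation" that the groupby route needs, here is a minimal sketch that builds the same frequency table with groupby, size, and unstack. The inline `u30` frame is again just a ten-row excerpt of the under-30 records, and the name `gb` is made up for this example.

```python
import pandas as pd

# ten-row excerpt of the under-30 records (illustrative only, not the full data)
u30 = pd.DataFrame({
    "Age":     [21, 26, 29, 27, 29, 22, 28, 22, 28, 27],
    "Outcome": [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
})

# count the rows for each (Age, Outcome) pair, then move Outcome into columns;
# fill_value=0 covers ages where one of the outcomes never occurs
gb = u30.groupby(["Age", "Outcome"]).size().unstack(fill_value=0)

# the crosstab version, for comparison
ct = pd.crosstab(u30.Age, u30.Outcome)

print(gb)
```

Once reshaped, `gb.plot(kind='bar', rot=0)` should draw the same grouped bars as the crosstab version.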
Data
- In case the data at the GitHub link is no longer available
Age,Outcome 21,0 26,1 29,0 27,0 29,1 22,0 28,1 22,0 28,0 27,1 26,0 25,1 29,0 22,0 24,0 22,0 26,0 21,0 22,0 21,0 24,0 25,0 27,0 28,1 26,0 23,0 22,0 22,0 27,0 26,1 24,0 22,0 22,0 22,0 27,0 26,0 24,0 21,0 21,0 24,0 22,0 23,0 22,0 21,0 24,0 27,0 21,0 27,0 25,0 24,1 24,1 23,0 25,0 25,0 22,0 21,0 25,1 24,0 23,0 23,1 26,1 23,0 26,0 21,0 22,0 29,0 28,0 22,0 23,0 21,0 22,0 24,0 23,0 21,0 23,0 22,0 27,0 21,0 22,0 29,0 29,0 29,1 25,0 23,0 26,1 23,0 21,0 27,0 25,1 21,0 29,1 21,0 23,1 26,1 29,1 21,0 28,0 27,0 27,0 21,0 25,0 24,0 24,1 25,1 21,1 26,0 22,0 26,0 24,1 24,0 22,1 22,0 29,0 23,0 26,1 23,1 27,0 21,0 22,0 22,1 29,0 23,0 23,0 27,0 24,0 25,0 21,1 25,0 24,0 27,1 24,0 25,1 24,0 21,0 28,1 21,0 21,0 25,0 29,1 23,0 22,0 28,1 29,1 26,0 21,0 25,1 24,1 28,0 29,1 24,0 25,1 28,1 29,0 21,0 25,1 22,0 27,1 25,0 26,0 29,1 28,0 25,1 21,0 24,0 23,1 25,0 22,0 26,0 22,0 22,0 22,0 23,0 26,0 29,0 24,0 21,0 28,1 29,1 29,1 29,1 21,0 22,0 25,1 21,0 21,0 25,0 28,0 22,0 22,0 24,0 22,0 21,0 25,0 25,0 24,0 28,0 27,1 21,0 25,0 22,1 25,0 25,1 26,0 25,0 28,1 28,0 25,0 22,0 21,0 21,1 22,1 22,0 27,0 28,1 26,0 21,0 21,0 21,0 25,0 26,0 23,0 22,0 29,0 29,1 28,0 21,0 22,0 24,0 25,1 28,0 26,0 22,1 26,0 23,0 23,1 25,0 24,0 24,0 26,0 21,0 22,0 25,0 27,0 28,0 22,0 22,0 24,0 29,1 29,0 28,0 23,0 24,1 21,0 28,0 24,0 22,0 25,0 21,0 28,0 21,0 21,0 21,0 22,0 24,0 28,1 25,0 26,0 26,0 24,0 21,0 21,0 24,0 22,0 22,0 24,0 29,0 24,0 23,1 23,0 27,1 25,0 29,0 28,0 21,0 25,0 23,0 28,0 28,1 24,0 27,0 22,0 21,0 21,0 22,0 22,0 23,0 25,0 21,1 21,1 27,0 22,0 29,0 25,0 24,0 25,0 22,1 21,0 26,0 24,0 28,0 21,0 22,1 25,0 27,0 23,0 24,0 26,0 27,0 23,0 24,1 28,0 28,0 21,0 21,0 29,0 21,0 21,0 21,0 24,0 23,0 22,0 23,0 28,0 27,0 24,0 27,0 22,1 23,0 23,0 27,0 28,0 27,0 22,0 25,1 22,0 27,1 22,1 24,0 21,0 22,0 25,0 25,1 23,0 22,0 26,1 22,0 27,1 25,0 22,0 29,0 23,0 23,0 25,0 22,0 28,0 26,0 26,0 27,0 28,0 22,0 23,1 24,0 21,0 24,0 21,0 25,0 22,0 22,0 22,0 22,1 24,1 22,0 28,0 21,0 21,0 26,0 22,0 27,1 22,1 28,0 25,0 26,1 26,0 22,0 27,0 23,0
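One possible way to load the listing above back into a dataframe is sketched below. It assumes the records are separated by single spaces, as they appear here, and the string `raw` holds only the header and the first five records for brevity; in practice the whole listing would be pasted in.

```python
import io
import pandas as pd

# header plus the first five records of the listing (paste the full text in practice)
raw = "Age,Outcome 21,0 26,1 29,0 27,0 29,1"

# each record is separated by a space, so turn the spaces into newlines
# and let read_csv parse the result as ordinary CSV text
u30 = pd.read_csv(io.StringIO(raw.replace(" ", "\n")))

print(u30)
```

The listing only contains the 'Age' and 'Outcome' columns for people under 30, so the resulting frame takes the place of `u30` directly and no further filtering is needed.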