I was trying to plot some data from a pandas dataframe. My table contains 10000ish films and for each of them two info: the year it got published, and a rating from 0 to 3. I am having a hard time trying to plot a graph with the pandas library that shows the number of films that received a particular rating (3 in my case) every year.
I have tried to use .value_counts(), but it didn’t work as i hoped, since I can’t isolate a single value, maintaining the rating linked to its year.
I really hoped i decently explained my problem, since it is the first time i ask help on stack overflow.
This is the code i used to get my dataframe, if it is useful in any way.
import json import requests import pandas as pd import numpy as np request = requests.get("http://bechdeltest.com/api/v1/getAllMovies").json() data = pd.DataFrame(request)
P.S. Thank you for the precious help!
Advertisement
Answer
You can filter by rating
and use Series.value_counts
:
s = data.loc[data['rating'].eq(3), 'year'].value_counts()
But there is many years of films:
print (len(s)) 108
So for plot I filter only counts greatwer like 30, it is 40
years here:
print (s.gt(30).sum()) 40
So filter again and plot:
s[s.gt(30)].plot.bar()
EDIT: Solution with percentages:
s=data.loc[data['rating'].eq(3),'year'].value_counts(normalize=True).sort_index().mul(100) print (s) 1899 0.018218 1910 0.018218 1916 0.054655 1917 0.054655 1918 0.054655 2018 3.169976 2019 3.188195 2020 2.040445 2021 1.840044 2022 0.765167 Name: year, Length: 108, dtype: float64 print (s[s.gt(3)]) 2007 3.042449 2009 3.588996 2010 3.825833 2011 4.299508 2012 4.153762 2013 4.937147 2014 4.335945 2015 3.771179 2016 3.752960 2017 3.388595 2018 3.169976 2019 3.188195 Name: year, dtype: float64 s[s.gt(3)].plot.bar()
EDIT1: Here is solution for count years
vs ratings
:
df = pd.crosstab(data['year'], data.rating) print (df) rating 0 1 2 3 year 1874 1 0 0 0 1877 1 0 0 0 1878 2 0 0 0 1881 1 0 0 0 1883 1 0 0 0 .. .. .. ... 2018 19 44 24 174 2019 16 47 18 175 2020 10 17 11 112 2021 11 22 13 101 2022 3 14 5 42 [141 rows x 4 columns]
EDIT2:
df = pd.crosstab(data['year'], data.rating, normalize='index').mul(100) print (df) rating 0 1 2 3 year 1874 100.000000 0.000000 0.000000 0.000000 1877 100.000000 0.000000 0.000000 0.000000 1878 100.000000 0.000000 0.000000 0.000000 1881 100.000000 0.000000 0.000000 0.000000 1883 100.000000 0.000000 0.000000 0.000000 ... ... ... ... 2018 7.279693 16.858238 9.195402 66.666667 2019 6.250000 18.359375 7.031250 68.359375 2020 6.666667 11.333333 7.333333 74.666667 2021 7.482993 14.965986 8.843537 68.707483 2022 4.687500 21.875000 7.812500 65.625000 [141 rows x 4 columns]
There is alot of values, here is e.g. filter for column 3
for more like 60% values:
print (df[3].gt(60).sum()) 26 df[df[3].gt(60)].plot.bar()