I was trying to plot some data from a pandas dataframe. My table contains 10000ish films and for each of them two info: the year it got published, and a rating from 0 to 3. I am having a hard time trying to plot a graph with the pandas library that shows the number of films that received a particular rating (3 in my case) every year.
I have tried to use .value_counts(), but it didn’t work as i hoped, since I can’t isolate a single value, maintaining the rating linked to its year.
I really hoped i decently explained my problem, since it is the first time i ask help on stack overflow.
This is the code i used to get my dataframe, if it is useful in any way.
import json
import requests
import pandas as pd
import numpy as np
request = requests.get("http://bechdeltest.com/api/v1/getAllMovies").json()
data = pd.DataFrame(request)
P.S. Thank you for the precious help!
Advertisement
Answer
You can filter by rating and use Series.value_counts:
s = data.loc[data['rating'].eq(3), 'year'].value_counts()
But there is many years of films:
print (len(s)) 108
So for plot I filter only counts greatwer like 30, it is 40 years here:
print (s.gt(30).sum()) 40
So filter again and plot:
s[s.gt(30)].plot.bar()
EDIT: Solution with percentages:
s=data.loc[data['rating'].eq(3),'year'].value_counts(normalize=True).sort_index().mul(100) print (s) 1899 0.018218 1910 0.018218 1916 0.054655 1917 0.054655 1918 0.054655 2018 3.169976 2019 3.188195 2020 2.040445 2021 1.840044 2022 0.765167 Name: year, Length: 108, dtype: float64 print (s[s.gt(3)]) 2007 3.042449 2009 3.588996 2010 3.825833 2011 4.299508 2012 4.153762 2013 4.937147 2014 4.335945 2015 3.771179 2016 3.752960 2017 3.388595 2018 3.169976 2019 3.188195 Name: year, dtype: float64 s[s.gt(3)].plot.bar()
EDIT1: Here is solution for count years vs ratings:
df = pd.crosstab(data['year'], data.rating)
print (df)
rating 0 1 2 3
year
1874 1 0 0 0
1877 1 0 0 0
1878 2 0 0 0
1881 1 0 0 0
1883 1 0 0 0
.. .. .. ...
2018 19 44 24 174
2019 16 47 18 175
2020 10 17 11 112
2021 11 22 13 101
2022 3 14 5 42
[141 rows x 4 columns]
EDIT2:
df = pd.crosstab(data['year'], data.rating, normalize='index').mul(100)
print (df)
rating 0 1 2 3
year
1874 100.000000 0.000000 0.000000 0.000000
1877 100.000000 0.000000 0.000000 0.000000
1878 100.000000 0.000000 0.000000 0.000000
1881 100.000000 0.000000 0.000000 0.000000
1883 100.000000 0.000000 0.000000 0.000000
... ... ... ...
2018 7.279693 16.858238 9.195402 66.666667
2019 6.250000 18.359375 7.031250 68.359375
2020 6.666667 11.333333 7.333333 74.666667
2021 7.482993 14.965986 8.843537 68.707483
2022 4.687500 21.875000 7.812500 65.625000
[141 rows x 4 columns]
There is alot of values, here is e.g. filter for column 3 for more like 60% values:
print (df[3].gt(60).sum()) 26 df[df[3].gt(60)].plot.bar()