Importing count() data for use within bokeh

I am trying to create a visualisation using the bokeh package which I have imported into the Databricks environment. I have transformed the data from a raw data frame into something resembling the following (albeit much larger):

columns = ['month', 'title']
data = [('2020-05', 'Paper 1'), ('2020-05', 'Paper 2'), ('2020-03', 'Paper 3'), ('2020-02', 'Paper 4'), ('2020-01', 'Paper 5')]

From there, I wish to create a line graph using the bokeh package to show the number of papers released per month (for the last 12 months). I have started using the code below:

df = df.groupBy('month').count().orderBy('month', ascending = False).limit(12)
df = df.orderBy('month', ascending = True)

This produces the table of results I need, in the correct order. However, when I use the code below to turn the resulting data (from the df above) into a line plot, I receive an error.

The code:

Month = []
Papers = []

for row in df.rdd.collect():
  Month.append(row.month)
  Papers.append(int(row.count))
    
print(Month)
print(Papers)

p = figure(title="Graph to show the release of new papers from January 2020", x_axis_label="Month", y_axis_label="Year")

p.line(Month, Papers, line_width=2)
show(p)

The error:

TypeError: int() argument must be a string, a bytes-like object or a number, not 'builtin_function_or_method'

Now, I can only assume this is because I am trying to use the ‘count’ column, created by a ‘built-in function’, to create the variables for my plot. My question is: is there a different way to create my table of results so that bokeh recognises this ‘count’ column as a string or an int, instead of a built-in function?

Answer

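A PySpark Row is a subclass of tuple, and count is a built-in tuple method, so row.count returns that method rather than the value of the count column. That is why int() complains about a 'builtin_function_or_method'. You can't get the count column of the Row using dot notation; instead, use square-bracket notation, e.g.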

for row in df.rdd.collect():
  Month.append(row['month'])
  Papers.append(int(row['count']))
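
The name clash can be reproduced without Spark. The sketch below uses FakeRow, a hypothetical and heavily simplified stand-in for pyspark.sql.Row (it is not the real class and only mimics the attribute and key lookup), to show why dot notation finds the tuple method while square brackets find the column:

```python
class FakeRow(tuple):
    """Hypothetical, simplified stand-in for pyspark.sql.Row (not the real class)."""

    def __new__(cls, **fields):
        row = super().__new__(cls, fields.values())
        row._names = list(fields)  # remember the column names in order
        return row

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal attribute lookup fails,
        # so names that tuple already defines (count, index) never get here.
        if name == "_names":
            raise AttributeError(name)
        try:
            return tuple.__getitem__(self, self._names.index(name))
        except ValueError:
            raise AttributeError(name) from None

    def __getitem__(self, key):
        # Allow lookup by column name as well as by position.
        if isinstance(key, str):
            key = self._names.index(key)
        return super().__getitem__(key)


row = FakeRow(month="2020-05", count=2)

by_attribute = row.count   # the inherited tuple.count method, not the column
by_key = row["count"]      # the column value

print(callable(by_attribute))
print(by_key)
```

The same reasoning applies to any column whose name matches a tuple attribute, such as index. If dot notation is preferred, another option would be to rename the column before collecting (for example with withColumnRenamed('count', 'papers'), where 'papers' is just an illustrative name), so that it no longer collides with a tuple method.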
