I am trying to create a visualisation using the bokeh package which I have imported into the Databricks environment. I have transformed the data from a raw data frame into something resembling the following (albeit much larger):
columns = ['month', 'title']
data = [('2020-05', 'Paper 1'), ('2020-05', 'Paper 2'), ('2020-03', 'Paper 3'), ('2020-02', 'Paper 4'), ('2020-01', 'Paper 5')]
From there, I wish to create a line graph using the bokeh package to show the number of papers released per month (for the last 12 months). I have started using the code below:
df = df.groupBy('month').count().orderBy('month', ascending = False).limit(12)
df = df.orderBy('month', ascending = True)
This produces the table of results I need, in the correct order. However, when I use the code below to turn the resulting data (from the df above) into a line plot, I receive an error.
The code:
Month = []
Papers = []
for row in df.rdd.collect():
    Month.append(row.month)
    Papers.append(int(row.count))
print(Month)
print(Papers)
p = figure(title="Graph to show the release of new papers from January 2020", x_axis_label="Month", y_axis_label="Year")
p.line(Month, Papers, line_width=2)
show(p)
The error:
TypeError: int() argument must be a string, a bytes-like object or a number, not 'builtin_function_or_method'
I can only assume this is because I am trying to use the ‘count’ column, created by a built-in function, to build the variables for my plot. Is there a different way to create my table of results so that bokeh recognises this ‘count’ column as a string or an int, instead of a built-in function?
Answer
count is a method of a Row (Row is a subclass of Python's tuple, so row.count resolves to the inherited tuple.count method), so you can’t get the count column of the Row using dot notation. Instead, you can use square-bracket notation, e.g.
for row in df.rdd.collect():
    Month.append(row['month'])
    Papers.append(int(row['count']))
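To see why dot notation fails here, the mechanism can be reproduced without Spark. The sketch below uses a hypothetical FakeRow class (not part of PySpark) that only imitates how a Row behaves: a tuple subclass that falls back to field names when ordinary attribute lookup finds nothing. Because tuple already defines count, the fallback is never reached for that name.

```python
class FakeRow(tuple):
    """Hypothetical stand-in for pyspark.sql.Row: a tuple with named fields."""

    def __new__(cls, **fields):
        row = super().__new__(cls, fields.values())
        row._fields = list(fields)  # remember the column names in order
        return row

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails, so names that
        # tuple already defines (count, index) never get here.
        if name.startswith('_'):
            raise AttributeError(name)
        try:
            return tuple.__getitem__(self, self._fields.index(name))
        except ValueError:
            raise AttributeError(name)

    def __getitem__(self, key):
        # String keys look up a column by name; anything else is positional.
        if isinstance(key, str):
            return tuple.__getitem__(self, self._fields.index(key))
        return tuple.__getitem__(self, key)


row = FakeRow(month='2020-05', count=2)

month_value = row.month      # no clash with a tuple method: the column value
count_attr = row.count       # clashes with tuple.count: a bound method
count_value = row['count']   # bracket access: the column value

try:
    int(count_attr)          # same kind of failure as in the question
    conversion_failed = False
except TypeError:
    conversion_failed = True
```

The same clash would apply to a column named index. If you would rather keep dot notation, another option is to rename the column before collecting, e.g. with df.withColumnRenamed('count', 'papers'), or to call row.asDict() and read the values from the resulting dictionary.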