Replicate a function from pandas into pyspark

I am trying to run the same function on a Spark dataframe instead of a pandas one.

Answer

A direct translation would require multiple collect calls, one per column calculation, and each call triggers a separate Spark job. I suggest computing all the column calculations as a single row of the dataframe and then collecting that row once. Here’s an example.

Calculate the percentage of whitespace values and the number of null values for every column.

We can convert the calculated fields into a dictionary so they are easy to use when creating lista.

Then use calc_dict when creating lista.
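
The layout of `lista` in the question is not shown here, so the per-column structure below is only an assumption, as are the key names in `calc_dict`. The point of the sketch is that building `lista` becomes plain Python lookups with no further Spark actions:

```python
# Hypothetical precomputed statistics, keyed as "<column>_<statistic>".
calc_dict = {
    "name_whitespace_pct": 25.0,
    "name_null_count": 1,
    "city_whitespace_pct": 50.0,
    "city_null_count": 1,
}
columns = ["name", "city"]

# Assumed shape for lista: one summary entry per column.
lista = [
    {
        "column": c,
        "whitespace_pct": calc_dict[f"{c}_whitespace_pct"],
        "null_count": calc_dict[f"{c}_null_count"],
    }
    for c in columns
]
print(lista)
```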

User contributions licensed under: CC BY-SA