I’m trying to use the pandas_profiling package to automagically describe some data frames from inside Apache Zeppelin.
The code I’m running is:
%pyspark
import sys
print(sys.version_info)
import numpy as np
print("numpy: ", np.__version__)
import pandas as pd
print("pandas: ", pd.__version__)
import pandas_profiling as pp
print("pandas_profiling: ", pp.__version__)
from pandas_profiling import ProfileReport

df = spark.sql("SELECT * FROM database.table")
profile = ProfileReport(df, title="Report: table")
profile.to_widgets()
My result is:
sys.version_info(major=3, minor=6, micro=8, releaselevel='final', serial=0)
numpy: 1.19.5
pandas: 1.1.5
pandas_profiling: 3.1.0
Fail to execute line 19: profile.to_widgets()
Traceback (most recent call last):
  File "/tmp/1662648724242-0/zeppelin_python.py", line 158, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 19, in <module>
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/profile_report.py", line 414, in to_widgets
    display(self.widgets)
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/profile_report.py", line 197, in widgets
    self._widgets = self._render_widgets()
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/profile_report.py", line 315, in _render_widgets
    report = self.report
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/profile_report.py", line 179, in report
    self._report = get_report_structure(self.config, self.description_set)
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/profile_report.py", line 166, in description_set
    self._sample,
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/model/describe.py", line 56, in describe
    check_dataframe(df)
  File "/usr/local/lib/python3.6/site-packages/multimethod/__init__.py", line 209, in __call__
    return self[tuple(map(self.get_type, args))](*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/model/dataframe.py", line 10, in check_dataframe
    raise NotImplementedError()
NotImplementedError
Any way to work around this? Any hope of working around it from inside Zeppelin?
Answer
The NotImplementedError is being raised from check_dataframe: https://github.com/ydataai/pandas-profiling/blob/v3.1.0/src/pandas_profiling/model/dataframe.py#L10. check_dataframe uses multimethod to dispatch on argument types, and the only implementation currently registered handles Pandas DataFrames: https://github.com/ydataai/pandas-profiling/blob/v3.1.0/src/pandas_profiling/model/pandas/dataframe_pandas.py#L11. In your snippet you are supplying a Spark DataFrame (the result of spark.sql(...)), for which no implementation is registered, so dispatch falls through to the base function that raises. If you convert the Spark DataFrame to a Pandas DataFrame with the toPandas method, the correct check_dataframe implementation will be called:
%pyspark
import sys
print(sys.version_info)
import numpy as np
print("numpy: ", np.__version__)
import pandas as pd
print("pandas: ", pd.__version__)
import pandas_profiling as pp
print("pandas_profiling: ", pp.__version__)
from pandas_profiling import ProfileReport

df = spark.sql("SELECT * FROM database.table").toPandas()
profile = ProfileReport(df, title="Report: table")
profile.to_widgets()
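One caveat: toPandas collects the entire result set onto the driver, which can exhaust driver memory for large tables. A common workaround is to profile a sample instead; the sketch below is illustrative, and the fraction, seed, and title are arbitrary choices:

%pyspark
from pandas_profiling import ProfileReport

# Profile roughly 10% of the rows instead of the full table; tune the
# fraction (and seed) to fit the data size and driver memory.
sample_df = spark.sql("SELECT * FROM database.table").sample(fraction=0.1, seed=42)
profile = ProfileReport(sample_df.toPandas(), title="Report: table (10% sample)")
profile.to_widgets()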
Alternatively, you can try to register your own function for checking Spark DataFrames, e.g.:
from pandas_profiling.model.dataframe import check_dataframe
from pyspark.sql import DataFrame as SparkDataFrame

@check_dataframe.register
def spark_check_dataframe(df: SparkDataFrame):
    # do something here or just make it a `pass`
    pass
but downstream functions in the reporting logic may not be (and likely are not) compatible with Spark DataFrames.
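To make the dispatch mechanics concrete, here is a minimal standalone sketch (the check function and its dict overload are hypothetical, purely for illustration) of how multimethod resolves a call by argument type and falls through to a base implementation that raises, which is the same path that produced the NotImplementedError in the traceback above:

from multimethod import multimethod

@multimethod
def check(obj):
    # Base implementation: reached when no more specific overload
    # matches, mirroring pandas-profiling's check_dataframe.
    raise NotImplementedError()

@check.register
def _(obj: dict):
    # Overload registered for dict arguments.
    print("dict handler called")

check({"a": 1})  # dispatches to the dict overload
try:
    check([1, 2, 3])  # no overload for list
except NotImplementedError:
    print("fell through to the base implementation")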
Another alternative, if you want to keep working with Spark DataFrames because of the scale of the data or your comfort with the API, is spark-df-profiling, which is based on pandas-profiling but built to handle Spark DataFrames.
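Usage looks roughly like the following; this is a sketch assuming spark-df-profiling exposes the ProfileReport class and to_file method described in its README (the table name and output path are placeholders), so verify against the version you install:

%pyspark
import spark_df_profiling

df = spark.sql("SELECT * FROM database.table")
report = spark_df_profiling.ProfileReport(df)
# Render the profile as a standalone HTML file (path is arbitrary).
report.to_file(outputfile="/tmp/table_report.html")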