
NotImplementedError when calling pandas_profiling.ProfileReport.to_widgets() inside Apache Zeppelin

I’m trying to use the pandas_profiling package to automagically describe some data frames from inside Apache Zeppelin.

The code I’m running is:

%pyspark

import sys
print(sys.version_info)

import numpy as np
print("numpy: ", np.__version__)
import pandas as pd
print("pandas: ", pd.__version__)
import pandas_profiling as pp
print("pandas_profiling: ", pp.__version__)

from pandas_profiling import ProfileReport

df = spark.sql("SELECT * FROM database.table")

profile = ProfileReport(df, title="Report: table")

profile.to_widgets()

My result is:

sys.version_info(major=3, minor=6, micro=8, releaselevel='final', serial=0)
numpy:  1.19.5
pandas:  1.1.5
pandas_profiling:  3.1.0


Fail to execute line 19: profile.to_widgets()
Traceback (most recent call last):
  File "/tmp/1662648724242-0/zeppelin_python.py", line 158, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 19, in <module>
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/profile_report.py", line 414, in to_widgets
    display(self.widgets)
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/profile_report.py", line 197, in widgets
    self._widgets = self._render_widgets()
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/profile_report.py", line 315, in _render_widgets
    report = self.report
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/profile_report.py", line 179, in report
    self._report = get_report_structure(self.config, self.description_set)
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/profile_report.py", line 166, in description_set
    self._sample,
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/model/describe.py", line 56, in describe
    check_dataframe(df)
  File "/usr/local/lib/python3.6/site-packages/multimethod/__init__.py", line 209, in __call__
    return self[tuple(map(self.get_type, args))](*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/model/dataframe.py", line 10, in check_dataframe
    raise NotImplementedError()
NotImplementedError

Any way to work around this? Any hope of working around it from inside Zeppelin?


Answer

The NotImplementedError is raised from check_dataframe: https://github.com/ydataai/pandas-profiling/blob/v3.1.0/src/pandas_profiling/model/dataframe.py#L10. check_dataframe uses multimethod to dispatch on the type of its argument, and in v3.1.0 the only registered implementation is for Pandas DataFrames: https://github.com/ydataai/pandas-profiling/blob/v3.1.0/src/pandas_profiling/model/pandas/dataframe_pandas.py#L11. In your snippet you pass a Spark dataframe (the result of spark.sql(...)), for which there doesn’t appear to be any registered implementation, so the call falls through to the base function that raises. If you convert the Spark dataframe to a Pandas dataframe with the toPandas method, the Pandas implementation of check_dataframe should be called instead:

%pyspark

import sys
print(sys.version_info)

import numpy as np
print("numpy: ", np.__version__)
import pandas as pd
print("pandas: ", pd.__version__)
import pandas_profiling as pp
print("pandas_profiling: ", pp.__version__)

from pandas_profiling import ProfileReport

df = spark.sql("SELECT * FROM database.table").toPandas() 

profile = ProfileReport(df, title="Report: table")

profile.to_widgets()
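
To see why the conversion helps, here is a minimal sketch of type-based dispatch. It uses functools.singledispatch from the standard library as a stand-in for multimethod (an assumption made so the example runs without extra packages), and PandasLike, SparkLike, check_frame and try_check are hypothetical names, not part of pandas_profiling:

```python
from functools import singledispatch


class PandasLike:
    """Hypothetical stand-in for pandas.DataFrame."""


class SparkLike:
    """Hypothetical stand-in for pyspark.sql.DataFrame."""


@singledispatch
def check_frame(df):
    # Fallback for any type without a registered implementation,
    # analogous to the base check_dataframe seen in the traceback.
    raise NotImplementedError(f"no check registered for {type(df).__name__}")


@check_frame.register(PandasLike)
def _check_pandas_like(df):
    # The only registered implementation in this sketch.
    return "checked"


def try_check(df):
    """Return the check result, or the name of the exception raised."""
    try:
        return check_frame(df)
    except NotImplementedError as exc:
        return type(exc).__name__
```

Calling try_check(PandasLike()) reaches the registered implementation, while try_check(SparkLike()) falls through to the fallback and reports NotImplementedError, which is the same shape of failure as in the question.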

Alternatively, you can try registering your own function for checking Spark dataframes, e.g.:

from pandas_profiling.model.dataframe import check_dataframe
from pyspark.sql import DataFrame as SparkDataFrame

@check_dataframe.register
def spark_check_dataframe(df: SparkDataFrame) -> None:
    # Add Spark-specific validation here; `pass` accepts any Spark dataframe.
    pass

but downstream functions in the reporting logic may not be (and are likely not) compatible with Spark dataframes.

Another alternative, if you want to keep working with Spark dataframes because of the scale of the data or your familiarity with the API, is spark-df-profiling, which is based on pandas-profiling but built to handle Spark dataframes.
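
If the table is too large to pull onto the driver with toPandas, one possible middle ground is to profile a bounded subset instead. The following is only a sketch: limit and toPandas are existing PySpark DataFrame methods, but the helper name bounded_to_pandas and the default of 100,000 rows are assumptions chosen for illustration. Note that limit takes the first rows it finds, not a random sample.

```python
def bounded_to_pandas(spark_df, max_rows=100_000):
    """Collect at most max_rows rows of a Spark DataFrame into pandas.

    Hypothetical helper: it relies only on the object exposing
    limit(n) and toPandas(), as pyspark.sql.DataFrame does.
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    # limit() caps the row count before toPandas() collects to the driver.
    return spark_df.limit(max_rows).toPandas()
```

In the notebook this would be used as df = bounded_to_pandas(spark.sql("SELECT * FROM database.table")) before building the ProfileReport.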

User contributions licensed under: CC BY-SA