I’m trying to use the pandas_profiling package to automagically describe some data frames from inside Apache Zeppelin.
The code I’m running is:
%pyspark
import sys
print(sys.version_info)
import numpy as np
print("numpy: ", np.__version__)
import pandas as pd
print("pandas: ", pd.__version__)
import pandas_profiling as pp
print("pandas_profiling: ", pp.__version__)
from pandas_profiling import ProfileReport

df = spark.sql("SELECT * FROM database.table")
profile = ProfileReport(df, title="Report: table")
profile.to_widgets()
My result is:
sys.version_info(major=3, minor=6, micro=8, releaselevel='final', serial=0)
numpy: 1.19.5
pandas: 1.1.5
pandas_profiling: 3.1.0
Fail to execute line 19: profile.to_widgets()
Traceback (most recent call last):
  File "/tmp/1662648724242-0/zeppelin_python.py", line 158, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 19, in <module>
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/profile_report.py", line 414, in to_widgets
    display(self.widgets)
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/profile_report.py", line 197, in widgets
    self._widgets = self._render_widgets()
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/profile_report.py", line 315, in _render_widgets
    report = self.report
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/profile_report.py", line 179, in report
    self._report = get_report_structure(self.config, self.description_set)
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/profile_report.py", line 166, in description_set
    self._sample,
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/model/describe.py", line 56, in describe
    check_dataframe(df)
  File "/usr/local/lib/python3.6/site-packages/multimethod/__init__.py", line 209, in __call__
    return self[tuple(map(self.get_type, args))](*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/model/dataframe.py", line 10, in check_dataframe
    raise NotImplementedError()
NotImplementedError
Any way to work around this? Any hope of working around it from inside Zeppelin?
Answer
The NotImplementedError is being raised from check_dataframe: https://github.com/ydataai/pandas-profiling/blob/v3.1.0/src/pandas_profiling/model/dataframe.py#L10. check_dataframe uses multimethod to dispatch on argument types, and the only implementation currently registered handles Pandas DataFrames: https://github.com/ydataai/pandas-profiling/blob/v3.1.0/src/pandas_profiling/model/pandas/dataframe_pandas.py#L11. In your snippet you are supplying a Spark DataFrame (the result of spark.sql(...)), for which no implementation is registered, so dispatch falls through to the base function that raises. If you convert the Spark DataFrame to a Pandas DataFrame with the toPandas method, the correct check_dataframe implementation will be called:
%pyspark
import sys
print(sys.version_info)
import numpy as np
print("numpy: ", np.__version__)
import pandas as pd
print("pandas: ", pd.__version__)
import pandas_profiling as pp
print("pandas_profiling: ", pp.__version__)
from pandas_profiling import ProfileReport

df = spark.sql("SELECT * FROM database.table").toPandas()
profile = ProfileReport(df, title="Report: table")
profile.to_widgets()
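One caveat: toPandas collects the entire result set onto the driver, which can exhaust driver memory for large tables. A common workaround is to profile a sample instead; the sketch below is illustrative, and the fraction, seed, and title are arbitrary choices:

%pyspark
from pandas_profiling import ProfileReport

# Profile roughly 10% of the rows instead of the full table; tune the
# fraction (and seed) to fit the data size and driver memory.
sample_df = spark.sql("SELECT * FROM database.table").sample(fraction=0.1, seed=42)
profile = ProfileReport(sample_df.toPandas(), title="Report: table (10% sample)")
profile.to_widgets()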
Alternatively, you can try to register your own function for checking Spark DataFrames, e.g.:
from pandas_profiling.model.dataframe import check_dataframe
from pyspark.sql import DataFrame as SparkDataFrame

@check_dataframe.register
def spark_check_dataframe(df: SparkDataFrame):
    # do something here or just make it a `pass`
    pass
but downstream functions in the reporting logic may not be (and likely are not) compatible with Spark DataFrames.
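To make the dispatch mechanics concrete, here is a minimal standalone sketch (the check function and its dict overload are hypothetical, purely for illustration) of how multimethod resolves a call by argument type and falls through to a base implementation that raises, which is the same path that produced the NotImplementedError in the traceback above:

from multimethod import multimethod

@multimethod
def check(obj):
    # Base implementation: reached when no more specific overload
    # matches, mirroring pandas-profiling's check_dataframe.
    raise NotImplementedError()

@check.register
def _(obj: dict):
    # Overload registered for dict arguments.
    print("dict handler called")

check({"a": 1})  # dispatches to the dict overload
try:
    check([1, 2, 3])  # no overload for list
except NotImplementedError:
    print("fell through to the base implementation")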
Another alternative, if you want to keep working with Spark DataFrames because of the scale of the data or your comfort with the API, is spark-df-profiling, which is based on pandas-profiling but built to handle Spark DataFrames.
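Usage looks roughly like the following; this is a sketch assuming spark-df-profiling exposes the ProfileReport class and to_file method described in its README (the table name and output path are placeholders), so verify against the version you install:

%pyspark
import spark_df_profiling

df = spark.sql("SELECT * FROM database.table")
report = spark_df_profiling.ProfileReport(df)
# Render the profile as a standalone HTML file (path is arbitrary).
report.to_file(outputfile="/tmp/table_report.html")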