In the code below, the function passed to dask's map_partitions compares the month of a date field to an integer. This comparison fails with the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
What does this error mean, and how can I fix it?
import pandas as pd
import dask
import dask.dataframe as dd
import datetime

pdf = pd.DataFrame({
    'id2': [1, 1, 1, 2, 2],
    'balance': [150, 140, 130, 280, 260],
    'date2' : [datetime.datetime(2021,3,1), datetime.datetime(2021,4,1),
               datetime.datetime(2021,5,1), datetime.datetime(2021,1,1),
               datetime.datetime(2021,2,1)]
})

ddf = dd.from_pandas(pdf, npartitions=1)

def func2(obj):
    m = obj.date2.dt.month
    if m > 10:
        return 1
    else:
        return 2

ddf2 = ddf.map_partitions(func2, meta=int)
ddf2.compute()  # <-- fails here
Answer
With .map_partitions, each partition of the dask dataframe (which is a pandas dataframe) is passed to the function func2. As a result, obj.date2.dt.month refers to a Series, not a single value. Comparing it with an integer produces a boolean Series, and Python cannot reduce that to the single True or False that the if statement needs, hence the ValueError.
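The behaviour can be reproduced with pandas alone. The sketch below is only an illustration: the `months` Series is a made-up stand-in for obj.date2.dt.month, and the variable names are not from the original code.

```python
import pandas as pd

# Hypothetical stand-in for obj.date2.dt.month inside one partition
months = pd.Series([3, 4, 5, 1, 2])

# Comparing a Series with an integer is element-wise and yields a boolean Series
mask = months > 10

# Using that boolean Series as an `if` condition raises the ValueError
try:
    if mask:
        pass
    raised = False
    message = ""
except ValueError as exc:
    raised = True
    message = str(exc)

# Reducing the mask to one boolean (here with .any()) makes the test unambiguous
any_late = bool(mask.any())
```

Whether .any(), .all(), or an element-wise result is the right fix depends on what the function is meant to return for a whole partition.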
As one option, the snippet below creates a new column whose value is conditional on the dt.month result:
import pandas as pd
import dask
import dask.dataframe as dd
import datetime

pdf = pd.DataFrame({
    'id2': [1, 1, 1, 2, 2],
    'balance': [150, 140, 130, 280, 260],
    'date2' : [datetime.datetime(2021,3,1), datetime.datetime(2021,4,1),
               datetime.datetime(2021,5,1), datetime.datetime(2021,1,1),
               datetime.datetime(2021,2,1)]
})

ddf = dd.from_pandas(pdf, npartitions=1)

def func2(obj):
    # obj is one partition (a pandas dataframe), so m is a Series of months
    m = obj.date2.dt.month
    # Boolean masks select the rows to fill, instead of a scalar if/else
    obj.loc[m>10, 'new_int']=1
    obj.loc[m<=10, 'new_int']=2
    return obj

ddf2 = ddf.map_partitions(func2)
ddf2.compute()
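As another option, the condition can be evaluated element-wise in a single step with numpy.where, returning a Series instead of modifying the partition. This is a minimal sketch, not part of the original answer: the function name `month_flag`, the column name `new_int`, and the two sample dates are assumptions made for illustration, and the function is called on a plain pandas dataframe here to keep the example self-contained.

```python
import datetime

import numpy as np
import pandas as pd


def month_flag(part):
    # part is one partition, i.e. a plain pandas dataframe
    m = part["date2"].dt.month
    # np.where works element-wise: 1 where month > 10, otherwise 2
    values = np.where(m > 10, 1, 2)
    return pd.Series(values, index=part.index, name="new_int").astype("int64")


# Hypothetical sample data with one month on each side of the threshold
pdf = pd.DataFrame({
    "date2": [datetime.datetime(2021, 3, 1), datetime.datetime(2021, 11, 1)],
})

flags = month_flag(pdf)
```

With dask, the same function could presumably be applied as ddf.map_partitions(month_flag, meta=("new_int", "int64")), where the meta tuple declares the name and dtype of the resulting Series.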