
Error comparing dask date month with an integer

The function passed to dask's map_partitions in the code below takes the month of a datetime column and compares it to an integer. The comparison fails with the following error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

What does this error mean, and how can I fix it?

import pandas as pd
import dask
import dask.dataframe as dd
import datetime

pdf = pd.DataFrame({
    'id2': [1, 1, 1, 2, 2],
    'balance': [150, 140, 130, 280, 260],
    'date2' : [datetime.datetime(2021,3,1), datetime.datetime(2021,4,1), 
               datetime.datetime(2021,5,1), datetime.datetime(2021,1,1), 
               datetime.datetime(2021,2,1)]
})

ddf = dd.from_pandas(pdf, npartitions=1) 

def func2(obj):
    m = obj.date2.dt.month
    if m > 10:
        return 1
    else:
        return 2

ddf2 = ddf.map_partitions(func2, meta=int)
ddf2.compute()   # <-- fails here

Answer

With .map_partitions, each partition of the dask dataframe (which is a pandas dataframe) is passed to func2. As a result, obj.date2.dt.month is a Series, not a single value, so m > 10 produces a boolean Series with one entry per row. The if statement needs a single True or False, and pandas raises the error because it is ambiguous how several booleans should be collapsed into one.
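The behaviour can be reproduced with pandas alone. The following is a minimal sketch; the Series values are made up for illustration and the variable names are not from the question:

```python
import pandas as pd

# Hypothetical month values, standing in for obj.date2.dt.month
months = pd.Series([3, 4, 5, 1, 2])

# The comparison is element-wise and yields a boolean Series, not one bool
mask = months > 10

try:
    if mask:  # asks for a single truth value of a multi-element Series
        pass
except ValueError as exc:
    error_message = str(exc)  # the "truth value of a Series is ambiguous" message

# Explicit reductions state how the booleans should be collapsed
any_above = bool(mask.any())  # True if at least one month is > 10
all_above = bool(mask.all())  # True only if every month is > 10
```

Reducing with .any() or .all() only makes sense if one answer per partition is wanted. For a per-row result, the comparison should stay element-wise.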

As one option, the snippet below creates a new column whose value depends on the dt.month result:

import pandas as pd
import dask
import dask.dataframe as dd
import datetime

pdf = pd.DataFrame({
    'id2': [1, 1, 1, 2, 2],
    'balance': [150, 140, 130, 280, 260],
    'date2' : [datetime.datetime(2021,3,1), datetime.datetime(2021,4,1), 
               datetime.datetime(2021,5,1), datetime.datetime(2021,1,1), 
               datetime.datetime(2021,2,1)]
})

ddf = dd.from_pandas(pdf, npartitions=1) 

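def func2(obj):
    obj = obj.copy()  # work on a copy so the input partition is not mutated
    m = obj.date2.dt.month  # Series of month numbers, one per row
    obj.loc[m>10, 'new_int']=1   # rows with month after October
    obj.loc[m<=10, 'new_int']=2  # all remaining rows
    return obj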

ddf2 = ddf.map_partitions(func2)
ddf2.compute()
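An alternative is to compute the column in one step with numpy.where, which evaluates the condition row by row. This is a sketch, not the answer's original method: the function name add_new_int and the sample dates are assumptions chosen for illustration, and it is shown on a plain pandas dataframe because that is what each partition is.

```python
import datetime

import numpy as np
import pandas as pd


def add_new_int(partition):
    # partition is a plain pandas DataFrame (one dask partition)
    month = partition["date2"].dt.month
    # Element-wise choice: 1 where month > 10, otherwise 2.
    # assign returns a new DataFrame, so the input is left untouched.
    return partition.assign(new_int=np.where(month > 10, 1, 2))


# Hypothetical sample data with months on both sides of the threshold
pdf = pd.DataFrame({
    "id2": [1, 1, 2],
    "date2": [datetime.datetime(2021, 3, 1),
              datetime.datetime(2021, 11, 1),
              datetime.datetime(2021, 12, 1)],
})

result = add_new_int(pdf)
```

With the dask dataframe from the question, the same function would be applied as ddf.map_partitions(add_new_int).compute().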