I’m trying to use Dask instead of pandas since the data size I’m analyzing is quite large. I wanted to add a flag column based on several conditions.
import dask.array as da data['Flag'] = da.where((data['col1']>0) & (data['col2']>data['col4'] | data['col3']>data['col4']), 1, 0).compute()
But, then I got the following error message. The above code works perfectly when using np.where
with pandas dataframe, but didn’t work with dask.array.where
.
Advertisement
Answer
If numpy works and the operation is row-wise, then one solution is to use .map_partitions
:
def create_flag(data): data['Flag'] = np.where((data['col1']>0) & (data['col2']>data['col4'] | data['col3']>data['col4']), 1, 0) return data ddf = ddf.map_partitions(create_flag)