I’m trying to use Dask instead of pandas because the dataset I’m analyzing is quite large. I want to add a flag column based on several conditions.
    import dask.array as da
    data['Flag'] = da.where((data['col1']>0) & (data['col2']>data['col4'] | data['col3']>data['col4']), 1, 0).compute()
But then I got an error message. The same code works perfectly when using np.where with a pandas DataFrame, but it does not work with dask.array.where.
Answer
If NumPy works and the operation is row-wise, then one solution is to use .map_partitions:
    import numpy as np

    def create_flag(data):
        # Each comparison is parenthesized because | and & bind more tightly than >.
        data['Flag'] = np.where((data['col1']>0) & ((data['col2']>data['col4']) | (data['col3']>data['col4'])), 1, 0)
        return data

    ddf = ddf.map_partitions(create_flag)