Objective
To merge df_labelled, which contains a labelled subset of the points, into df, which contains all the points.
What I have tried
Referring to "Simple way to Dask concatenate (horizontal, axis=1, columns)", I tried the code below:
df = df.repartition(npartitions=200)
df = df.reset_index(drop=True)
df_labelled = df_labelled.repartition(npartitions=200)
df_labelled = df_labelled.reset_index(drop=True)
df = df.assign(label = df_labelled.label)
df.head()
But I get the error
ValueError: Not all divisions are known, can't align partitions. Please use set_index to set the index.
I also tried a left join of the two tables, but every label came back as NaN. Can you explain what I did wrong?
result = dd.merge(df, df_labelled, on=['x', 'y', 'z','R', 'G', 'B'], how="left")
result.head()
    x               y               z           R   G   B   label
0   39020.470000    33884.200003    36.445701   25  39  26  NaN
1   39132.740002    33896.049994    30.405700   19  24  18  NaN
2   39221.059994    33787.050001    26.605700   115 145 145 NaN
Is there any way to achieve the expected result below? I can't do this in Pandas because the number of points is too large to fit in memory.
Data
df (This file has all points)
    x               y               z           R   G   B
0   39047.700012    33861.890015    48.115704   7   18  12
1   39044.110016    33860.150024    47.135700   14  28  15
2   39049.280029    33861.950073    49.405701   30  58  33
3   39029.030000    33937.689993    48.425700   152 154 143
4   39066.980000    33937.870001    49.725699   209 218 225
5   39069.810002    33795.460001    42.405699   113 136 154
df_labelled (This file contains a portion of labelled points)
    x               y               z           R   G   B   label
0   39047.700012    33861.890015    48.115704   7   18  12  14
1   39044.110016    33860.150024    47.135700   14  28  15  14
2   39049.280029    33861.950073    49.405701   30  58  33  14
Expected outcome
    x               y               z           R   G   B   label
0   39047.700012    33861.890015    48.115704   7   18  12  14
1   39044.110016    33860.150024    47.135700   14  28  15  14
2   39049.280029    33861.950073    49.405701   30  58  33  14
3   39029.030000    33937.689993    48.425700   152 154 143 nan
4   39066.980000    33937.870001    49.725699   209 218 225 nan
5   39069.810002    33795.460001    42.405699   113 136 154 nan
Answer
I think the error is raised by this line:
df = df.assign(label = df_labelled.label)
because df and/or df_labelled has no index with known divisions, so Dask cannot align their partitions. Dask also does not support a MultiIndex the way Pandas does. Instead of relying on the index, pass the key columns explicitly as the left and right keys when you merge on more than one column in Dask. This works for me:
result = dd.merge(df, df_labelled, left_on=['x', 'y', 'z','R', 'G', 'B'], right_on = ['x', 'y', 'z','R', 'G', 'B'], suffixes=['_1', '_2'], how="left")