Objective
To merge df_labelled
file with a portion of labelled points to df
where contains all the point.
What I have tried
Referring to Simple way to Dask concatenate (horizontal, axis=1, columns), I tried the code below
df = df.repartition(npartitions=200) df = df.reset_index(drop=True) df_labelled = df_labelled.repartition(npartitions=200) df_labelled = df_labelled.reset_index(drop=True) df = df.assign(label = df_labelled.label) df.head()
But I get the error
ValueError: Not all divisions are known, can’t align partitions. Please use
set_index
to set the index.
Another thing I have tried is to do left join of the table, but I got NaN for all label, can you explain what I did wrong?
result = dd.merge(df, df_labelled, on=['x', 'y', 'z','R', 'G', 'B'], how="left") result.head() x y z R G B label 0 39020.470000 33884.200003 36.445701 25 39 26 NaN 1 39132.740002 33896.049994 30.405700 19 24 18 NaN 2 39221.059994 33787.050001 26.605700 115 145 145 NaN
Is there anyway I can achieve the expected result as below? I can’t run in Pandas because there are a lot of points which will cause memory issue in Pandas.
Data
df (This file has all points)
x y z R G B 0 39047.700012 33861.890015 48.115704 7 18 12 1 39044.110016 33860.150024 47.135700 14 28 15 2 39049.280029 33861.950073 49.405701 30 58 33 3 39029.030000 33937.689993 48.425700 152 154 143 4 39066.980000 33937.870001 49.725699 209 218 225 5 39069.810002 33795.460001 42.405699 113 136 154
df_labelled (This file contains a portion of labelled points)
x y z R G B label 0 39047.700012 33861.890015 48.115704 7 18 12 14 1 39044.110016 33860.150024 47.135700 14 28 15 14 2 39049.280029 33861.950073 49.405701 30 58 33 14
Expected outcome
x y z R G B label 0 39047.700012 33861.890015 48.115704 7 18 12 14 1 39044.110016 33860.150024 47.135700 14 28 15 14 2 39049.280029 33861.950073 49.405701 30 58 33 14 3 39029.030000 33937.689993 48.425700 152 154 143 nan 4 39066.980000 33937.870001 49.725699 209 218 225 nan 5 39069.810002 33795.460001 42.405699 113 136 154 nan
Advertisement
Answer
I think when you do something like this then error:
df = df.assign(label = df_labelled.label)
because there is no index in dataframe df or/and df_labelled. And Dask doesn’t support multiple index as Pandas. Instead of using index, define the left key and right key if you have more than one key to merge dataframe in Dask. This one is works for me :
result = dd.merge(df, df_labelled, left_on=['x', 'y', 'z','R', 'G', 'B'], right_on = ['x', 'y', 'z','R', 'G', 'B'], suffixes=['_1', '_2'], how="left")