Dask concatenate 2 dataframes into 1 single datafr…

Objective

To merge df_labelled file with a portion of labelled points to df where contains all the point.

What I have tried

Referring to Simple way to Dask concatenate (horizontal, axis=1, columns), I tried the code below

df = df.repartition(npartitions=200)
df = df.reset_index(drop=True)
df_labelled = df_labelled.repartition(npartitions=200)
df_labelled = df_labelled.reset_index(drop=True)

df = df.assign(label = df_labelled.label)
df.head()

But I get the error

ValueError: Not all divisions are known, can’t align partitions. Please use set_index to set the index.

Another thing I have tried is to do left join of the table, but I got NaN for all label, can you explain what I did wrong?

result = dd.merge(df, df_labelled, on=['x', 'y', 'z','R', 'G', 'B'], how="left")
result.head()

    x               y               z           R   G   B   label
0   39020.470000    33884.200003    36.445701   25  39  26  NaN
1   39132.740002    33896.049994    30.405700   19  24  18  NaN
2   39221.059994    33787.050001    26.605700   115 145 145 NaN

Is there anyway I can achieve the expected result as below? I can’t run in Pandas because there are a lot of points which will cause memory issue in Pandas.

Data

df (This file has all points)

    x               y               z           R   G   B
0   39047.700012    33861.890015    48.115704   7   18  12  
1   39044.110016    33860.150024    47.135700   14  28  15
2   39049.280029    33861.950073    49.405701   30  58  33
3   39029.030000    33937.689993    48.425700   152 154 143
4   39066.980000    33937.870001    49.725699   209 218 225
5   39069.810002    33795.460001    42.405699   113 136 154

df_labelled (This file contains a portion of labelled points)

    x               y               z           R   G   B   label
0   39047.700012    33861.890015    48.115704   7   18  12  14
1   39044.110016    33860.150024    47.135700   14  28  15  14
2   39049.280029    33861.950073    49.405701   30  58  33  14

Expected outcome

    x               y               z           R   G   B   label
0   39047.700012    33861.890015    48.115704   7   18  12  14
1   39044.110016    33860.150024    47.135700   14  28  15  14
2   39049.280029    33861.950073    49.405701   30  58  33  14
3   39029.030000    33937.689993    48.425700   152 154 143 nan
4   39066.980000    33937.870001    49.725699   209 218 225 nan
5   39069.810002    33795.460001    42.405699   113 136 154 nan

Answer

I think when you do something like this then error:

df = df.assign(label = df_labelled.label)

because there is no index in dataframe df or/and df_labelled. And Dask doesn’t support multiple index as Pandas. Instead of using index, define the left key and right key if you have more than one key to merge dataframe in Dask. This one is works for me :

result = dd.merge(df, df_labelled, left_on=['x', 'y', 'z','R', 'G', 'B'], right_on = ['x', 'y', 'z','R', 'G', 'B'],  suffixes=['_1', '_2'], how="left")

Dask concatenate 2 dataframes into 1 single dataframe

Advertisement

Answer