Skip to content
Advertisement

Using join to find similarities between two datasets containing strings in PySpark

I’m trying to match text records in two datasets, mostly via using PySpark (not using libraries such as BM25 or NLP techniques as much as I can for now -using Spark ML and SparkNLP libraries are fine).

I’m towards finishing the pre-processing phase. I’ve cleaned the text in both datasets, tokenized it and created bi-Grams (stored in a column called biGrams in both datasets).

Also, below is the schema of the two datasets. Please note the different IDs.

df1 columns:
int_id
fullText
biGrams

df2 columns:
ext_id
fullText
biGrams

My plan is to outer join the two datasets, explode on the biGrams, and then return the similar ones using something like inner join.

Now, I want to join (outer join) the two datasets using biGram columns with the code below:

full_similarity_df = df1.join(df2, on=[df1.biGrams == df2.biGrams], how = 'outer')

But I get a long error message with tracebacks, with the following error as the main error:

Found duplicate column(s) when inserting into file

Why is that, and how can I rectify this?

Please let me know if you can think of any other suggestions for tackling this problem.

Thanks in advance, Rfai

Advertisement

Answer

The reason is when you display full_similarity_df you will see 2 fullText and biGrams columns like below

+------+-----------+-------+------+--------+-------+
|int_id|   fulltext|bigrams|ext_id|fulltext|bigrams|
+------+-----------+-------+------+--------+-------+
|     1|abc def fhg|abc def|     1| abc def|abc def|
|     2|abc def fhg|abc fhg|  null|    null|   null|
+------+-----------+-------+------+--------+-------+

so if you give and alias to them then you won’t get the duplicate column name issue

full_similarity_df = df1.join(df2, on=[df1.bigrams == df2.bigrams], how = 'outer').select("int_id",df1.fulltext.alias("df1_fulltext"),df1.bigrams.alias("df1_bigrams"),"ext_id",df2.fulltext.alias("df2_fulltext"),df2.bigrams.alias("df2_bigrams"))
full_similarity_df.show()
+------+------------+-----------+------+------------+-----------+
|int_id|df1_fulltext|df1_bigrams|ext_id|df2_fulltext|df2_bigrams|
+------+------------+-----------+------+------------+-----------+
|     1| abc def fhg|    abc def|     1|     abc def|    abc def|
|     2| abc def fhg|    abc fhg|  null|        null|       null|
+------+------------+-----------+------+------------+-----------+
Advertisement