Using join to find similarities between two datasets containing strings in PySpark

Question

I'm trying to match text records in two datasets, mostly using PySpark (avoiding libraries such as BM25 or NLP techniques as much as I can for now; using the Spark ML and Spark NLP libraries is fine). I'm close to finishing the pre-processing phase. I've cleaned the text in both datasets, tokenized it, and created bi-grams (stored in a column called

Accepted Answer

The reason is that when you display full_similarity_df, you will see two fullText columns and two biGrams columns, like below:

    +------+-----------+-------+------+--------+-------+
    |int_id|   fulltext|bigrams|ext_id|fulltext|bigrams|
    +------+-----------+-------+------+--------+-------+
    |     1|abc def fhg|abc def|     1| abc def|abc def|
    |     2|abc def fhg|abc fhg|  null|    null|   null|
    +------+-----------+-------+------+--------+-------+

So if you give them aliases, you won't get the duplicate column name issue:

    full_similarity_df = df1.join(df2, on=[df1.bigrams == df2.bigrams], how='outer').select(
        "int_id",
        df1.fulltext.alias("df1_fulltext"),
        df1.bigrams.alias("df1_bigrams"),
        "ext_id",
        df2.fulltext.alias("df2_fulltext"),
        df2.bigrams.alias("df2_bigrams"),
    )
    full_similarity_df.show()

    +------+------------+-----------+------+------------+-----------+
    |int_id|df1_fulltext|df1_bigrams|ext_id|df2_fulltext|df2_bigrams|
    +------+------------+-----------+------+------------+-----------+
    |     1| abc def fhg|    abc def|     1|     abc def|    abc def|
    |     2| abc def fhg|    abc fhg|  null|        null|       null|
    +------+------------+-----------+------+------------+-----------+

Answer