
Extract multiple words using regexp_extract in PySpark

I have a list of words, and I need to extract every matching word from a line of text. I found this, but it only extracts one word.

keys file content

this is a keyword

part_description file content

32015 this is a keyword hello world

Code


Outputs


Expected output


I want to

  1. return all matching keywords and their counts

  2. know whether step #4 is the most efficient way

Reproducible example:



Answer

I managed to solve it by using a UDF instead, as shown below.


Result

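For reference, a UDF-based approach of the kind described in this answer might be sketched roughly as follows. This is a minimal sketch, not the exact code from the answer. It assumes each entry in the keys file is a separate word and that matching should be case-insensitive and on whole words. The helper names `build_keyword_pattern`, `count_keywords`, and `add_keyword_counts`, and the output column `keyword_counts`, are hypothetical. The default column name `part_description` is only borrowed from the file name in the question.

```python
import re
from collections import Counter


def build_keyword_pattern(keywords):
    """Compile one regex that matches any keyword as a whole word."""
    # Longest first, so a longer keyword wins over a shorter one it starts with.
    ordered = sorted(set(keywords), key=len, reverse=True)
    if not ordered:
        raise ValueError("at least one keyword is required")
    alternation = "|".join(re.escape(k) for k in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", flags=re.IGNORECASE)


def count_keywords(text, pattern):
    """Return {keyword: occurrences} for every match found in text."""
    if text is None:
        return {}
    # findall returns ALL matches, unlike a single-match extract.
    return dict(Counter(m.lower() for m in pattern.findall(text)))


def add_keyword_counts(df, keywords, text_col="part_description"):
    """Attach a map<string,int> column of keyword counts to a Spark DataFrame."""
    # Imported lazily so the pure-Python helpers above work without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType, MapType, StringType

    pattern = build_keyword_pattern(keywords)
    counts_udf = F.udf(
        lambda text: count_keywords(text, pattern),
        MapType(StringType(), IntegerType()),
    )
    return df.withColumn("keyword_counts", counts_udf(F.col(text_col)))
```

Calling `add_keyword_counts(df, ["this", "is", "a", "keyword"])` would add one map column per row with each matched keyword and its count. A Python UDF is usually slower than built-in column functions, so on Spark versions that provide the `regexp_extract_all` SQL function, it may be worth comparing against that as well.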
User contributions licensed under: CC BY-SA