Extract multiple words using regexp_extract in PySpark

Question

I have a list of words, and I need to extract every matching word from a line of text. I found this, but it only extracts one word.

keys file content:

    this is a keyword

part_description file content:

    32015 this is a keyword hello world

Expected output: I want to return all matching keywords and their counts and

Accepted Answer

I managed to solve it by using a UDF instead, as below:

    import re

    from pyspark.sql import functions as F
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType


    def build_regex(keywords):
        # Build an alternation such as (\bthis\b|\bis\b). Raw strings keep
        # \b as a regex word boundary instead of a backspace character.
        res = '('
        for key in keywords:
            res += r'\b' + key + r'\b|'
        res = res[0:len(res) - 1] + ')'
        return res


    def get_matching_string(line, regex):
        # Return every keyword occurrence in the line, or None if there is none.
        matches = re.findall(regex, line)
        return matches if matches else None


    udf_func = udf(lambda line, regex: get_matching_string(line, regex),
                   ArrayType(StringType()))

    df = (df.withColumn('matched', udf_func(df['line'], F.lit(build_regex(keywords))))
            .withColumn('count', F.size('matched')))

Result:

    +--------------------+--------------------+-----+
    |                line|             matched|count|
    +--------------------+--------------------+-----+
    |32015    this is ...|[this, is, this, ...|    5|
    |12832    Shb is a...|             [is, a]|    2|
    |35015    this is ...|          [this, is]|    2|
    +--------------------+--------------------+-----+
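The matching logic can be checked outside Spark before wrapping it in a UDF. The following is a minimal plain-Python sketch, not the answer author's code. The helper names `build_keyword_regex` and `count_keywords` are hypothetical, and the keyword list assumes the keys file holds the four words `this`, `is`, `a`, `keyword`. It uses `re.escape` in case a keyword contains regex metacharacters, and `collections.Counter` to get a per-keyword count rather than only a total.

```python
import re
from collections import Counter


def build_keyword_regex(keywords):
    # Hypothetical helper: escape each keyword, then wrap the whole
    # alternation in one pair of \b word boundaries.
    return r'\b(' + '|'.join(re.escape(k) for k in keywords) + r')\b'


def count_keywords(line, keywords):
    # Hypothetical helper: map each matched keyword to its occurrence count.
    return Counter(re.findall(build_keyword_regex(keywords), line))


# Assumed keyword list and the sample line from the question.
keywords = ['this', 'is', 'a', 'keyword']
line = '32015 this is a keyword hello world'

counts = count_keywords(line, keywords)
print(dict(counts))
```

Because of the word boundaries, `is` is not counted inside `this`, so each keyword in the sample line is counted once. The same function body could be called from a UDF that returns a map type instead of an array, if per-keyword counts are needed in the DataFrame.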

Answer