Filter pyspark DataFrame by string match

Question

I would like to check for a substring match between the comments and keywords columns, and find whether any of the keywords is present in that particular row.

Accepted Answer

The most efficient approach here is to loop over the rows; you can use set intersection:

    df['match'] = [set(c.split()).intersection(k.split(',')) > set()
                   for c, k in zip(df['comments'], df['keywords'])]

Output:

       name               comments                keywords  match
    0  paul      account is active  active,activated,activ   True
    1  john   account is activated  active,activated,activ   True
    2   max  account is activateds  active,activated,activ  False

Used input:

    import pandas as pd

    df = pd.DataFrame({'name': ['paul', 'john', 'max'],
                       'comments': ['account is active', 'account is activated', 'account is activateds'],
                       'keywords': ['active,activated,activ', 'active,activated,activ', 'active,activated,activ']})

With a minor variation you could check for a substring match ("activ" would match "activateds"):

    df['substring'] = [any(w in c for w in k.split(','))
                       for c, k in zip(df['comments'], df['keywords'])]

Output:

       name               comments                keywords  substring
    0  paul      account is active  active,activated,activ       True
    1  john   account is activated  active,activated,activ       True
    2   max  account is activateds  active,activated,activ       True
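To make the difference between the two checks explicit, here is a minimal pure-Python sketch of the per-row logic, without pandas. The function names `has_whole_word_match` and `has_substring_match` are hypothetical, chosen only for illustration. It assumes comments are split on whitespace and keywords on commas, as in the answer above.

```python
def has_whole_word_match(comment: str, keywords: str) -> bool:
    """True if any comma-separated keyword equals a whole word of the comment."""
    words = set(comment.split())
    # isdisjoint is False as soon as one keyword is also a word
    return not words.isdisjoint(keywords.split(","))


def has_substring_match(comment: str, keywords: str) -> bool:
    """True if any comma-separated keyword occurs anywhere inside the comment."""
    return any(keyword in comment for keyword in keywords.split(","))


# Same rows as the "Used input" data (comments, keywords)
rows = [
    ("account is active", "active,activated,activ"),
    ("account is activated", "active,activated,activ"),
    ("account is activateds", "active,activated,activ"),
]

whole_word = [has_whole_word_match(c, k) for c, k in rows]
substring = [has_substring_match(c, k) for c, k in rows]

print(whole_word)  # [True, True, False]
print(substring)   # [True, True, True]
```

One caveat with the substring variant: an empty keyword (for example from a trailing comma in `keywords`) is a substring of every string, so that row would always match. Filter out empty keywords first if that can occur in your data.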

Answer