Filter pyspark DataFrame by string match

I would like to check for a substring match between the comments and keywords columns, and find out whether any one of the keywords is present in that particular row.

input

   name               comments                keywords
0  paul      account is active  active,activated,activ
1  john   account is activated  active,activated,activ
2   max  account is activateds  active,activated,activ

expected output

match 
True
True
True

Answer

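Assuming df is a pandas DataFrame (see "Used input"), the most efficient approach here is a plain Python loop over the rows. For whole-word matches you can use set intersection, which is why "activateds" in the last row does not match: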

# True when a comment shares at least one whole word with its keyword list
df['match'] = [not set(c.split()).isdisjoint(k.split(','))
               for c, k in zip(df['comments'], df['keywords'])]

Output:

   name               comments                keywords  match
0  paul      account is active  active,activated,activ   True
1  john   account is activated  active,activated,activ   True
2   max  account is activateds  active,activated,activ  False
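
To see why the last row comes out False, it may help to run the intersection by hand on that single row. This is only a sketch in plain Python; the variable names are chosen here for illustration and are not part of the answer's code:

```python
# Row 2 of the example as plain strings (illustrative variable names)
comment = "account is activateds"
keywords = "active,activated,activ"

words = set(comment.split())      # whitespace-separated words of the comment
keys = set(keywords.split(","))   # comma-separated keywords

common = words & keys             # whole-word overlap only
print(common)                     # set()
print(bool(common))               # False
```

"activateds" is not equal to any keyword as a whole word, so the intersection is empty.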

Used input:

import pandas as pd

df = pd.DataFrame({'name': ['paul', 'john', 'max'],
                   'comments': ['account is active', 'account is activated', 'account is activateds'],
                   'keywords': ['active,activated,activ', 'active,activated,activ', 'active,activated,activ']})

With a minor variation you can check for a substring match instead, so that "activ" also matches "activateds" and all three rows give True as in the expected output:

df['substring'] = [any(w in c for w in k.split(','))
                   for c,k in zip(df['comments'], df['keywords'])]

Output:

   name               comments                keywords  substring
0  paul      account is active  active,activated,activ       True
1  john   account is activated  active,activated,activ       True
2   max  account is activateds  active,activated,activ       True
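
If both behaviours are needed, the two comprehensions could be wrapped in one helper. The sketch below is an illustration, not part of the original answer: the function name has_keyword and its substring flag are hypothetical, and it assumes that empty keywords (for example from a trailing comma) should be ignored, since an empty string would otherwise match every comment in substring mode.

```python
def has_keyword(comment, keywords, substring=False):
    """Return True if any comma-separated keyword occurs in comment.

    has_keyword and the substring flag are hypothetical names for this sketch.
    """
    # Assumption: strip whitespace and drop empty keywords
    keys = [k.strip() for k in keywords.split(",") if k.strip()]
    if substring:
        # Keyword may appear anywhere inside the comment text
        return any(k in comment for k in keys)
    # Keyword must equal one whitespace-separated word of the comment
    return not set(comment.split()).isdisjoint(keys)


comments = ["account is active", "account is activated", "account is activateds"]
keywords = ["active,activated,activ"] * 3

whole = [has_keyword(c, k) for c, k in zip(comments, keywords)]
sub = [has_keyword(c, k, substring=True) for c, k in zip(comments, keywords)]
print(whole)  # [True, True, False]
print(sub)    # [True, True, True]
```

With the DataFrame above, the same call can be used inside the comprehension over zip(df['comments'], df['keywords']).
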
User contributions licensed under: CC BY-SA