pyspark regex extract all

I have a dataframe like the one below.

id  | js                                    |
0   | bla var test bla ..                   |
1   | bla function RAM blob                 |
2   | function CPU blob blob                |
3   | thanks                                |
4   | bla var AWS and function twitter blaa |

I am trying to extract the word that follows function or var.

Here is my code.

pattern3 = "(func)\s+(\w+)|(var)\s+(\w+)"

df = df.withColumn("js_extracted2", f.regexp_extract(f.col("js"),pattern3,4))

Because regexp_extract captures only one match, the final row returns only AWS and not twitter.

So I would like to capture all matches.

My Spark version is less than 3, so I tried:

df.withColumn('output', f.expr("regexp_extract_all(js, '(func)\s+(\w+)|(var)\s+(\w+)', 4)")).show()

but it returns an empty result for every row.

My expected output is:

id  | js                                    | output
0   | bla var test bla ..                   | [test]
1   | bla function RAM blob                 | [RAM]
2   | function CPU blob blob                | [CPU]
3   | thanks                                | 
4   | bla var AWS and function twitter blaa | [AWS, twitter]

Answer

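You need to use four backslashes (\\\\) for each regex backslash. Python's string escaping reduces them to two, and Spark's SQL parser reduces those to the single backslash the regex engine expects. Putting both keywords in one group also means the word you want is always group 2.

from pyspark.sql import functions as F

df = df.withColumn("js_extracted2", F.expr("regexp_extract_all(js, '(function|var)\\\\s+(\\\\w+)', 2)"))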
df.show(truncate=False)
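
As a quick sanity check outside Spark, the same pattern can be tried with Python's re module on the rows from the question. This is only a sketch: extract_all, rows and sql_text are names made up for this illustration, and it assumes Python's re and Spark's Java regex engine treat this simple pattern the same way. The last lines print the SQL text that Python actually hands to Spark when four backslashes are typed.

```python
import re

# The pattern as the regex engine should finally see it (one backslash each).
PATTERN = re.compile(r"(function|var)\s+(\w+)")

# Sample rows copied from the question's dataframe.
rows = [
    "bla var test bla ..",
    "bla function RAM blob",
    "function CPU blob blob",
    "thanks",
    "bla var AWS and function twitter blaa",
]


def extract_all(text):
    """Return group 2 of every match, mimicking regexp_extract_all(..., 2)."""
    return [m.group(2) for m in PATTERN.finditer(text)]


results = [extract_all(row) for row in rows]
for row, out in zip(rows, results):
    print(f"{row!r} -> {out}")

# Four typed backslashes become two in the Python string; Spark's SQL
# parser then unescapes those two into the single one the regex needs.
sql_text = "regexp_extract_all(js, '(function|var)\\\\s+(\\\\w+)', 2)"
print(sql_text)
```

A row with no match, such as "thanks", gives an empty list here.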
User contributions licensed under: CC BY-SA