I have a DataFrame like the one below.
id | js                                    |
0  | bla var test bla ..                   |
1  | bla function RAM blob                 |
2  | function CPU blob blob                |
3  | thanks                                |
4  | bla var AWS and function twitter blaa |
I am trying to extract the next word after "function" or "var".
Here is my code:
pattern3 = "(func)\s+(\w+)|(var)\s+(\w+)"
df = df.withColumn("js_extracted2", f.regexp_extract(f.col("js"), pattern3, 4))
Because regexp_extract captures only one match, the final row returns only AWS and not twitter.
I would like to capture all matches.
My Spark version is less than 3, so I tried
df.withColumn('output', f.expr("regexp_extract_all(js, '(func)\s+(\w+)|(var)\s+(\w+)', 4)")).show()
but it returns an empty result for every row.
My expected output is:
id | js                                    | output
0  | bla var test bla ..                   | [test]
1  | bla function RAM blob                 | [RAM]
2  | function CPU blob blob                | [CPU]
3  | thanks                                |
4  | bla var AWS and function twitter blaa | [AWS, twitter]
Answer
You need to use four backslashes (\\\\) for each escape in the regular expression, and match the whole word "function" rather than "func".
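from pyspark.sql import functions as F

# Four backslashes in the Python source reach the regex engine as a single backslash.
df = df.withColumn("js_extracted2", F.expr("regexp_extract_all(js, '(function|var)\\\\s+(\\\\w+)', 2)"))
df.show(truncate=False)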
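To see why four backslashes are needed, here is a minimal sketch in plain Python (no Spark required). It assumes the default Spark SQL parser behaviour, where string literals are unescaped once more after Python has already processed the source string. The helper names spark_sql_unescape and extract_all are hypothetical, the unescaping is simplified, and Python's re module stands in for the Java regex engine, which should behave the same for this pattern.

```python
import re

def spark_sql_unescape(literal: str) -> str:
    # Hypothetical helper: simplified stand-in for the SQL parser's default
    # unescaping, which replaces each backslash-escaped character with the
    # character itself (special cases such as \n are ignored in this sketch).
    return re.sub(r"\\(.)", r"\1", literal, flags=re.DOTALL)

def extract_all(regex: str, text: str, group: int) -> list:
    # Hypothetical helper: collect the given capture group from every match,
    # roughly what regexp_extract_all does for one row.
    return [m.group(group) for m in re.finditer(regex, text)]

rows = [
    "bla var test bla ..",
    "bla function RAM blob",
    "function CPU blob blob",
    "thanks",
    "bla var AWS and function twitter blaa",
]

# What is typed in the Python source, with four and with two backslashes.
sql_four = "(function|var)\\\\s+(\\\\w+)"
sql_two = "(function|var)\\s+(\\w+)"

# What the regex engine would receive after the second round of unescaping.
regex_four = spark_sql_unescape(sql_four)  # backslashes survive as \s and \w
regex_two = spark_sql_unescape(sql_two)    # backslashes are lost entirely

results_four = [extract_all(regex_four, r, 2) for r in rows]
results_two = [extract_all(regex_two, r, 2) for r in rows]

print(regex_four, results_four)
print(regex_two, results_two)
```

With two backslashes the pattern degrades to literal "s" and "w" characters, which matches nothing in these rows. An alternative with the same effect is a raw Python string containing two backslashes.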