I have a list of words, and I need to extract every word from that list that appears in a line of text. I found the approach below, but it only extracts one word.
keys file content
this is a keyword
part_description file content
32015 this is a keyword hello world
Code
import pyspark.sql.functions as F
from pyspark.sql import Row

keywords = sc.textFile('file:///home/description_search/keys') #1
part_description = sc.textFile('file:///description_search/part_description') #2
keywords = keywords.map(lambda x: x.split(' ')) #3
keywords = keywords.collect()[0] #4
df = part_description.map(lambda r: Row(r)).toDF(['line']) #5
df.withColumn('extracted_word', F.regexp_extract(df['line'],'|'.join(keywords), 0)).show() #6
Outputs
+--------------------+--------------+
|                line|extracted_word|
+--------------------+--------------+
|     32015 this is a|          this|
+--------------------+--------------+
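The single-word result follows from how `regexp_extract` works: it returns only the first match of the pattern. Python's `re` module has the same distinction, so a minimal plain-Python sketch (not Spark code) on the sample line shows the difference between `re.search` and `re.findall`:

```python
import re

keywords = ['this', 'is', 'a', 'keyword']
line = '32015 this is a keyword hello world'

# Same pattern the question builds for regexp_extract: 'this|is|a|keyword'
pattern = '|'.join(keywords)

# re.search stops at the first match, which mirrors the one-word output
first_match = re.search(pattern, line).group(0)

# re.findall keeps scanning and returns every non-overlapping match
all_matches = re.findall(pattern, line)

print(first_match)
print(all_matches)
```

Because this pattern has no word boundaries, a keyword such as `a` would also match inside longer words on other input lines.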
Expected output
+--------------------+-----------------+
|                line|   extracted_word|
+--------------------+-----------------+
|     32015 this is a|this,is,a,keyword|
+--------------------+-----------------+
I want to return all matching keywords and their count. I would also like to know whether step #4 is the most efficient way to load the keywords.
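On step #4: `collect()[0]` keeps only the tokens from the first line of the keys file, and the whole list ends up on the driver anyway. If the file is small and readable from the driver (an assumption; the question's path uses `file:///`), reading it with plain Python may be simpler. A rough sketch follows, where `load_keywords` is a hypothetical helper and the temporary file only stands in for the real keys file:

```python
import tempfile
from pathlib import Path


def load_keywords(path):
    """Return every whitespace-separated token in the file, in order, without duplicates."""
    tokens = Path(path).read_text().split()
    # dict.fromkeys preserves first-seen order while dropping repeats
    return list(dict.fromkeys(tokens))


# Demo with a throwaway file standing in for the keys file
with tempfile.TemporaryDirectory() as tmp:
    keys_file = Path(tmp) / 'keys'
    keys_file.write_text('this is a keyword\nhello is\n')
    keywords = load_keywords(keys_file)

print(keywords)
```

Unlike `collect()[0]`, this picks up keywords from every line of the file, not only the first.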
Reproducible example:
keywords = ['this', 'is', 'a', 'keyword']
l = [('32015 this is a keyword hello world',),
     ('keyword this',),
     ('32015 this is a keyword hello world 32015 this is a keyword hello world',),
     ('keyword keyword',),
     ('is a',)]

columns = ['line']

df = spark.createDataFrame(l, columns)
Answer
I managed to solve it by using a UDF instead, as shown below.
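import re

import pyspark.sql.functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType


def build_regex(keywords):
    # Build one alternation such as (\bthis\b|\bis\b|...).
    # Raw strings are needed: a plain '\b' is a backspace character in Python.
    res = '('
    for key in keywords:
        res += r'\b' + key + r'\b|'
    res = res[0:len(res) - 1] + ')'  # drop the trailing '|'

    return res


def get_matching_string(line, regex):
    # re.findall returns every non-overlapping match, not just the first one
    matches = re.findall(regex, line)
    return matches if matches else None


udf_func = udf(lambda line, regex: get_matching_string(line, regex),
               ArrayType(StringType()))

df = df.withColumn('matched', udf_func(df['line'], F.lit(build_regex(keywords)))).withColumn('count', F.size('matched'))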
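The matching logic can be checked without Spark. The sketch below is a plain-Python variant rather than the UDF itself: `build_pattern` and `find_keywords` are hypothetical names, `re.escape` is added on the assumption that keywords might contain regex metacharacters, and an empty list is returned instead of `None` so that a length is always defined.

```python
import re


def build_pattern(keywords):
    # Escape each keyword, then wrap the alternation in word boundaries.
    # The non-capturing group (?:...) makes findall return whole matches.
    alternatives = '|'.join(re.escape(key) for key in keywords)
    return re.compile(r'\b(?:' + alternatives + r')\b')


def find_keywords(line, pattern):
    # Every non-overlapping keyword occurrence, in order of appearance
    return pattern.findall(line)


keywords = ['this', 'is', 'a', 'keyword']
pattern = build_pattern(keywords)

lines = [
    '32015 this is a keyword hello world',
    'keyword this',
    'keyword keyword',
    'is a',
    'no match here',
]

results = {line: find_keywords(line, pattern) for line in lines}
for line, matched in results.items():
    print(len(matched), matched)
```

The word boundaries keep `a` from matching inside `match`, which is why the last line yields no keywords.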
Result
+--------------------+--------------------+-----+
|                line|             matched|count|
+--------------------+--------------------+-----+
|       32015 this is|    [this, is, this,|    5|
|      12832 Shb is a|             [is, a]|    2|
|       35015 this is|          [this, is]|    2|
+--------------------+--------------------+-----+
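The `count` column above is the total number of matches per line. If "their count" in the question means a count per keyword instead, one possible approach is `collections.Counter` over the matches. This is a plain-Python sketch on one of the sample lines, not part of the UDF above:

```python
import re
from collections import Counter

keywords = ['this', 'is', 'a', 'keyword']
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, keywords)) + r')\b')

line = '32015 this is a keyword hello world 32015 this is a keyword hello world'

# Tally how many times each keyword occurs in the line
counts = Counter(pattern.findall(line))
total = sum(counts.values())

print(dict(counts))
print(total)
```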