I'm trying to perform keyphrase extraction in Python, using KeyBERT and pke's PositionRank. An extract of my code is below.
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer
import pke

text = "The life-cycle Global Warming Potential of the building resulting from the construction has been calculated for each stage in the life-cycle and is disclosed to investors and clients on demand" #text_cleaning(df_tassonomia.iloc[1077].text, sentence_adjustment, stop_words)

# Pke
extractor = pke.unsupervised.PositionRank()
extractor.load_document(text, language='en')
extractor.candidate_selection(maximum_word_number=5)
extractor.candidate_weighting(window=10)
keyphrases = extractor.get_n_best(n=10)
print(keyphrases)

# KeyBert
kw_model = KeyBERT(model="all-mpnet-base-v2")
keyphrases_2 = kw_model.extract_keywords(docs=text,
                                         vectorizer=KeyphraseCountVectorizer(),
                                         keyphrase_ngram_range=(1, 5),
                                         top_n=10
                                         )

print("")
print(keyphrases_2)
And here are the results:
[('cycle global warming potential', 0.44829175082921835), ('life', 0.17858359644549557), ('cycle', 0.15775994057934534), ('building', 0.09131084381406684), ('construction', 0.08860454878871142), ('investors', 0.05426710724030216), ('clients', 0.054111700289631526), ('stage', 0.045672396861507744), ('demand', 0.039158055731066406)]

[('cycle global warming potential', 0.5444), ('building', 0.4479), ('construction', 0.3476), ('investors', 0.1967), ('clients', 0.1519), ('demand', 0.1484), ('cycle', 0.1312), ('stage', 0.0931), ('life', 0.0847)]
I would like hyphenated compound words (such as life-cycle in the example) to be treated as a single word, but I cannot figure out how to exclude the hyphen (-) from the list of word separators.
Thank you in advance for any help. Francesca
Answer
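This may be a silly workaround, but it may help: replace the hyphens with underscores before extraction so the compound stays together as one token, then turn the underscores back into hyphens in the results.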
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer
import pke

text = ("The life-cycle Global Warming Potential of the building "
        "resulting from the construction has been calculated for each stage in "
        "the life-cycle and is disclosed to investors and clients on demand")

# Pke
tokens = text.split()
# Tokens that already contained an underscore must not be converted back later.
# Lowercased because the extracted keyphrases come back in lowercase.
original = {tok.lower() for tok in tokens if "_" in tok}
# Hide the hyphens so that "life-cycle" is not split into two words
text = text.replace("-", "_")
extractor = pke.unsupervised.PositionRank()
extractor.load_document(text, language='en')
extractor.candidate_selection(maximum_word_number=5)
extractor.candidate_weighting(window=10)
keyphrases = extractor.get_n_best(n=10)
# Restore the hyphens in the extracted keyphrases
keyphrases_replaced = []
for pair in keyphrases:
    if "_" in pair[0] and pair[0] not in original:
        keyphrases_replaced.append((pair[0].replace("_", "-"), pair[1]))
    else:
        keyphrases_replaced.append(pair)
print(keyphrases_replaced)

# KeyBert
# Same model as in the question; text still contains underscores at this point,
# so the hyphens can be restored in keyphrases_2 the same way if needed.
kw_model = KeyBERT(model="all-mpnet-base-v2")
keyphrases_2 = kw_model.extract_keywords(docs=text,
                                         vectorizer=KeyphraseCountVectorizer(),
                                         keyphrase_ngram_range=(1, 5),
                                         top_n=10
                                         )

print("")
print(keyphrases_2)
The output of the pke part should look like this:
[('life-cycle global warming potential', 0.5511001220016548), ('life-cycle', 0.20123353586644233), ('construction', 0.11945270995269436), ('building', 0.10637157845606555), ('investors', 0.06675114967366767), ('stage', 0.05503532672910801), ('clients', 0.0507262942318816), ('demand', 0.05056281895492815)]
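If the text may also contain free-standing dashes (for example " - " used as punctuation), a slightly more careful variant of the same trick is sketched below. It only touches hyphens that sit between two word characters, and it restores them in any list of (phrase, score) pairs, so it should work for both the pke and the KeyBERT results. The names `protect_hyphens`, `restore_hyphens` and `PLACEHOLDER` are made up for this sketch, and it assumes the underscore does not otherwise occur in your text; the sample scores are invented just to show the round trip.

```python
import re

# Hypothetical placeholder character; assumed not to occur in the input text.
PLACEHOLDER = "_"

# A hyphen counts as intra-word only when it sits between two word characters.
INTRA_WORD_HYPHEN = re.compile(r"(?<=\w)-(?=\w)")


def protect_hyphens(text):
    """Replace intra-word hyphens so compounds survive tokenization."""
    if PLACEHOLDER in text:
        raise ValueError("placeholder already present in the text")
    return INTRA_WORD_HYPHEN.sub(PLACEHOLDER, text)


def restore_hyphens(keyphrases):
    """Put the hyphens back in a list of (phrase, score) pairs."""
    return [(phrase.replace(PLACEHOLDER, "-"), score)
            for phrase, score in keyphrases]


protected = protect_hyphens("The life-cycle potential - as defined - of the building")
print(protected)

# Stand-in for an extractor result; phrases and scores are invented.
sample_result = [("life_cycle potential", 0.5), ("building", 0.1)]
print(restore_hyphens(sample_result))
```

You would pass `protect_hyphens(text)` to `load_document` or `extract_keywords`, and run `restore_hyphens` on whatever list comes back.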
I hope this helps :)