
Keywords extraction in Python – How to handle hyphenated compound words

I’m trying to perform keyphrase extraction with Python, using KeyBERT and pke’s PositionRank. An extract of my code is below.

from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer
import pke

text = "The life-cycle Global Warming Potential of the building resulting from the construction has been calculated for each stage in the life-cycle and is disclosed to investors and clients on demand" #text_cleaning(df_tassonomia.iloc[1077].text, sentence_adjustment, stop_words)

# Pke
extractor = pke.unsupervised.PositionRank() 
extractor.load_document(text, language='en')
extractor.candidate_selection(maximum_word_number = 5) 
extractor.candidate_weighting(window = 10) 
keyphrases = extractor.get_n_best(n=10)
print(keyphrases)

# KeyBert
kw_model = KeyBERT(model = "all-mpnet-base-v2")
keyphrases_2 = kw_model.extract_keywords(docs=text, 
                                       vectorizer=KeyphraseCountVectorizer(), 
                                       keyphrase_ngram_range = (1,5),
                                       top_n=10
                                      )

print("")
print(keyphrases_2)

and here are the results (pke first, then KeyBERT):

[('cycle global warming potential', 0.44829175082921835), ('life', 0.17858359644549557), ('cycle', 0.15775994057934534), ('building', 0.09131084381406684), ('construction', 0.08860454878871142), ('investors', 0.05426710724030216), ('clients', 0.054111700289631526), ('stage', 0.045672396861507744), ('demand', 0.039158055731066406)]

[('cycle global warming potential', 0.5444), ('building', 0.4479), ('construction', 0.3476), ('investors', 0.1967), ('clients', 0.1519), ('demand', 0.1484), ('cycle', 0.1312), ('stage', 0.0931), ('life', 0.0847)]

I would like hyphenated compound words (such as life-cycle in the example) to be treated as a single word, but I cannot work out how to exclude the hyphen from the list of word separators.

Thank you in advance for any help. Francesca


Answer

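This may be a silly workaround, but it might help: replace the hyphens with underscores before extraction, then convert them back to hyphens in the results.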

from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer
import pke
text = "The life-cycle Global Warming Potential of the building resulting from the construction has been calculated for each stage in the life-cycle and is disclosed to investors and clients on demand"

# Pke
tokens = text.split()
# Tokens that already contain "_" in the original text; these are not converted back
original = set([x for x in tokens if "_" in x])
# Swap hyphens for underscores so that "life-cycle" is not split
text = text.replace("-", "_")
extractor = pke.unsupervised.PositionRank()
extractor.load_document(text, language='en')
extractor.candidate_selection(maximum_word_number=5)
extractor.candidate_weighting(window=10)
keyphrases = extractor.get_n_best(n=10)
# Restore the hyphens in the extracted keyphrases
keyphrases_replaced = []
for pair in keyphrases:
    if "_" in pair[0] and pair[0] not in original:
        keyphrases_replaced.append((pair[0].replace("_", "-"), pair[1]))
    else:
        keyphrases_replaced.append(pair)
print(keyphrases_replaced)
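# KeyBert
# kw_model was not defined in this snippet; same model as in the question
kw_model = KeyBERT(model="all-mpnet-base-v2")
keyphrases_2 = kw_model.extract_keywords(docs=text,
                                         vectorizer=KeyphraseCountVectorizer(),
                                         keyphrase_ngram_range=(1, 5),
                                         top_n=10
                                         )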

print("")
print(keyphrases_2)

The output of the pke part should look like this:

[('life-cycle global warming potential', 0.5511001220016548), ('life-cycle', 0.20123353586644233), ('construction', 0.11945270995269436), ('building', 0.10637157845606555), ('investors', 0.06675114967366767), ('stage', 0.05503532672910801), ('clients', 0.0507262942318816), ('demand', 0.05056281895492815)]
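
If you want to reuse the same trick for both extractors, the replace-and-restore steps could be pulled out into two small helpers. This is only a sketch: `protect_hyphens` and `restore_hyphens` are names I made up (they are not part of pke or KeyBERT), and it assumes, as in the workaround above, that the extractor keeps a word containing an underscore as a single token and returns (phrase, score) pairs. The scores in the usage lines are placeholders, not real extractor output.

```python
import re

# A hyphen sitting between two word characters, e.g. "life-cycle".
# A free-standing " - " used as a dash is left untouched.
_INTRA_WORD_HYPHEN = re.compile(r"(?<=\w)-(?=\w)")


def protect_hyphens(text):
    """Replace intra-word hyphens with underscores before extraction."""
    return _INTRA_WORD_HYPHEN.sub("_", text)


def restore_hyphens(keyphrases, original_text):
    """Turn underscores back into hyphens in (phrase, score) pairs.

    Words that already contained an underscore in the original text
    are left as they are (compared case-insensitively, since extractors
    often lowercase their output).
    """
    keep = {w.lower() for w in re.findall(r"\w+", original_text) if "_" in w}
    restored = []
    for phrase, score in keyphrases:
        words = [
            w if w.lower() in keep else w.replace("_", "-")
            for w in phrase.split()
        ]
        restored.append((" ".join(words), score))
    return restored


# Usage with placeholder pairs standing in for real extractor output.
text = "The life-cycle Global Warming Potential uses snake_case names"
protected = protect_hyphens(text)
placeholder_output = [
    ("life_cycle global warming potential", 0.55),
    ("snake_case names", 0.1),
]
print(restore_hyphens(placeholder_output, text))
```

Unlike the loop above, this checks each word of a keyphrase separately, so a multi-word phrase that contains a genuine underscore word from the original text is not accidentally changed. You would pass `protected` to `load_document` or `extract_keywords` and then run `restore_hyphens` on whatever comes back.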

I hope this helps :)

User contributions licensed under: CC BY-SA