
Finding words within paragraph using Python [closed]

Let's say I have the following words, Test_wrds = ['she', 'her', 'women'], and I would like to check whether any of them is present in the following string:

text= "What recent discussions she has had with the Secretary of State for Work and Pensions on the effect of that Department’s welfare policies on women."

The question is: how can I find these Test_wrds in text, bold them in different colours, and count how many times each one appears in the paragraph? I am expecting output something like this:

text= " What recent discussions **she** has had with the Secretary of State for Work and Pensions on the effect of that Department’s welfare policies on **women**."

So far, I have written the following code:

text=" Q: What recent discussions she has had with the Secretary of State for Work and Pensions on the effect of that Department’s welfare policies on women."
Test_wrds = ['she', 'her','women']

import spacy 
nlp = spacy.load("en_core_web_sm") 
doc = nlp(text)
# word split
Wrd_token=[token.orth_ for token in doc]

I am not sure how to proceed from here. I used spaCy because I found it powerful and easy to use, which should help with my future coding.
Thanks in advance.


Answer

First, to count how many times each word from the Test_wrds list appears in text, you can use ORTH, which is the ID of a token's verbatim text. Because the text is compared verbatim, the match is case-sensitive: 'She' would not be counted as 'she'.

import spacy
from spacy.lang.en import English
from spacy.attrs import ORTH

text=" Q: What recent discussions she has had with the Secretary of State for Work and Pensions on the effect of that Department’s welfare policies on women."
Test_wrds = ['she', 'her','women']

nlp = English()

doc = nlp(text)

# Dictionary mapping each word's ID to the number of times that word appears in the text
count_number = doc.count_by(ORTH)

for wid, number in sorted(count_number.items(), key=lambda x: x[1]):
    # nlp.vocab.strings[wid] gives the word corresponding to the ID
    if nlp.vocab.strings[wid] in Test_wrds:
        print(number, nlp.vocab.strings[wid])

Output:

1 she
1 women
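
If spaCy is not otherwise needed, the same count can be sketched with the standard library alone. This is an alternative to the answer's approach, not a description of it: the helper name count_words is made up for illustration, matching is case-insensitive (unlike ORTH), and words that never occur are reported with a count of 0.

```python
import re
from collections import Counter

def count_words(text, words):
    """Count whole-word, case-insensitive occurrences of each word in text."""
    # \w+ extracts runs of letters/digits, so trailing punctuation is dropped
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    # A Counter returns 0 for missing keys, so absent words are reported too
    return {word: counts[word.lower()] for word in words}

text = " Q: What recent discussions she has had with the Secretary of State for Work and Pensions on the effect of that Department’s welfare policies on women."
Test_wrds = ['she', 'her', 'women']

print(count_words(text, Test_wrds))
# {'she': 1, 'her': 0, 'women': 1}
```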

Second, to wrap each matching word in bold markers, you can try:

# Put a space before each full stop so that a word ending a sentence
# (e.g. 'women.') becomes its own token after split()
text = text.replace('.', ' .')

lista = text.split()

for index, token in enumerate(lista):
    # Exact, case-sensitive match against the target words
    if token in Test_wrds:
        lista[index] = '**' + token + '**'

new_text = ' '.join(lista)

Output:

>>> new_text
'Q: What recent discussions **she** has had with the Secretary of State for Work and Pensions on the effect of that Department’s welfare policies on **women** .'
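
The split-and-join approach above changes the spacing around full stops and misses words followed by other punctuation. A possible alternative, sketched here with only the re module, is a word-boundary substitution that leaves the rest of the text untouched and returns the number of replacements as well. The names highlight, ansi and COLOURS are hypothetical, and the colour part assumes a terminal that understands ANSI escape codes.

```python
import re

def highlight(text, words, wrap=lambda w: "**" + w + "**"):
    """Wrap whole-word, case-insensitive matches; return (new_text, n_matches)."""
    # re.escape guards against regex metacharacters in the words;
    # the \b anchors keep 'her' from matching inside 'there'
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, words)) + r")\b", re.IGNORECASE
    )
    # subn returns the new string together with the number of substitutions
    return pattern.subn(lambda m: wrap(m.group(0)), text)

# Assumed colour choice for illustration: bold red, bold green, bold blue
COLOURS = {"she": "\033[1;31m", "her": "\033[1;32m", "women": "\033[1;34m"}

def ansi(word):
    """Surround a word with its ANSI colour code and a reset code."""
    return COLOURS[word.lower()] + word + "\033[0m"

text = "Q: What recent discussions she has had with the Secretary of State for Work and Pensions on the effect of that Department’s welfare policies on women."
Test_wrds = ['she', 'her', 'women']

bold_text, total = highlight(text, Test_wrds)
print(bold_text)   # 'she' and 'women' wrapped in ** markers
print(total)       # 2
print(highlight(text, Test_wrds, wrap=ansi)[0])  # coloured in an ANSI terminal
```

Passing a different wrap function is what switches between markdown-style bold and per-word colours; the matching logic stays the same.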
User contributions licensed under: CC BY-SA