Let's say I have the following words, Test_wrds = ['she', 'her', 'women'], and I would like to check whether any of them is present in the following string:
text = "What recent discussions she has had with the Secretary of State for Work and Pensions on the effect of that Department’s welfare policies on women."
The question is how to find these Test_wrds in text, bold them in different colours, and count how many times each of them appears in text. I am expecting output something like this:
text = "What recent discussions **she** has had with the Secretary of State for Work and Pensions on the effect of that Department’s welfare policies on **women**."
So far, I have written the following code:

    import spacy

    text = " Q: What recent discussions she has had with the Secretary of State for Work and Pensions on the effect of that Department’s welfare policies on women."
    Test_wrds = ['she', 'her', 'women']

    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)

    # word split
    Wrd_token = [token.orth_ for token in doc]
I am not sure how to proceed after this. I used spaCy because I found it powerful and easy to use for my future coding.
Thanks in advance.
Answer
First, to count how many times each word from the Test_wrds list occurs in text, you can use doc.count_by(ORTH). ORTH is the ID of a token's verbatim text content, so the result is a dictionary that maps each token ID to the number of times that token appears. Because the verbatim text is compared, the match is case-sensitive.

    from spacy.attrs import ORTH
    from spacy.lang.en import English

    text = " Q: What recent discussions she has had with the Secretary of State for Work and Pensions on the effect of that Department’s welfare policies on women."
    Test_wrds = ['she', 'her', 'women']

    nlp = English()
    doc = nlp(text)

    # Dictionary mapping each token's ID to the number of times it appears in text
    count_number = doc.count_by(ORTH)

    for wid, number in sorted(count_number.items(), key=lambda x: x[1]):
        # nlp.vocab.strings[wid] gives the word corresponding to the ID
        if nlp.vocab.strings[wid] in Test_wrds:
            print(number, nlp.vocab.strings[wid])
Output:
    1 she
    1 women
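If spaCy is not needed for the counting step, the same idea can be sketched with the standard library alone. This is only an illustration, not the spaCy approach above: the helper name count_targets is made up here, tokens are taken to be runs of word characters, and matching is case-insensitive (an assumption; the ORTH-based count is case-sensitive). Words that never occur are reported with a count of 0.

```python
import re
from collections import Counter

text = ("Q: What recent discussions she has had with the Secretary of State "
        "for Work and Pensions on the effect of that Department’s welfare "
        "policies on women.")
test_wrds = ['she', 'her', 'women']


def count_targets(text, targets):
    """Count whole-word, case-insensitive occurrences of each target word.

    Hypothetical helper for illustration; targets are assumed to be lowercase.
    """
    # Lowercase the text and split it into runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    # Counter returns 0 for words that never occur.
    return {word: counts[word] for word in targets}


counts = count_targets(text, test_wrds)
print(counts)
```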
Second, to wrap each matching word in bold markers, you can try:

    # Put a space before each '.' so that 'women.' splits into 'women' and '.'
    text = text.replace('.', ' .')
    lista = text.split()

    for word in Test_wrds:
        if word in lista:
            # Find the list indices where the word occurs
            indices = [i for i, j in enumerate(lista) if j == word]
            for index in indices:
                lista[index] = '**' + word + '**'

    new_text = ' '.join(lista)
Output:

    >>> new_text
    'Q: What recent discussions **she** has had with the Secretary of State for Work and Pensions on the effect of that Department’s welfare policies on **women** .'
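The split-and-join approach leaves a space before the full stop and only handles '.' as punctuation. A possible alternative, sketched here with a regular expression and word boundaries, marks and counts the words in one pass without altering the rest of the string. The function name highlight and its colours parameter are hypothetical; the colours are ANSI escape codes, which only render in terminals that support them, so this is one guess at what "different colours" could mean.

```python
import re

text = ("Q: What recent discussions she has had with the Secretary of State "
        "for Work and Pensions on the effect of that Department’s welfare "
        "policies on women.")
test_wrds = ['she', 'her', 'women']


def highlight(text, targets, colours=None):
    """Wrap whole-word matches in ** ** and count them (hypothetical helper).

    colours is an optional dict mapping a target word to an ANSI colour code.
    """
    colours = colours or {}
    lookup = {word.lower(): word for word in targets}
    counts = {word: 0 for word in targets}
    # \b ensures 'her' does not match inside 'other' or 'where'.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in targets) + r")\b",
        re.IGNORECASE,
    )

    def repl(match):
        found = match.group(0)
        word = lookup[found.lower()]
        counts[word] += 1
        marked = "**" + found + "**"
        code = colours.get(word)
        # "\033[0m" resets the terminal colour after the marked word.
        return code + marked + "\033[0m" if code else marked

    return pattern.sub(repl, text), counts


new_text, counts = highlight(text, test_wrds)
print(new_text)
print(counts)

# Optional: a different ANSI colour per word (red and blue here).
coloured, _ = highlight(text, test_wrds, {'she': "\033[31m", 'women': "\033[34m"})
print(coloured)
```

Because the original string is substituted in place rather than split and rejoined, punctuation and spacing are preserved, so the sentence still ends in **women**. with no extra space.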