I have a sentence that has already been tokenized into words, and I want to get the part-of-speech tag for each word. The spaCy documentation starts from the raw sentence. I don't want to do that, because spaCy might then produce a tokenization that differs from the one I already have. Is it possible to run spaCy on a list of words rather than a string?
Here is an example of what I mean:
# I know that the following works:
import spacy

nlp = spacy.load('en_core_web_sm')
raw_text = 'Hello, world.'
doc = nlp(raw_text)
for token in doc:
    print(token.pos_)
But I want to do something similar to the following:
import spacy

nlp = spacy.load('en_core_web_sm')
tokenized_text = ['Hello', ',', 'world', '.']
doc = nlp(tokenized_text)
for token in doc:
    print(token.pos_)
I know this doesn't work, but is it possible to do something similar?
Answer
You can do this by replacing spaCy’s default tokenizer with your own:
nlp.tokenizer = custom_tokenizer
Here custom_tokenizer is a function that takes raw text as input and returns a Doc object.
You did not specify how you got the list of tokens. If you already have a function that takes raw text and returns a list of tokens, just make a small change to it:
from spacy.tokens import Doc

def custom_tokenizer(text):
    tokens = []
    # your existing code to fill the list with tokens

    # replace this line:
    # return tokens
    # with this:
    return Doc(nlp.vocab, tokens)
See the documentation on Doc.
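As a minimal sketch of what the Doc constructor does with a list of words: it also accepts an optional spaces list saying whether each word is followed by a space, which defaults to a space after every token. The blank English pipeline below is only an assumption to keep the example free of a model download; only its vocab is used.

```python
import spacy
from spacy.tokens import Doc

# A blank English pipeline is enough here: only its vocab is needed.
nlp = spacy.blank("en")

words = ["Hello", ",", "world", "."]
# spaces[i] says whether words[i] is followed by a space in the original text.
spaces = [False, True, False, False]

doc = Doc(nlp.vocab, words=words, spaces=spaces)

# The tokens are exactly the ones passed in; no re-tokenization happens.
print([token.text for token in doc])
# The text is rebuilt from the words and the spaces flags.
print(doc.text)
```

If you leave spaces out, the tokens are still preserved, but doc.text will contain a space after every token, so it will not match the original string exactly.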
If for some reason you cannot do this (maybe you don’t have access to the tokenization function), you can use a dictionary:
tokens_dict = {'Hello, world.': ['Hello', ',', 'world', '.']}

def custom_tokenizer(text):
    if text in tokens_dict:
        return Doc(nlp.vocab, tokens_dict[text])
    else:
        raise ValueError('No tokenization available for input.')
Either way, you can then use the pipeline as in your first example:
doc = nlp('Hello, world.')
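Putting the dictionary approach together, here is a rough end-to-end sketch. The sentence "I don't know." and the names load_pipeline, pretokenized and default_tokens are made up for illustration; the sentence was picked because spaCy's default English tokenizer splits "don't" in two, so the difference is visible. The sketch assumes en_core_web_sm is installed and falls back to a blank pipeline otherwise, in which case the tokens are still preserved but token.pos_ stays empty because there is no tagger.

```python
import spacy
from spacy.tokens import Doc


def load_pipeline():
    """Load en_core_web_sm if installed, else a blank pipeline (no tagger)."""
    try:
        return spacy.load("en_core_web_sm")
    except OSError:
        # Model not installed: tokenization still works, POS tags will be empty.
        return spacy.blank("en")


nlp = load_pipeline()

text = "I don't know."
# Hypothetical lookup table: raw text -> the tokenization you already have.
pretokenized = {text: ["I", "don't", "know", "."]}

# Tokens from spaCy's default tokenizer, kept only for comparison.
default_tokens = [token.text for token in nlp.tokenizer(text)]


def custom_tokenizer(raw_text):
    """Return a Doc built from the stored tokens instead of re-tokenizing."""
    if raw_text not in pretokenized:
        raise ValueError("No tokenization available for input.")
    return Doc(nlp.vocab, words=pretokenized[raw_text])


# Replace the default tokenizer; the rest of the pipeline is unchanged.
nlp.tokenizer = custom_tokenizer
doc = nlp(text)

print("default:", default_tokens)
print("custom: ", [token.text for token in doc])
for token in doc:
    print(token.text, token.pos_)
```

Keep in mind that the model was trained on text tokenized spaCy's way, so tags for tokens it would normally have split (like "don't") may be less reliable.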