Is it possible to use spaCy with already tokenized input?

Question

I have a sentence that has already been tokenized into words. I want to get the part-of-speech tag for each word in the sentence. When I checked the spaCy documentation, I realized it starts with the raw sentence. I don't want to do that because in that case spaCy might end up with a different toke…

Accepted Answer

You can do this by replacing spaCy's default tokenizer with your own:

    nlp.tokenizer = custom_tokenizer

Here custom_tokenizer is a function that takes raw text as input and returns a Doc object.

You did not specify how you got the list of tokens. If you already have a function that takes raw text and returns a list of tokens, just make a small change to it:

    from spacy.tokens import Doc

    def custom_tokenizer(text):
        tokens = []
        # your existing code to fill the list with tokens
        # replace this line:
        #     return tokens
        # with this:
        return Doc(nlp.vocab, tokens)

See the documentation on Doc.

If for some reason you cannot do this (maybe you don't have access to the tokenization function), you can use a dictionary:

    tokens_dict = {'Hello, world.': ['Hello', ',', 'world', '.']}

    def custom_tokenizer(text):
        if text in tokens_dict:
            return Doc(nlp.vocab, tokens_dict[text])
        else:
            raise ValueError('No tokenization available for input.')

Either way, you can then use the pipeline as in your first example:

    doc = nlp('Hello, world.')
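The dictionary-based approach from the answer can be sketched end to end as follows. This is a minimal sketch rather than a definitive setup: it assumes a blank English pipeline (`spacy.blank("en")`) so that it runs without downloading a model, which also means no tagger is present and `token.pos_` stays empty. For real part-of-speech tags you would load a trained pipeline you have installed (for example `en_core_web_sm`, named here only as an assumption).

```python
import spacy
from spacy.tokens import Doc

# Assumption: a blank English pipeline keeps this sketch runnable without a
# model download. It has no tagger, so token.pos_ will be empty; load a
# trained pipeline (e.g. spacy.load("en_core_web_sm")) to get real tags.
nlp = spacy.blank("en")

# Known tokenizations, keyed by the raw text they came from.
tokens_dict = {"Hello, world.": ["Hello", ",", "world", "."]}


def custom_tokenizer(text):
    """Return a Doc built from the stored tokens for this exact text."""
    if text not in tokens_dict:
        raise ValueError("No tokenization available for input.")
    # Doc(vocab, words=...) keeps the tokens exactly as given.
    return Doc(nlp.vocab, words=tokens_dict[text])


# Replace the default tokenizer; any other pipeline components are unchanged.
nlp.tokenizer = custom_tokenizer

doc = nlp("Hello, world.")
tokens = [token.text for token in doc]
print(tokens)

for token in doc:
    print(token.text, repr(token.pos_))
```

When only `words` is passed, `Doc` assumes every token is followed by a space, so `doc.text` may differ from the original string. If that matters, pass a matching `spaces` list of booleans to `Doc` as well.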


Answer