
Is it possible to use spaCy with already tokenized input?

I have a sentence that has already been tokenized into words, and I want to get the part-of-speech tag for each word. When I checked the spaCy documentation, I saw that it starts from the raw sentence. I don’t want to do that, because spaCy might then end up with a different tokenization from the one I already have. Is it possible to use spaCy with a list of words rather than a string?

Here is an example of what I mean:

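The snippet originally posted here did not survive. As a rough sketch of the usual string-based workflow the question describes (not the asker's exact code), assuming `en_core_web_sm` as an example pipeline and a made-up sentence:

```python
import spacy

# Assumption: "en_core_web_sm" stands in for whatever trained pipeline is used.
# If it is not installed, fall back to a blank pipeline, which tokenizes
# but has no tagger, so token.pos_ is then an empty string.
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    nlp = spacy.blank("en")

# spaCy tokenizes the raw string itself, so a word such as "don't" may be
# split differently from the tokenization you already have.
doc = nlp("I don't like green apples")
for token in doc:
    print(token.text, token.pos_)
```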

But I want to do something similar to the following:

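The intended call was presumably along these lines. This is a guess at the shape of the lost snippet, with a made-up word list, and it only shows that passing a list straight to `nlp` is rejected:

```python
import spacy

nlp = spacy.blank("en")  # a blank pipeline is enough to show the behaviour
words = ["I", "don't", "like", "green", "apples"]

try:
    doc = nlp(words)  # a list of strings instead of a raw string
    accepted = True
except (TypeError, ValueError) as err:
    # The exact exception type and message depend on the spaCy version.
    accepted = False
    print(f"nlp() rejected the list: {type(err).__name__}")
```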

I know this doesn’t work, but is it possible to do something similar?


Answer

You can do this by replacing spaCy’s default tokenizer with your own:

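A minimal sketch of that replacement, since the original snippet is missing here. The whitespace split is only an illustrative placeholder, and the blank English pipeline is an assumption (in practice you would probably load a trained one):

```python
import spacy
from spacy.tokens import Doc

# Assumption: a blank pipeline for illustration; a trained pipeline
# loaded with spacy.load(...) can be used in the same way.
nlp = spacy.blank("en")

def custom_tokenizer(text):
    # Illustrative placeholder: split on whitespace only.
    words = text.split()
    return Doc(nlp.vocab, words=words)

# Replace the default tokenizer; the other pipeline components are untouched.
nlp.tokenizer = custom_tokenizer
```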

Here, custom_tokenizer is a function that takes raw text as input and returns a Doc object.

You did not specify how you got the list of tokens. If you already have a function that takes raw text and returns a list of tokens, just make a small change to it:

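For example, assuming your existing function looks something like the hypothetical `tokenize_to_list` below (the regex is invented for illustration), the small change is to wrap its result in a `Doc`:

```python
import re

import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # assumption: stands in for your real pipeline

def tokenize_to_list(text):
    # Hypothetical existing tokenizer: words (keeping inner apostrophes)
    # or single punctuation characters.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

def custom_tokenizer(text):
    # The small change: return a Doc built from the token list.
    # With no `spaces` argument, every token is assumed to be
    # followed by a space.
    return Doc(nlp.vocab, words=tokenize_to_list(text))

nlp.tokenizer = custom_tokenizer
```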

See the documentation on Doc.

If for some reason you cannot do this (maybe you don’t have access to the tokenization function), you can use a dictionary that maps each raw text to its list of tokens:

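A sketch of the dictionary approach, under the assumption that every text you pass to the pipeline has an entry. The name `tokens_dict` and its contents are made up for illustration:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # assumption: stands in for your real pipeline

# Hypothetical lookup table: raw text -> its existing tokenization.
tokens_dict = {
    "I don't like green apples": ["I", "don't", "like", "green", "apples"],
}

def custom_tokenizer(text):
    # Look up the stored tokens instead of tokenizing the text again.
    if text in tokens_dict:
        return Doc(nlp.vocab, words=tokens_dict[text])
    raise ValueError(f"No tokenization available for input: {text!r}")

nlp.tokenizer = custom_tokenizer
```

Raising an error for unknown text is one possible design choice; it makes a missing entry visible instead of silently falling back to a different tokenization.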

Either way, you can then use the pipeline as in your first example:

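Putting it together, as a hedged end-to-end sketch: it assumes `en_core_web_sm` as the trained pipeline, a made-up word list, and that no token contains a space, so joining on spaces and splitting again gives back the same tokens:

```python
import spacy
from spacy.tokens import Doc

# Assumption: "en_core_web_sm" as the trained pipeline. Without it, the
# blank fallback still runs, but token.pos_ is then an empty string.
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    nlp = spacy.blank("en")

def custom_tokenizer(text):
    # Split on single spaces so the pre-tokenized words are kept as they are.
    return Doc(nlp.vocab, words=text.split(" "))

nlp.tokenizer = custom_tokenizer

words = ["I", "don't", "like", "green", "apples"]  # already tokenized input
doc = nlp(" ".join(words))
for token in doc:
    print(token.text, token.pos_)
```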
User contributions licensed under: CC BY-SA