
spaCy Matcher returns the right answer only when the two words are set as separate ‘TEXT’ conditions. Why?

I’m trying to set up a Matcher that finds the phrase ‘iPhone X’.

The sample code says I should do it as follows.

import spacy

# Import the Matcher
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]

# Add the pattern to the matcher (spaCy v2 signature; spaCy v3 expects
# matcher.add("IPHONE_X_PATTERN", [pattern]) instead)
matcher.add("IPHONE_X_PATTERN", None, pattern)

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

I tried another approach, shown below.

# Create a pattern with a single token description: "iPhone X"
pattern = [{"TEXT": "iPhone X"}]

# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", None, pattern)

Why doesn’t the second approach work? I assumed that if I put the two words ‘iPhone’ and ‘X’ together, it would work the same way, because the Matcher would treat the string with a space in the middle as one long word. But it didn’t.

The only reason I can think of is that each Matcher condition must be a single word with no spaces. Am I right, or is there another reason the second approach doesn’t work?

Thank you.



The answer lies in how spaCy tokenizes the string:

>>> print([t.text for t in doc])
['Upcoming', 'iPhone', 'X', 'release', 'date', 'leaked', 'as', 'Apple', 'reveals', 'pre', '-', 'orders']

As you can see, iPhone and X are separate tokens. See the Matcher reference:

A pattern added to the Matcher consists of a list of dictionaries. Each dictionary describes one token and its attributes.

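Thus, you cannot describe both of them in one token definition. A single dictionary such as {"TEXT": "iPhone X"} is compared against one token at a time, and no token in the doc has the text "iPhone X", so it never matches. You need one dictionary per token.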
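
To see the difference side by side, here is a minimal sketch that runs both patterns against the same doc. It makes a few assumptions that are not in the question: it uses spaCy v3 (where patterns are passed to `matcher.add` as a list), it uses a blank English pipeline instead of `en_core_web_sm` because only the tokenizer matters here, and `find` is a hypothetical helper written for this illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline is enough, since only the tokenizer is needed.
nlp = spacy.blank("en")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")


def find(pattern):
    """Hypothetical helper: return the text of every span matched by one pattern."""
    matcher = Matcher(nlp.vocab)
    # Assumption: spaCy v3 signature, add(key, list_of_patterns).
    matcher.add("IPHONE_X_PATTERN", [pattern])
    return [doc[start:end].text for _, start, end in matcher(doc)]


# One dictionary = one token, and no single token has the text "iPhone X".
one_dict = find([{"TEXT": "iPhone X"}])

# Two dictionaries = two consecutive tokens, "iPhone" followed by "X".
two_dicts = find([{"TEXT": "iPhone"}, {"TEXT": "X"}])

print("One dictionary:", one_dict)
print("Two dictionaries:", two_dicts)
```

The one-dictionary pattern should come back empty, while the two-dictionary pattern should return the span covering both tokens.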
