spaCy Matcher returns the right answer only when the two words are set as separate ‘TEXT’ condition objects. Why is that?

I’m trying to set up a matcher that finds the phrase ‘iPhone X’.

The sample code says I should do it as follows.

import spacy

# Import the Matcher
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]

# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", None, pattern)

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

I tried another approach, shown below.

# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{"TEXT": "iPhone X"}]

# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", None, pattern)

Why does the second approach not work? I assumed that if I put the two words ‘iPhone’ and ‘X’ together, it would work the same way, because the matcher would treat the words with a space in the middle as one long unique word. But it didn’t.

The only reason I can think of is that a matcher condition must be a single word without any whitespace. Am I right, or is there another reason the second approach does not work?

Thank you.

Answer

The answer is in how spaCy tokenizes the string:

>>> print([t.text for t in doc])
['Upcoming', 'iPhone', 'X', 'release', 'date', 'leaked', 'as', 'Apple', 'reveals', 'pre', '-', 'orders']

As you can see, iPhone and X are separate tokens. See the Matcher reference:

A pattern added to the Matcher consists of a list of dictionaries. Each dictionary describes one token and its attributes.

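Thus, you cannot use them both in one token definition. A pattern such as {"TEXT": "iPhone X"} can never match, because no single token in the doc has the text "iPhone X" and one dictionary cannot span two tokens.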
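If you would rather write the phrase as one string, one option is spaCy's PhraseMatcher, which takes a Doc as its pattern and therefore tokenizes the phrase the same way as the text. The following is a minimal sketch, not the course's solution. It assumes spaCy v3, where add() takes a list of patterns (the question uses the older add(name, None, pattern) form), and it assumes a blank English pipeline is enough because only the tokenizer is involved. The pattern names are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher, PhraseMatcher

# Assumption: a blank English pipeline suffices here, since only the
# tokenizer is needed; the question loads "en_core_web_sm" instead.
nlp = spacy.blank("en")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# One dictionary describes one token. "iPhone X" is never the text of a
# single token, so this pattern has nothing to match against.
single_token = Matcher(nlp.vocab)
single_token.add("IPHONE_X_SINGLE", [[{"TEXT": "iPhone X"}]])
single_matches = [doc[start:end].text for _, start, end in single_token(doc)]

# PhraseMatcher patterns are Doc objects, so "iPhone X" is split into the
# same two tokens as in the searched text and matched as a sequence.
phrase_matcher = PhraseMatcher(nlp.vocab)
phrase_matcher.add("IPHONE_X_PHRASE", [nlp.make_doc("iPhone X")])
phrase_matches = [doc[start:end].text for _, start, end in phrase_matcher(doc)]

print("Single-token pattern:", single_matches)
print("PhraseMatcher:", phrase_matches)
```

The two-dictionary pattern from the sample code stays the better fit when you need per-token conditions (for example matching on lowercase form or part of speech), while PhraseMatcher is convenient when you only have literal phrases.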
