I’m trying to set up a matcher that finds the phrase ‘iPhone X’.
The sample code says I should do it as follows.
import spacy

# Import the Matcher
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]

# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", None, pattern)

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])
I tried another approach, changing the pattern as shown below.
# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{"TEXT": "iPhone X"}]

# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", None, pattern)
Why doesn’t the second approach work? I assumed that putting the two words ‘iPhone’ and ‘X’ together would work the same way, because the matcher would treat the text with a space in the middle as one long word. But it didn’t.
The only reason I can think of is that each matcher condition must be a single word with no spaces. Am I right, or is there another reason the second approach doesn’t work?
Thank you.
Answer
The answer lies in how spaCy tokenizes the string:
>>> print([t.text for t in doc])
['Upcoming', 'iPhone', 'X', 'release', 'date', 'leaked', 'as', 'Apple', 'reveals', 'pre', '-', 'orders']
As you can see, iPhone and X are separate tokens. See the Matcher reference:
A pattern added to the Matcher consists of a list of dictionaries. Each dictionary describes one token and its attributes.
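Thus, you cannot match both of them in one token definition: the pattern needs one dictionary per token, and no single token in the doc has the text "iPhone X".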
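To make the one-dictionary-per-token rule concrete, here is a minimal sketch in plain Python. It is a toy model, not spaCy's actual implementation: `toy_match` is a hypothetical helper that only understands the `TEXT` key, and the token list is copied from the tokenizer output shown above.

```python
def toy_match(tokens, pattern):
    """Toy stand-in for a token matcher (hypothetical, not spaCy's code).

    Each dict in `pattern` is compared against exactly one token, and the
    dicts must match consecutive tokens. Returns (start, end) index pairs.
    """
    if not pattern:
        return []
    n = len(pattern)
    spans = []
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        # Every pattern entry is checked against a single token's text
        if all(spec["TEXT"] == tok for spec, tok in zip(pattern, window)):
            spans.append((start, start + n))
    return spans


# Token texts as printed by the tokenizer in the answer above
tokens = ['Upcoming', 'iPhone', 'X', 'release', 'date', 'leaked',
          'as', 'Apple', 'reveals', 'pre', '-', 'orders']

# Two dictionaries, one per token: matches tokens 1 and 2
two_token_spans = toy_match(tokens, [{"TEXT": "iPhone"}, {"TEXT": "X"}])

# One dictionary holding "iPhone X": no single token has that text
one_token_spans = toy_match(tokens, [{"TEXT": "iPhone X"}])

print(two_token_spans)  # [(1, 3)]
print(one_token_spans)  # []
```

The second pattern fails not because of the space as such, but because it is compared against one token at a time, and the tokenizer never produced a token whose text is "iPhone X".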