Skip to content
Advertisement

Add new pattern in Entity Ruler Spacy with regex in multiple tokens

I have this code that works well if I try to search exact words.

from spacy.lang.en import English
import spacy

#nlp = spacy.load("en_core_web_sm")
nlp = spacy.load("en_core_web_sm", disable=["tagger", "attribute_ruler", "lemmatizer","ner"])
ruler = nlp.add_pipe("entity_ruler")
patterns = [{"label": "ORG", "pattern": "Google"},
            {"label": "COLOR", "pattern": "yellow"},
            {"label": "COLOR", "pattern": "red"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
            {"label": "DIN", "pattern": [{"TEXT" : {"REGEX": "DINd"}}]},

            {"label": "DIAM", "pattern": [{"TEXT" : {"REGEX": "diameterd"}}]},  
            {"label": "MATERIAL", "pattern": [{"LOWER": "zinc"}, {"LOWER": "plated"}]},
            {"label": "MATERIAL", "pattern": [{"LOWER": "stainless"}, {"LOWER": "steel"}]},
            
            
            {"label": "BRAND", "pattern": [{"LOWER": "cubitron"},{"LOWER": "ii"}]}            
            
           ]
ruler.add_patterns(patterns)

doc = nlp("Google red yellow DIN 789 opening its first big zinc plated ffice in San Francisco")
print([(ent.text, ent.label_) for ent in doc.ents])

But the regex doesnt work for whole sentence but just for each token.

I tried to add something like this to add new entity but it doesnt still show the new label DIN in the output.

from spacy.tokens import Span

doc = nlp("Google red yellow DIN 180 opening its first big zinc plated ffice in San Francisco")

pattern = r"DINsd"
original_ents = list(doc.ents) 
mwt_ents = []
for match in re.finditer(pattern, doc.text):
   start, end = match.span()
   span = doc.char_span(start, end)
   if span is not None:
       mwt_ents.append((span.start, span.end, span.text))
       
for ent in mwt_ents:
   start, end, name = ent
   per_ent = Span(doc, start, end, label="DIN")
   original_ents.append(per_ent)

doc.ents = original_ents

from spacy.util import filter_spans
filtered = filter_spans(original_ents)
doc.ents = filtered
for ent in doc.ents:
   print (ent.text, ent.label_)

What all am I doing wrong? How can I add to the nlp model new rule based on regex that searches in the whole input? THANKS!!

Advertisement

Answer

Since your regexes are just for numeric tokens, just add a new token to your pattern.

[{"LOWER" : "diameter"}, {"IS_DIGIT": True}]

How can I add to the nlp model new rule based on regex that searches in the whole input?

The Matcher just doesn’t support that. If you want to use regexes against the whole input you can do that yourself and add the spans directly, you don’t need the Matcher.

User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement