I am writing some code to perform Named Entity Recognition (NER), which is coming along quite nicely for English texts. However, I would like to be able to apply NER to text in any language. To do this, I would like to 1) identify the language of a text, and then 2) apply the NER for the identified language. For step 2, I'm unsure whether to A) translate the text to English and then apply the English NER, or B) apply the NER in the identified language.
Below is the code I have so far. I would like the NER to work for text2, or for text in any other language, once that language has been detected:
```python
import spacy
from spacy_langdetect import LanguageDetector
from langdetect import DetectorFactory

text = 'In 1793, Alexander Hamilton recruited Webster to move to New York City and become an editor for a Federalist Party newspaper.'
text2 = 'Em 1793, Alexander Hamilton recrutou Webster para se mudar para a cidade de Nova York e se tornar editor de um jornal do Partido Federalista.'

# Step 1: Identify the language of a text
DetectorFactory.seed = 0
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(LanguageDetector(), name='language_detector', last=True)
doc = nlp(text)
print(doc._.language)

# Step 2: NER
Entities = [(str(x), x.label_) for x in nlp(str(text)).ents]
print(Entities)
```
Does anyone have any experience with this? Much appreciated!
Answer
spaCy needs to load a model trained for the language of the text, so detect the language first and then pick the matching model.
See https://spacy.io/usage/models for available models.
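```python
import spacy
from langdetect import detect

# Load one model per language up front.
# Fill in the languages you want; hopefully they are supported by spaCy.
nlp = {}
for lang in ["en", "es", "pt", "ru"]:
    if lang == "en":
        nlp[lang] = spacy.load(lang + '_core_web_lg')
    else:
        nlp[lang] = spacy.load(lang + '_core_news_lg')


def entites(text):
    # Step 1: identify the language of the text
    lang = detect(text)
    try:
        nlp2 = nlp[lang]
    except KeyError:
        # Raise instead of returning the exception object,
        # so callers cannot mistake it for a result.
        raise ValueError(lang + " model is not loaded")
    # Step 2: NER with the model for that language
    return [(str(x), x.label_) for x in nlp2(str(text)).ents]
```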
Then you can run the two steps together:

```python
ents = entites(text)
print(ents)
```
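If loading every large model at startup is too slow or memory-hungry, one possible variant is to load each model only the first time its language is detected. The sketch below is just an illustration of that idea, not part of spaCy or langdetect: `MODEL_NAMES` and `make_ner` are hypothetical names, the small (`_sm`) models are an assumption, and the detector and loader are passed in as arguments so the function does not depend on a particular library version.

```python
# Hypothetical mapping from detected language code to a spaCy model name.
# Assumption: these models have been installed, e.g. with
# `python -m spacy download pt_core_news_sm`.
MODEL_NAMES = {
    "en": "en_core_web_sm",
    "es": "es_core_news_sm",
    "pt": "pt_core_news_sm",
}


def make_ner(detect, load):
    """Build an NER function from a language detector and a model loader.

    detect: callable taking text and returning a language code
            (for example langdetect.detect).
    load:   callable taking a model name and returning a pipeline
            (for example spacy.load).
    """
    cache = {}  # model name -> loaded pipeline

    def ner(text):
        # Step 1: identify the language of the text
        lang = detect(text)
        name = MODEL_NAMES.get(lang)
        if name is None:
            raise ValueError(f"no model configured for language {lang!r}")
        # Load the model only on first use, then reuse it
        if name not in cache:
            cache[name] = load(name)
        # Step 2: NER with the model for that language
        doc = cache[name](text)
        return lang, [(ent.text, ent.label_) for ent in doc.ents]

    return ner
```

With both libraries installed, this would be wired up as `ner = make_ner(langdetect.detect, spacy.load)` and called as `ner(text2)`. Short or mixed-language texts can be misdetected, so it may be worth logging the returned language code alongside the entities.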