I have tried so many things to do named entity recognition on a column in my CSV file. I tried ne_chunk, but I am unable to get the results of ne_chunk into columns like so:
```
ID  STORY                               PERSON  NE  NP  NN  VB  GE
1   Washington, a police officer James  1       0   0   0   0   1
```
Instead, after using this code,
```python
news = pd.read_csv("news.csv")

news['tokenize'] = news.apply(lambda row: nltk.word_tokenize(row['STORY']), axis=1)

news['pos_tags'] = news.apply(lambda row: nltk.pos_tag(row['tokenize']), axis=1)

news['entityrecog'] = news.apply(lambda row: nltk.ne_chunk(row['pos_tags']), axis=1)

tag_count_df = pd.DataFrame(news['entityrecognition'].map(lambda x: Counter(tag[1] for tag in x)).to_list())

news = pd.concat([news, tag_count_df], axis=1).fillna(0).drop(['entityrecognition'], axis=1)

news.to_csv("news.csv")
```
I got this error:

```
IndexError: list index out of range
```
So I am wondering if I could do this using spaCy instead, which is another thing that I have no clue about. Can anyone help?
Answer
It seems that you are checking the chunks incorrectly; that's why you get the error. I'm guessing a little about what you want to do, but this creates new columns for each NER type returned by NLTK. It would be a little cleaner to predefine and zero each NER-type column, as this approach gives you NaN where an NER type does not occur.
```python
import nltk
import pandas as pd
from collections import Counter

def extract_ner_count(tagged):
    entities = {}
    chunks = nltk.ne_chunk(tagged)
    for chunk in chunks:
        if type(chunk) is nltk.Tree:
            # if you don't need the entities, just add the label directly rather than this
            t = ''.join(c[0] for c in chunk.leaves())
            entities[t] = chunk.label()
    return Counter(entities.values())

news = pd.read_csv("news.csv")
news['tokenize'] = news.apply(lambda row: nltk.word_tokenize(row['STORY']), axis=1)
news['pos_tags'] = news.apply(lambda row: nltk.pos_tag(row['tokenize']), axis=1)
news['entityrecognition'] = news.apply(lambda row: extract_ner_count(row['pos_tags']), axis=1)
news = pd.concat([news, pd.DataFrame(list(news["entityrecognition"]))], axis=1)

print(news.head())
```
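The expansion step at the end is the part that was going wrong, so here it is in isolation with toy data (the two `Counter` values below are illustrative, not from the news data): a column holding one `Counter` of NER labels per row becomes one count column per label when passed through `pd.DataFrame`.

```python
import pandas as pd
from collections import Counter

# Toy stand-in for the 'entityrecognition' column: one Counter of NER labels per row.
df = pd.DataFrame({"STORY": ["story one", "story two"]})
df["entityrecognition"] = [Counter({"PERSON": 2, "GPE": 1}), Counter({"PERSON": 1})]

# Each Counter becomes a row of per-label counts; labels missing from a row come out as NaN.
counts = pd.DataFrame(list(df["entityrecognition"]))
df = pd.concat([df, counts], axis=1)
print(df)
```

This is also why predefining and zeroing the columns is cleaner: the second row has no `GPE` entity, so its `GPE` cell is NaN rather than 0.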
If all you want is the counts, the following is more performant and doesn't have NaNs:
```python
import nltk
import pandas as pd
from collections import Counter

tagger = nltk.PerceptronTagger()
chunker = nltk.data.load(nltk.chunk._MULTICLASS_NE_CHUNKER)
NE_Types = {'GPE', 'ORGANIZATION', 'LOCATION', 'GSP', 'O', 'FACILITY', 'PERSON'}

def extract_ner_count(text):
    c = Counter()
    chunks = chunker.parse(tagger.tag(nltk.word_tokenize(text, preserve_line=True)))
    for chunk in chunks:
        if type(chunk) is nltk.Tree:
            c.update([chunk.label()])
    return c

news = pd.read_csv("news.csv")
for NE_Type in NE_Types:
    news[NE_Type] = 0
news.update(list(news["STORY"].apply(extract_ner_count)))

print(news.head())
```