
Extracting human names from text data using Python Stanza

I have a dataset containing the text of book title pages (i.e. all the words on each title page; each line of my txt file is a different book). From this I am trying to retrieve the author’s name, taken to be the human name that appears on the title page, and store the names for each book on a separate line of a CSV file. When I run the following code I get a “no author” value for every entry, which is not plausible given the input data. Can someone help me figure out what is going wrong? Thanks, I have been racking my brain over this for the past few days with no results.

import stanza 
import csv
stanza.download('en') 

nlp = stanza.Pipeline('en')

def get_human_names(text,output):
    with open(text, 'r', encoding = "ISO-8859-1") as txt_file:
        Lines=txt_file.readlines()
        person_list=[]
        for line in Lines:
            doc=nlp(str(line))
            for sent in doc.sentences:
                for token in sent.tokens:
                    if {token.ner}=='B-PERSON' or {token.ner}=='E-PERSON':
                        person_list.append({token.text})
                if(len(person_list)==0): ## avoid skipping entries in the output file
                    person_list=["no author"]
            with open(output, 'a') as csv_output:
                writer=csv.writer(csv_output)
                writer.writerow(person_list)

get_human_names('/Users/tancredirapone/Desktop/LoC_Project/titles.txt','/Users/tancredirapone/Desktop/LoC_Project/titles_author_stanza.csv')


Answer

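In case anyone has a similar issue… The comparison in the question can never succeed: {token.ner} is a set literal, so a one-element set is compared with a string and every row falls through to “no author”. The version below checks for "PERSON" in the tag instead. It seems to work, but the results are not altogether satisfactory (i.e. several names are missed). I don’t know if this is because of the code I wrote or just Stanza missing names once in a while, but I suspect it’s the latter.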

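import csv
import stanza

stanza.download('en')
nlp = stanza.Pipeline('en')

# Open both files once; append mode is kept so existing output is preserved.
with open('/Users/tancredirapone/Desktop/LoC_Project/titles.csv', 'r', encoding="ISO-8859-1") as txt_file, \
        open('/Users/tancredirapone/Desktop/LoC_Project/author_names.csv', 'a', newline='') as csv_output:
    reader = csv.reader(txt_file)
    writer = csv.writer(csv_output)
    for row in reader:
        # Join the CSV cells into plain text; str(row) would pass the list's repr, brackets and quotes included.
        text = ' '.join(row).strip()
        person_list = []
        if text:
            doc = nlp(text)
            for sentence in doc.sentences:
                for token in sentence.tokens:
                    # token.ner is a tag string such as 'B-PERSON' or 'O'; no braces needed.
                    if "PERSON" in token.ner:
                        person_list.append(token.text)
        if not person_list:  # avoid skipping entries in the output file
            person_list = ["no author"]
        writer.writerow(person_list)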
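
The “no author” rows from the question can be reproduced without Stanza, since they come from the braces around token.ner and token.text rather than from the tagger. The sketch below uses a hypothetical FakeToken class as a stand-in for a Stanza token (it is not part of Stanza) and only illustrates how Python treats the braces.

```python
class FakeToken:
    """Hypothetical stand-in for a Stanza token; only .text and .ner matter here."""

    def __init__(self, text, ner):
        self.text = text
        self.ner = ner


token = FakeToken("Austen", "E-PERSON")

# Braces build a one-element set, and a set never equals a string.
braces_match = ({token.ner} == 'E-PERSON')

# Comparing the attribute itself behaves as intended.
plain_match = (token.ner == 'E-PERSON')

# Appending {token.text} stores a set; csv.writer converts it with str().
stored_cell = str({token.text})
```

Dropping the braces in both places should be enough to make the comparison in the question match and to write clean names to the CSV file.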

Another possibility is that Stanza misses foreign names, but as far as I know it’s not possible to create a single pipeline with multiple languages (e.g. nlp=stanza.Pipeline('en', 'de', 'fr' …)).
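
Some of the unsatisfactory output may also come from how names are assembled rather than from detection: collecting tagged tokens one at a time puts “Jane” and “Austen” in separate cells. Below is a minimal sketch of grouping BIOES tags (the B-, I-, E- and S- prefixes seen in token.ner) back into full names. The function name group_person_names and the sample tags are illustrative assumptions, written by hand rather than taken from Stanza output; in real use the pairs would be built from (token.text, token.ner).

```python
def group_person_names(tagged_tokens):
    """Join consecutive PERSON-tagged tokens into full names.

    tagged_tokens: iterable of (text, tag) pairs with BIOES tags,
    e.g. ('Jane', 'B-PERSON'), ('Austen', 'E-PERSON'), ('by', 'O').
    """
    names = []
    current = []
    for text, tag in tagged_tokens:
        prefix, _, label = tag.partition('-')
        if label != 'PERSON':
            # Non-person token: close any name still in progress.
            if current:
                names.append(' '.join(current))
                current = []
            continue
        if prefix in ('B', 'S') and current:
            # A new name starts before the previous one was closed.
            names.append(' '.join(current))
            current = []
        current.append(text)
        if prefix in ('E', 'S'):
            # End of a multi-token name, or a single-token name.
            names.append(' '.join(current))
            current = []
    if current:
        names.append(' '.join(current))
    return names


# Hand-written sample tags for illustration (not actual Stanza output).
sample = [
    ('Emma', 'O'), ('by', 'O'),
    ('Jane', 'B-PERSON'), ('Austen', 'E-PERSON'),
    ('edited', 'O'), ('by', 'O'),
    ('Chapman', 'S-PERSON'),
]
names = group_person_names(sample)
```

Stanza also exposes recognized spans directly as doc.ents, each with a .text and a .type, which may make this grouping unnecessary if only whole names are needed.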

User contributions licensed under: CC BY-SA