Extracting human names from text data using Python and Stanza

Question

I have a dataset containing the text of book title pages (i.e. all the words on each title page; each line of my txt file is a different book). From this I am trying to retrieve the author's name, taken to be the human name that appears on the title page, and store each name on a separate line in a CSV file.

Accepted Answer

In case anyone has a similar issue... This seems to work, but the results are not altogether satisfactory (i.e. several names are missed). I don't know if this is because of the code I wrote or just stanza missing names once in a while, but I suspect it's the latter.

    import csv
    import stanza

    stanza.download('en')
    nlp = stanza.Pipeline('en')

    # Open the input and output files once, rather than reopening the output per row
    with open('/Users/tancredirapone/Desktop/LoC_Project/titles.csv', 'r', encoding='ISO-8859-1') as txt_file, \
         open('/Users/tancredirapone/Desktop/LoC_Project/author_names.csv', 'a', newline='') as csv_output:
        reader = csv.reader(txt_file)
        writer = csv.writer(csv_output)
        for row in reader:
            # csv.reader yields a list of fields; rejoin them into one string
            doc = nlp(' '.join(row))
            # Token-level NER tags look like "B-PERSON", "E-PERSON" or "O"
            person_list = [token.text
                           for sentence in doc.sentences
                           for token in sentence.tokens
                           if 'PERSON' in token.ner]
            if not person_list:
                person_list = ['no author']
            writer.writerow(person_list)

One possibility is that stanza misses foreign names, but as far as I know it's not possible to create a single pipeline with multiple languages (e.g. nlp = stanza.Pipeline('en', 'de', 'fr', ...)).
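One side effect of collecting tokens one at a time is that a name such as "Jane Austen" ends up as two separate cells in the output row. Below is a minimal sketch of how consecutive PERSON tokens could be merged back into full names. It assumes the token tags follow the BIOES scheme ("B-PERSON", "I-PERSON", "E-PERSON", "S-PERSON", "O"); the helper name group_person_names and the sample data are hypothetical and hand-written, not real Stanza output.

```python
def group_person_names(tagged_tokens):
    """Merge consecutive PERSON-tagged tokens into full names.

    tagged_tokens: iterable of (text, tag) pairs, where tag is assumed to
    follow the BIOES scheme, e.g. "B-PERSON", "E-PERSON", "S-PERSON", "O".
    """
    names = []
    current = []  # tokens of the name currently being assembled
    for text, tag in tagged_tokens:
        prefix, _, label = tag.partition("-")
        if label != "PERSON":
            # Any non-PERSON token closes a name that is still open.
            if current:
                names.append(" ".join(current))
                current = []
            continue
        if prefix in ("B", "S") and current:
            # A new name starts before the previous one was closed.
            names.append(" ".join(current))
            current = []
        current.append(text)
        if prefix in ("E", "S"):
            # End of a multi-token name, or a single-token name.
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


# Hand-written sample tags for illustration only (not real Stanza output).
sample = [
    ("Pride", "O"), ("and", "O"), ("Prejudice", "O"), ("by", "O"),
    ("Jane", "B-PERSON"), ("Austen", "E-PERSON"),
    ("and", "O"), ("Homer", "S-PERSON"),
]
print(group_person_names(sample))
```

With Stanza, the input pairs could be built as [(t.text, t.ner) for s in doc.sentences for t in s.tokens]. Stanza also exposes already-merged entity spans on the document (doc.ents, each with .text and .type), which may make a helper like this unnecessary.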
