I have a dataset containing the text of book title pages (i.e. all the words on the title page; each line of my txt file is a different book). From this I am trying to retrieve the author's name, taken as the human name that appears on the title page, and store the names for each book on a separate line in a CSV file. When I run the following code I get a "no author" value for every entry, which is not plausible given the input data. Can someone help me figure out what is going wrong? Thanks, I have been racking my brain over this for the past few days with no results.
import stanza
import csv

stanza.download('en')
nlp = stanza.Pipeline('en')

def get_human_names(text, output):
    with open(text, 'r', encoding="ISO-8859-1") as txt_file:
        Lines = txt_file.readlines()
        person_list = []
        for line in Lines:
            doc = nlp(str(line))
            for sent in doc.sentences:
                for token in sent.tokens:
                    if {token.ner}=='B-PERSON' or {token.ner}=='E-PERSON':
                        person_list.append({token.text})
            if(len(person_list)==0):  ## avoid skipping entries in the output file
                person_list = ["no author"]
            with open(output, 'a') as csv_output:
                writer = csv.writer(csv_output)
                writer.writerow(person_list)

get_human_names('/Users/tancredirapone/Desktop/LoC_Project/titles.txt', '/Users/tancredirapone/Desktop/LoC_Project/titles_author_stanza.csv')
Answer
In case anyone has a similar issue: this seems to work, but the results are not altogether satisfactory (several names are missed). I don't know if this is because of the code I wrote or just stanza missing names once in a while, but I suspect it's the latter.
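import csv
import stanza

stanza.download('en')
nlp = stanza.Pipeline('en')

with open('/Users/tancredirapone/Desktop/LoC_Project/titles.csv', 'r', encoding="ISO-8859-1") as txt_file:
    reader = csv.reader(txt_file)
    for row in reader:
        # Start each book with an empty list so names do not carry over
        person_list = []
        # row is a list of CSV fields; str(row) passes its printed form to stanza
        doc = nlp(str(row))
        for sentence in doc.sentences:
            for token in sentence.tokens:
                # token.ner is already a string such as 'B-PERSON', so test it
                # directly instead of wrapping it in a set with {...}
                if "PERSON" in token.ner:
                    person_list.append(token.text)
        if len(person_list) == 0:  # avoid skipping entries in the output file
            person_list = ["no author"]
        # newline='' stops the csv module from writing blank rows on Windows
        with open('/Users/tancredirapone/Desktop/LoC_Project/author_names.csv', 'a', newline='') as csv_output:
            writer = csv.writer(csv_output)
            writer.writerow(person_list)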
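For anyone wondering why the code in the question printed "no author" for every row: `{token.ner}` is a set literal holding one string, and a set never compares equal to a string, so the `if` was always false. A minimal sketch that needs no stanza at all (the tag value here is just an illustrative stand-in for what `token.ner` holds):

```python
# Illustrative stand-in for token.ner, which is a plain string.
ner_tag = "B-PERSON"

# Curly braces around a value build a set, not a formatted string.
wrapped = {ner_tag}

# A set is never equal to a string, so the original test could not match.
set_equals_string = (wrapped == "B-PERSON")

# Comparing the string itself behaves as intended.
string_equals_string = (ner_tag == "B-PERSON")

# str(wrapped) is "{'B-PERSON'}", which is why a substring test still matches.
substring_in_repr = "PERSON" in str(wrapped)

print(set_equals_string, string_equals_string, substring_in_repr)
```

The same applies to `person_list.append({token.text})`, which stores one-element sets rather than the name strings.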
One possibility is that stanza misses foreign names, but as far as I know it's not possible to create a pipeline with multiple languages (e.g. nlp=stanza.Pipeline('en', 'de', 'fr' …)).
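Separately from the language question, the token loop writes every name token as its own cell, so a two-word name ends up split. A rough sketch of joining consecutive tokens back into full names is below. It assumes the tags follow a BIOES scheme (B-, I-, E- for multi-token names, S- for single-token names), which is what the B-PERSON and E-PERSON tags in the question suggest; `group_person_names` is a hypothetical helper, not part of stanza, and the sample pairs are made up for illustration.

```python
def group_person_names(tagged_tokens):
    """Join consecutive PERSON tokens into full names.

    tagged_tokens: iterable of (text, tag) pairs. Tags are assumed to be
    BIOES-style strings such as 'B-PERSON', 'E-PERSON', 'S-PERSON' or 'O'.
    """
    names = []
    current = []  # tokens of the name currently being built

    def flush():
        # Store the name in progress (if any) and start over.
        if current:
            names.append(" ".join(current))
            current.clear()

    for text, tag in tagged_tokens:
        prefix, _, label = tag.partition("-")
        if label != "PERSON":
            flush()            # a non-PERSON token ends any open name
        elif prefix == "S":
            flush()
            names.append(text) # single-token name
        elif prefix == "B":
            flush()
            current.append(text)
        else:                  # 'I' or 'E': continue the open name
            current.append(text)
            if prefix == "E":
                flush()
    flush()                    # keep a name left open at the end of input
    return names


# Made-up sample; with stanza the pairs would be (token.text, token.ner).
sample = [
    ("Poems", "O"), ("by", "O"),
    ("Emily", "B-PERSON"), ("Dickinson", "E-PERSON"),
    ("and", "O"), ("Homer", "S-PERSON"),
]
print(group_person_names(sample))
```

If I read the stanza docs correctly, the recognized spans are also available directly as `doc.ents` (each with `.text` and `.type`), which may make a helper like this unnecessary.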