I have a txt file that looks like this:
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER

BRUSSELS NNP B-NP B-LOC
1996-08-22 CD I-NP O
I'm trying to build tuples from this file, which I will later convert from words to features. I want a list of lists that looks like this:
[(EU, NNP, B-NP, B-ORG), (rejects, VBZ, B-VP, O), (German, JJ, B-NP, B-MISC), (call, NN, I-NP, O) ..... (Peter, NNP, B-NP, B-PER), (Blackburn, NNP, I-NP, I-PER), (BRUSSELS, NNP, B-NP, B-LOC), (1996-08-22, CD, I-NP, O)]
Each blank line indicates that a sentence is over. The tokens of that sentence should be added to the list at the current index, and after the blank line we should move on to the next index of the list, until all sentences have been added.
# function to read data, return list of tuples each tuple represents a token contains word, pos tag, chunk tag, and ner tag
import csv

def read_data(filename) -> list:
    data = []
    sentences = []
    with open(filename) as load_file:
        reader = csv.reader(load_file, delimiter=" ")   # read
        for row in reader:
            if(len(tuple(row)) != 0):
                data.append(tuple(row))
        sentences.append(data)
    return sentences
I have a function like this; however, it returns this:
('EU', 'NNP', 'B-NP', 'B-ORG'), ('rejects', 'VBZ', 'B-VP', 'O'), ('German', 'JJ', 'B-NP', 'B-MISC'), ('call', 'NN', 'I-NP', 'O'), ('to', 'TO', 'B-VP', 'O'), ('boycott', 'VB', 'I-VP', 'O'), ('British', 'JJ', 'B-NP', 'B-MISC'), ('lamb', 'NN', 'I-NP', 'O'), ('.', '.', 'O', 'O'), ('Peter', 'NNP', 'B-NP', 'B-PER'), ('Blackburn', 'NNP', 'I-NP', 'I-PER'), ('BRUSSELS', 'NNP', 'B-NP', 'B-LOC'), ('1996-08-22', 'CD', 'I-NP', 'O'),
How can I solve this problem? I tried using two different lists and adding them together, but I could not find a way.
Answer
I think the whole problem comes from the expected result you show:
[(EU, NNP, B-NP, B-ORG), (rejects, VBZ, B-VP, O), (German, JJ, B-NP, B-MISC), (call, NN, I-NP, O) ..... (Peter, NNP, B-NP, B-PER), (Blackburn, NNP, I-NP, I-PER), (BRUSSELS, NNP, B-NP, B-LOC), (1996-08-22, CD, I-NP, O)]
but I think what you actually expect is:
[
    [(EU, NNP, B-NP, B-ORG), (rejects, VBZ, B-VP, O), (German, JJ, B-NP, B-MISC), (call, NN, I-NP, O) .....],
    [(Peter, NNP, B-NP, B-PER), (Blackburn, NNP, I-NP, I-PER)],
    [(BRUSSELS, NNP, B-NP, B-LOC), (1996-08-22, CD, I-NP, O)],
]
This requires closing the current sentence and starting a new list whenever an empty row is read:
for row in reader:
    if row:
        data.append(tuple(row))
    else:
        sentences.append(data)
        data = []
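The `if row:` test works because `csv.reader` yields an empty list for a blank line. A minimal sketch that shows this, using a shortened two-sentence sample that is made up here for illustration (it is not the full file from the question):

```python
import csv
import io

# Hypothetical shortened sample: one token per line,
# with a blank line separating the two sentences.
sample = "EU NNP B-NP B-ORG\n\nPeter NNP B-NP B-PER\n"

# csv.reader accepts any iterable of lines, including an in-memory file.
rows = list(csv.reader(io.StringIO(sample), delimiter=" "))

print(rows)
# [['EU', 'NNP', 'B-NP', 'B-ORG'], [], ['Peter', 'NNP', 'B-NP', 'B-PER']]
```

The empty list in the middle is falsy, so it falls into the `else` branch and marks the sentence boundary.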
At the end, you may also need to add the last data, because there is no empty line after it:
if data:
    sentences.append(data)
Full working example.
I use io only to simulate a file in memory, so that everyone can copy and run it. In your own code you should use open() instead and drop the text variable.
text = '''EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER

BRUSSELS NNP B-NP B-LOC
1996-08-22 CD I-NP O'''

import csv
import io

data = []
sentences = []

#with open(filename) as load_file:
with io.StringIO(text) as load_file:
    reader = csv.reader(load_file, delimiter=" ")   # read
    for row in reader:
        if row:
            data.append(tuple(row))
        else:
            sentences.append(data)
            data = []

# add last data because there is no empty line after these data
if data:
    sentences.append(data)

print(sentences)
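The same logic can be folded back into the `read_data` function from the question. This is only a sketch, not a definitive version. It assumes one token per line with blank lines between sentences. The `elif data:` check is my own addition so that several blank lines in a row do not produce empty sentences. The temporary file exists only to make the example runnable.

```python
import csv
import os
import tempfile


def read_data(filename) -> list:
    """Return a list of sentences; each sentence is a list of token tuples."""
    sentences = []
    data = []
    with open(filename, newline="") as load_file:
        reader = csv.reader(load_file, delimiter=" ")
        for row in reader:
            if row:
                # Non-empty row: one token (word, pos tag, chunk tag, ner tag).
                data.append(tuple(row))
            elif data:
                # Blank line after at least one token: the sentence is over.
                sentences.append(data)
                data = []
    # The file may not end with a blank line, so flush the last sentence.
    if data:
        sentences.append(data)
    return sentences


# Usage with a throwaway file holding a hypothetical shortened sample.
sample = "EU NNP B-NP B-ORG\nrejects VBZ B-VP O\n\nPeter NNP B-NP B-PER\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

result = read_data(path)
os.remove(path)

print(result)
# [[('EU', 'NNP', 'B-NP', 'B-ORG'), ('rejects', 'VBZ', 'B-VP', 'O')], [('Peter', 'NNP', 'B-NP', 'B-PER')]]
```

Each inner list is one sentence, so `result[0]` is the first sentence and `result[0][0]` is its first token tuple.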