I have a file that is the output of a MapReduce job, so it is in key-value
format:
'null\t[0, [[0, 21], [1, 4], [2, 5]]]\n' 'null\t[1, [[0, 3], [1, 1], [2, 2]]]\n'
I want to remove all the characters except the second element of each value list:
[[0, 21], [1, 4], [2, 5]] [[0, 3], [1, 1], [2, 2]]
Finally, I want to collect them all into a single list:
[[[0, 21], [1, 4], [2, 5]], [[0, 3], [1, 1], [2, 2]]]
This is my attempt so far:
import re

corpus = []
with open(FILENAME) as f:
    content = f.readlines()

for line in content:
    # Match everything up to the first "[[" and replace it with "["
    clean_line = re.sub(r'^.*?\[\[', '[', line)
    # Remove the newline, then drop the last two "]]" of the string
    clean_line = re.sub('[\n]', '', clean_line)[:-2]
    corpus.append(clean_line)
Output:
['[0, 21], [1, 4], [2, 5]', '[0, 3], [1, 1], [2, 2]']
As you can see, each element is still of type str. How can I convert it to a list?
Answer
Treat each line as JSON: replace the non-JSON parts of the line so that it becomes a valid JSON document, then parse it and pick out the element you need:
import json

# Wrap each line as {"a": <value>} so it parses as JSON, then take the nested list
corpus = [
    json.loads(line.replace('null\t', '{"a":').replace("\n", "}"))["a"][1]
    for line in content
]
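A possible variation on the same idea, assuming every line has the form key, tab, JSON value: split off the key at the tab and parse the rest directly, which avoids the string wrapping. This is only a sketch. The names sample_lines and parse_line are made up for illustration, and the sample data stands in for the lines read from the file.

```python
import json

# Sample lines in the same shape as the MapReduce output: key, tab, JSON value.
# (Stand-in for the result of f.readlines(); not read from a real file.)
sample_lines = [
    'null\t[0, [[0, 21], [1, 4], [2, 5]]]\n',
    'null\t[1, [[0, 3], [1, 1], [2, 2]]]\n',
]


def parse_line(line):
    """Return the second element of the JSON value on one key-value line."""
    # Split at the first tab only; everything after it is plain JSON.
    _key, value = line.split('\t', 1)
    # json.loads ignores the trailing newline; index 1 is the nested list.
    return json.loads(value)[1]


corpus = [parse_line(line) for line in sample_lines]
print(corpus)  # [[[0, 21], [1, 4], [2, 5]], [[0, 3], [1, 1], [2, 2]]]
```

Because the value is parsed rather than sliced, each element of corpus is a real list, and a final line without a trailing newline would still parse.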