I have a file that is the output of a MapReduce job, so each line is in key-value
format:
'null\t[0, [[0, 21], [1, 4], [2, 5]]]\n'
'null\t[1, [[0, 3], [1, 1], [2, 2]]]\n'
I want to remove all the characters except the second element of each value list:
[[0, 21], [1, 4], [2, 5]]
[[0, 3], [1, 1], [2, 2]]
And finally, add each to a single list:
[[[0, 21], [1, 4], [2, 5]], [[0, 3], [1, 1], [2, 2]]]
This is my attempt so far:
import re

corpus = []
with open(FILENAME) as f:
    content = f.readlines()

for line in content:
    # Match all the chars up to the first "[[" and replace them with "["
    clean_line = re.sub(r'^.*?\[\[', '[', line)
    # Then remove "\n" and the last two "]]" of the string
    clean_line = re.sub('[\n]', '', clean_line)[:-2]
    corpus.append(clean_line)
Output:
['[0, 21], [1, 4], [2, 5]', '[0, 3], [1, 1], [2, 2]']
As you can see, each element is still of type str. How can I convert it to type list?
Answer
Treat each line as JSON: replace the non-JSON parts of the line so that it becomes a valid JSON document, then parse it.
import json
corpus = [json.loads(line.replace('null\t', '{"a":').replace("\n", "}"))["a"][1] for line in content]
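A variant of the same idea, as a minimal sketch: assuming the key and the value on each line are separated by a single tab character and the value is already valid JSON, you can split on the first tab and parse the value directly, without wrapping it in a temporary object. The helper name `parse_values` and the inline `content` sample are made up for illustration; in practice `content` would come from `f.readlines()`.

```python
import json


def parse_values(lines):
    """Return the second element of each line's JSON value.

    Hypothetical helper; assumes every line looks like 'key<TAB>json_value'.
    """
    corpus = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines, e.g. an empty line at the end of the file
        # Split once on the first tab: the key is discarded, the value is kept
        _key, value = line.split("\t", 1)
        # The value is a two-element JSON list; keep its second element
        corpus.append(json.loads(value)[1])
    return corpus


# Sample lines in the assumed 'key<TAB>value' layout
content = [
    "null\t[0, [[0, 21], [1, 4], [2, 5]]]\n",
    "null\t[1, [[0, 3], [1, 1], [2, 2]]]\n",
]
corpus = parse_values(content)
print(corpus)  # [[[0, 21], [1, 4], [2, 5]], [[0, 3], [1, 1], [2, 2]]]
```

This version does not depend on the key being the literal text null, and it still works if the last line of the file has no trailing newline.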