
How to convert string of list of list to list?

I have a file that is the output of a MapReduce job, so each line is in key-value format:

'null\t[0, [[0, 21], [1, 4], [2, 5]]]\n'
'null\t[1, [[0, 3], [1, 1], [2, 2]]]\n'

I want to strip everything except the second element of each value list:

[[0, 21], [1, 4], [2, 5]]
[[0, 3], [1, 1], [2, 2]]

Finally, I want to collect them into a single list:

[[[0, 21], [1, 4], [2, 5]], [[0, 3], [1, 1], [2, 2]]]

This is my attempt so far:

import re

corpus = []

with open(FILENAME) as f:
    content = f.readlines()

for line in content:
    # Match everything up to and including "[[", and replace it with "["
    clean_line = re.sub(r'^.*?\[\[', '[', line)
    # Remove the newline and the last two "]]" of the string
    clean_line = re.sub('[\n]', '', clean_line)[:-2]
    corpus.append(clean_line)

Output:

['[0, 21], [1, 4], [2, 5]', '[0, 3], [1, 1], [2, 2]']

As you can see, each element is still a str. How can I convert them to lists?

Answer

Treat each line as JSON: replace the non-JSON parts of the line so that what remains is a valid JSON document, then parse it.

import json
corpus = [json.loads(line.replace('null\t', '{"a":').replace("\n", "}"))["a"][1] for line in content]
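The same idea can be written without the string replacements. This is only a sketch, and it assumes that every line has the form key, tab, JSON value, newline, as in the sample above. The `content` list here is sample data standing in for the lines read from the file, and `parse_line` is a hypothetical helper name:

```python
import json

# Sample lines standing in for the file contents
# (assumed format: key, tab, JSON value, newline).
content = [
    'null\t[0, [[0, 21], [1, 4], [2, 5]]]\n',
    'null\t[1, [[0, 3], [1, 1], [2, 2]]]\n',
]

def parse_line(line):
    # Split off the key at the first tab; the remainder is the JSON value.
    _key, value = line.rstrip('\n').split('\t', 1)
    # The value has the shape [index, pairs]; keep only the pairs.
    return json.loads(value)[1]

corpus = [parse_line(line) for line in content]
print(corpus)
```

Splitting on the tab does not depend on the key being the literal text null, so it should keep working if the job ever emits a different key, as long as the value part stays valid JSON.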

