Hello, I have two JSONL files like so:
one.jsonl
{"name": "one", "description": "testDescription...", "comment": "1"} {"name": "two", "description": "testDescription2...", "comment": "2"}
second.jsonl
{"name": "eleven", "description": "testDescription11...", "comment": "11"} {"name": "twelve", "description": "testDescription12...", "comment": "12"} {"name": "thirteen", "description": "testDescription13...", "comment": "13"}
And my goal is to write a new JSONL file (with encoding preserved) named merged_file.jsonl, which will look like this:
{"name": "one", "description": "testDescription...", "comment": "1"} {"name": "two", "description": "testDescription2...", "comment": "2"} {"name": "eleven", "description": "testDescription11...", "comment": "11"} {"name": "twelve", "description": "testDescription12...", "comment": "12"} {"name": "thirteen", "description": "testDescription13...", "comment": "13"}
My approach is like this:
import json
import glob

result = []
for f in glob.glob("folder_with_all_jsonl/*.jsonl"):
    with open(f, 'r', encoding='utf-8-sig') as infile:
        try:
            result.append(extract_json(infile))  # tried json.loads(infile) too
        except ValueError:
            print(f)

# write the file with a BOM to preserve the emojis and special characters
with open('merged_file.jsonl', 'w', encoding='utf-8-sig') as outfile:
    json.dump(result, outfile)
However, I am met with this error:
TypeError: Object of type generator is not JSON serializable
I will appreciate your hints/help in any way. Thank you! I have looked at other SO posts, but they all write normal JSON files, which should work in my case too, yet it keeps failing.
Reading a single file like this works:
import io

data_json = io.open('one.jsonl', mode='r', encoding='utf-8-sig')  # opens the JSONL file
data_python = extract_json(data_json)
for line in data_python:
    print(line)

#### outputs ####
# {'name': 'one', 'description': 'testDescription...', 'comment': '1'}
# {'name': 'two', 'description': 'testDescription2...', 'comment': '2'}
Answer
It is possible that extract_json returns a generator instead of a list/dict, and a generator is not JSON serializable. Since the files are JSONL, each line is a valid JSON document, so you only need to tweak your existing code a little bit (the tweaked version follows the short illustration below).
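You can reproduce the exact error with a one-liner (a minimal illustration, not your code):

import json

gen = (x for x in [1, 2, 3])
json.dumps(gen)
# TypeError: Object of type generator is not JSON serializable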
import json
import glob

result = []
for f in glob.glob("folder_with_all_jsonl/*.jsonl"):
    with open(f, 'r', encoding='utf-8-sig') as infile:
        for line in infile.readlines():
            try:
                result.append(json.loads(line))  # parse each line of the file
            except ValueError:
                print(f)

# This outputs JSONL
with open('merged_file.jsonl', 'w', encoding='utf-8-sig') as outfile:
    # json.dump(result, outfile)
    # write each object as its own JSON line
    outfile.write("\n".join(map(json.dumps, result)))
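If you prefer, you can also write each object out as soon as it is parsed instead of collecting everything in result first; adding a trailing newline after every object keeps the output valid JSONL. A minimal sketch of that variant (same folder layout assumed; ensure_ascii=False is optional and keeps emojis as literal characters instead of \u escapes):

import json
import glob

with open('merged_file.jsonl', 'w', encoding='utf-8-sig') as outfile:
    for f in glob.glob("folder_with_all_jsonl/*.jsonl"):
        with open(f, 'r', encoding='utf-8-sig') as infile:
            for line in infile:
                try:
                    obj = json.loads(line)
                except ValueError:
                    print(f)  # report files with bad/blank lines and skip them
                    continue
                outfile.write(json.dumps(obj, ensure_ascii=False) + "\n")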
Now that I think about it, you didn't even have to parse the lines with json; doing so only helps you catch (and drop) any badly formatted JSON lines.
You could collect all the lines in one shot like this:
import glob

outfile = open('merged_file.jsonl', 'w', encoding='utf-8-sig')
for f in glob.glob("folder_with_all_jsonl/*.jsonl"):
    with open(f, 'r', encoding='utf-8-sig') as infile:
        for line in infile.readlines():
            outfile.write(line)
outfile.close()
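As an optional sanity check, you can read the merged file back and confirm that every line still parses and the record count matches the inputs:

import json

with open('merged_file.jsonl', 'r', encoding='utf-8-sig') as f:
    merged = [json.loads(line) for line in f if line.strip()]
print(len(merged))  # should equal the total number of records across the input files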