Read whole file in Apache Beam

Is it possible to read a whole file (not line by line) in Apache Beam?

For example, I want to read multiline JSON files. My idea is to read file by file, extract the data from each file, and create a PCollection from the resulting lists.

Is this a good idea, or is it better to preprocess the source JSON files into a single file where each line is a separate JSON document?

Thank you in advance.

Answer

TextIO reads files line by line, so in your test.json each line needs to contain a separate JSON object.
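A minimal sketch of this line-oriented reading with the Beam Python SDK, assuming the input is already in one-object-per-line form. The file pattern `input/*.jsonl` and the names `parse_json_line` and `run` are hypothetical and chosen only for illustration:

```python
import json


def parse_json_line(line):
    """Parse one line of text; each line must hold a complete JSON document."""
    return json.loads(line)


def run(file_pattern="input/*.jsonl"):
    # Assumption: apache_beam is installed. It is imported inside the function
    # so that parse_json_line can be reused and tested without Beam.
    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            # ReadFromText emits one element per line of every matched file.
            | "ReadLines" >> beam.io.ReadFromText(file_pattern)
            # Each line becomes a Python object (dict, list, ...).
            | "ParseJson" >> beam.Map(parse_json_line)
            | "Print" >> beam.Map(print)
        )
```

Calling `run()` would execute the pipeline on the default local runner. A line that is not valid JSON makes `json.loads` raise, so real input may need error handling around the parse step.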

The idea behind Beam, or any distributed processing engine, is to parallelize work over the input data. From your question, it looks like some preprocessing is needed to split the multiline JSON into one JSON object per line. Note that the result need not be a single file: you can have multiple files, each containing any number of JSON objects. Beam will read the rows in parallel.
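The preprocessing step could look roughly like the following, using only the Python standard library. This is a sketch under assumptions: each source file is taken to hold exactly one JSON document, a top-level list is treated as several records, and the function name `to_json_lines` is hypothetical:

```python
import glob
import json
import os


def to_json_lines(input_pattern, output_path):
    """Write every JSON document matched by input_pattern as one line of output_path.

    Returns the number of records written.
    """
    count = 0
    output_abs = os.path.abspath(output_path)
    with open(output_path, "w", encoding="utf-8") as out:
        for path in sorted(glob.glob(input_pattern)):
            # Skip the output file in case the pattern also matches it.
            if os.path.abspath(path) == output_abs:
                continue
            with open(path, encoding="utf-8") as src:
                document = json.load(src)  # parses the whole multiline file
            # Assumption: a top-level list means several records, anything else is one.
            records = document if isinstance(document, list) else [document]
            for record in records:
                # json.dumps without indent emits no raw newlines: one record per line.
                out.write(json.dumps(record) + "\n")
                count += 1
    return count
```

For example, `to_json_lines("raw/*.json", "prepared/data.jsonl")` (both paths hypothetical) would produce a file that a line-oriented reader can consume. Writing several output files instead of one works just as well.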

Do accept the answer if that helped.
