
Apache Beam Python: returning conditional statement using ParDo class

I want to check whether the CSV file read in the Apache Beam pipeline satisfies the format I expect it to be in [e.g. field check, type check, null value check, etc.] before performing any transformation.

Performing these checks outside the pipeline for every file would defeat the purpose of parallelism, so I wanted to know whether it is possible to perform them inside the pipeline.

An example of what the code might look like:



Answer

I replicated a small instance of the problem with just two rules: if the length of the 3rd column is > 10 and the value of the 4th column is > 500, the record is good; otherwise it is a bad record.


I have written a ParDo FilterFn that tags output based on an isGoodRow function. You can rewrite the logic for your requirements, e.g. null checks, data validation, etc. Another option would be to use a Partition function for this use case.

Running the pipeline on the sample.txt input splits the records into a good.txt output and a bad.txt output.