
Apache Beam: Reading and transforming multiple data types from a single file

Is there a way to read each value with its actual data type into a PCollection from a CSV file?
By default, every value in a row read into a PCollection is converted to a string, but is there a way for an integer to be read as an integer, a float as a float, a double as a double, a string as a string, and so on?
That way, PTransforms could be applied directly to each value of a row.
Or does this have to be done separately with a ParDo function?


Answer

The root of your issue is that a CSV file only contains strings, so it is necessary to parse the strings as whatever type you know the column contains.
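If you want to do that parsing yourself, the ParDo/Map approach mentioned in the question works fine. Below is a minimal sketch assuming a hypothetical file with id, price, and name columns; the column layout and file name are placeholders, not something fixed by Beam:

import apache_beam as beam

def parse_row(line):
    # Split the raw CSV line and cast each field to the type
    # the column is known to contain (layout is hypothetical).
    fields = line.split(',')
    return {
        'id': int(fields[0]),
        'price': float(fields[1]),
        'name': fields[2],
    }

with beam.Pipeline() as pipeline:
    typed_rows = (
        pipeline
        | 'Read' >> beam.io.ReadFromText('input.csv', skip_header_lines=1)
        | 'Parse' >> beam.Map(parse_row)
    )

The drawback is that you have to spell out the type of every column by hand, which is what the dataframe API below avoids.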

A convenient way to do this is to use Beam’s pandas-compatible dataframe API to read your CSV files, as in:

from apache_beam.dataframe.io import read_csv

df = pipeline | read_csv(...)

This will use pandas to sample the CSV file and infer the type of each column.
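As a slightly fuller sketch (the file name 'input.csv' is a placeholder), you can read the CSV with the dataframe API and then convert the deferred DataFrame back into a PCollection of typed rows for ordinary PTransforms:

import apache_beam as beam
from apache_beam.dataframe.io import read_csv
from apache_beam.dataframe.convert import to_pcollection

with beam.Pipeline() as pipeline:
    # read_csv samples the file with pandas and infers a dtype per column,
    # so downstream code sees typed values rather than strings.
    df = pipeline | read_csv('input.csv')

    # Convert the deferred DataFrame back to a PCollection of
    # named-tuple rows if you prefer ordinary PTransforms.
    rows = to_pcollection(df)
    rows | 'Print' >> beam.Map(print)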

You can see more examples and explanation at https://beam.apache.org/documentation/dsls/dataframes/overview/#using-dataframes
