PySpark 2.4 – Read CSV file with custom line separator

Support for custom line separators (for various text file formats) was added to Spark in 2017 (see: https://github.com/apache/spark/pull/18581).

… or maybe it wasn’t added in 2017 – or ever (see: https://github.com/apache/spark/pull/18304)

Today, with PySpark 2.4.0, I am unable to use custom line separators to parse CSV files.

Here’s some code:

Here are two sample CSV files:

one.csv – lines are separated by the line feed character (hex 0A)

two.csv – lines are separated by the group separator character (hex 1D)

I want the output from the code to be:

The output I receive is:

Any ideas on how I can get PySpark to accept the group separator character as a line separator?

Answer

I can get the result I want with this:

It depends on Pandas, and the data might be read twice here (I’m not sure what happens internally when an RDD is created from a Pandas DataFrame).
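For anyone who wants to avoid the Pandas dependency, here is a minimal sketch of a different approach that uses only the Python standard library. It is not the code from this answer. It assumes the whole file fits in driver memory and that no quoted field contains the separator character. The names parse_records and to_spark and the sample data are hypothetical, made up for illustration only.

```python
import csv
from typing import List, Tuple

GROUP_SEP = "\x1d"  # ASCII group separator (hex 1D)


def parse_records(raw: str, line_sep: str = GROUP_SEP) -> Tuple[List[str], List[Tuple[str, ...]]]:
    """Split raw text on a custom record separator and parse each record as CSV.

    Assumption: the separator never appears inside a quoted field,
    otherwise that record would be split in the wrong place.
    """
    # Drop empty records, e.g. the one after a trailing separator.
    records = [r for r in raw.split(line_sep) if r.strip()]
    parsed = list(csv.reader(records))
    header = parsed[0]
    rows = [tuple(r) for r in parsed[1:]]
    return header, rows


def to_spark(spark, header, rows):
    """Hand the parsed rows to an existing SparkSession (not called below)."""
    # All columns arrive as strings; cast them afterwards if needed.
    return spark.createDataFrame(rows, schema=header)


# Hypothetical sample data, not the contents of the original two.csv.
sample = "id,name\x1d1,alpha\x1d2,beta\x1d"
header, rows = parse_records(sample)
```

With a real file, the text could be read with open("two.csv", newline="").read() so that Python does not translate any newline characters, and the result passed to to_spark together with an active SparkSession. Because everything is parsed on the driver, this only suits files of modest size.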
