PySpark 2.4 – Read CSV file with custom line separator

Question

Support for custom line separators (for various text file formats) was added to Spark in 2017 (see: https://github.com/apache/spark/pull/18581). … or maybe it wasn't added in 2017 – or ever (see: https://github.com/apache/spark/pull/18304). Today, with PySpark 2.4.0 I am unable to use custom …

Accepted Answer

I can get the result I want with this:

    import pandas as pd

    # "\x1e" (ASCII record separator) delimits fields;
    # "\x1d" (ASCII group separator) terminates each row.
    padf = pd.read_csv("/dbfs/mnt/two.csv",
                       engine="c",
                       sep="\x1e",
                       lineterminator="\x1d",
                       header=None,
                       names=['id', 'desc'])

    # Convert the pandas DataFrame into a Spark DataFrame.
    df = sqlContext.createDataFrame(padf)
    print("two.csv rowcount: {}".format(df.count()))

It depends on pandas, and the data might be read twice here (I'm not sure what happens internally when an RDD is created from a pandas DataFrame).
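To make clear what the two separators do, here is a minimal sketch that splits the same kind of data using only the Python standard library. The helper name `parse_records` and the sample string are made up for illustration; they are not part of pandas or PySpark, and this is not a replacement for a real CSV parser (no quoting or escaping is handled).

```python
# Hypothetical helper, not part of pandas or PySpark: splits raw text on
# custom field and row separators using only the standard library.
FIELD_SEP = "\x1e"  # ASCII record separator, used here between fields
ROW_SEP = "\x1d"    # ASCII group separator, used here between rows


def parse_records(text, field_sep=FIELD_SEP, row_sep=ROW_SEP):
    """Return a list of tuples, one per non-empty row in `text`."""
    rows = []
    for raw_row in text.split(row_sep):
        if raw_row == "":
            # Skip the empty piece left behind by a trailing row separator.
            continue
        rows.append(tuple(raw_row.split(field_sep)))
    return rows


# Made-up sample data in the same shape as two.csv (id, desc).
sample = "1\x1efirst row\x1d2\x1esecond row\x1d"
rows = parse_records(sample)
print(rows)
```

A list of tuples like `rows` could presumably be handed to `sqlContext.createDataFrame(rows, ['id', 'desc'])` in place of the pandas DataFrame, with every column arriving as a string. Like the pandas approach, this reads the whole file on the driver, so it only suits files that fit in memory.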

Answer