I have a data file saved in .txt format which has a header row at the top and is pipe delimited. I am working in Databricks and need to create a Spark dataframe of this data, with all columns read in as StringType(), the headers defined by the first row, and the columns separated based on the pipe delimiter.
When importing .csv files I am able to set the delimiter and header options. However, I am not able to get the .txt files to import in the same way.
Example Data (completely made up)… for ease, please imagine it is just called datafile.txt:
URN|Name|Supported
12233345757777701|Tori|Yes
32313185648456414|Dave|No
46852554443544854|Steph|No
I would really appreciate a hand in getting this imported into a Spark dataframe so that I can crack on with other parts of the analysis. Thank you!
Answer
Any delimiter-separated file is a good candidate for CSV reading methods. The ‘c’ of CSV is mostly by convention. Thus nothing stops us from reading this:
col1|col2|col3
0|1|2
1|3|8
Like this (in pure python):
import csv
from pathlib import Path

with Path("pipefile.txt").open() as f:
    reader = csv.DictReader(f, delimiter="|")
    data = list(reader)

print(data)
Since whatever custom reader your libraries are using probably uses csv.reader under the hood, you simply need to figure out how to pass the right separator to it.
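To make this concrete with the sample data from the question, here is a minimal pure-Python sketch; writing the file locally is purely for illustration, and the `datafile.txt` name is taken from the question:

```python
import csv
from pathlib import Path

# Recreate the question's sample file (illustration only).
sample = (
    "URN|Name|Supported\n"
    "12233345757777701|Tori|Yes\n"
    "32313185648456414|Dave|No\n"
    "46852554443544854|Steph|No\n"
)
path = Path("datafile.txt")
path.write_text(sample)

# csv.DictReader takes the first row as field names; delimiter="|"
# handles the pipe separation. Every value comes back as a str,
# which matches the all-StringType requirement.
with path.open() as f:
    rows = list(csv.DictReader(f, delimiter="|"))

print(rows[0])  # {'URN': '12233345757777701', 'Name': 'Tori', 'Supported': 'Yes'}
```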
@blackbishop notes in a comment that
spark.read.csv("datafile.txt", header=True, sep="|")
would be the appropriate Spark call. Since inferSchema defaults to false, every column is read as StringType, which matches the requirement in the question.