I have a data file saved in .txt format which has a header row at the top and is pipe delimited. I am working in Databricks and need to create a Spark dataframe of this data, with all columns read in as StringType(), the headers defined by the first row, and the columns separated based on the pipe delimiter.
When importing .csv files I am able to set the delimiter and header options. However, I am not able to get the .txt files to import in the same way.
Example Data (completely made up)… for ease, please imagine it is just called datafile.txt:
URN|Name|Supported
12233345757777701|Tori|Yes
32313185648456414|Dave|No
46852554443544854|Steph|No
I would really appreciate a hand in getting this imported into a Spark dataframe so that I can crack on with other parts of the analysis. Thank you!
Answer
Any delimiter-separated file is a good candidate for CSV reading methods. The ‘c’ of CSV is mostly by convention. Thus nothing stops us from reading this:
col1|col2|col3
0|1|2
1|3|8
Like this (in pure python):
import csv
from pathlib import Path

with Path("pipefile.txt").open() as f:
    reader = csv.DictReader(f, delimiter="|")
    data = list(reader)

print(data)
Since whatever custom reader your libraries are using probably uses csv.reader under the hood, you simply need to figure out how to pass the right separator to it.
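To make this concrete with the sample data from the question, here is a minimal pure-Python sketch; writing the file locally is purely for illustration, and the `datafile.txt` name is taken from the question:

```python
import csv
from pathlib import Path

# Recreate the question's sample file (illustration only).
sample = (
    "URN|Name|Supported\n"
    "12233345757777701|Tori|Yes\n"
    "32313185648456414|Dave|No\n"
    "46852554443544854|Steph|No\n"
)
path = Path("datafile.txt")
path.write_text(sample)

# csv.DictReader takes the first row as field names; delimiter="|"
# handles the pipe separation. Every value comes back as a str,
# which matches the all-StringType requirement.
with path.open() as f:
    rows = list(csv.DictReader(f, delimiter="|"))

print(rows[0])  # {'URN': '12233345757777701', 'Name': 'Tori', 'Supported': 'Yes'}
```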
@blackbishop notes in a comment that
spark.read.csv("datafile.txt", header=True, sep="|")
would be the appropriate Spark call. Since inferSchema defaults to false, every column is read as StringType, which matches the requirement in the question.