Import pipe delimited txt file into spark dataframe in databricks

I have a data file saved in .txt format which has a header row at the top and is pipe delimited. I am working in Databricks and need to create a Spark dataframe of this data, with all columns read in as StringType(), the headers defined by the first row, and the columns separated based on the pipe delimiter.

When importing .csv files I am able to set the delimiter and header options. However, I am not able to get the .txt files to import in the same way.

Example Data (completely made up)… for ease, please imagine it is just called datafile.txt:

URN|Name|Supported
12233345757777701|Tori|Yes
32313185648456414|Dave|No
46852554443544854|Steph|No

I would really appreciate a hand in getting this imported into a Spark dataframe so that I can crack on with other parts of the analysis. Thank you!

Answer

Any delimiter-separated file is a good candidate for csv reading methods. The ‘c’ of csv is mostly by convention, so nothing stops us from reading this:

col1|col2|col3
0|1|2
1|3|8

Like this (in pure python):

import csv
from pathlib import Path

# DictReader takes the field names from the first row;
# delimiter="|" swaps the default comma for a pipe.
with Path("pipefile.txt").open() as f:
    reader = csv.DictReader(f, delimiter="|")
    data = list(reader)  # one dict per data row, all values as strings
print(data)
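
On Python 3.8+ (where DictReader yields plain dicts), running this against the three-row sample above prints every value as a string:

[{'col1': '0', 'col2': '1', 'col3': '2'}, {'col1': '1', 'col2': '3', 'col3': '8'}]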

Since whatever custom reader your libraries are using probably uses csv.reader under the hood, you simply need to figure out how to pass the right separator to it.

@blackbishop notes in a comment that

spark.read.csv("datafile.txt", header=True, sep="|")

would be the appropriate spark call.
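
For completeness, here is a minimal sketch of the full read in a Databricks notebook, where the spark session is already defined. Because inferSchema defaults to false, every column comes back as StringType(), which is exactly what the question asks for. The path is an assumption; point it at wherever datafile.txt actually lives.

# Read the pipe-delimited file, taking headers from the first row.
df = spark.read.csv("datafile.txt", header=True, sep="|")

df.printSchema()
# root
#  |-- URN: string (nullable = true)
#  |-- Name: string (nullable = true)
#  |-- Supported: string (nullable = true)

df.show(truncate=False)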

User contributions licensed under: CC BY-SA