Input (2 columns):

```
col1 , col2
David, 100
"Ronald Sr,
Ron ,
Ram" , 200
Harry potter
jr" , 200
Prof.
Snape" , 100
```
Note: the Harry and Prof. rows do not have starting quotes.
Output (2 columns):

```
col1 | col2
David | 100
Ronald Sr , Ron , Ram| 200
Harry potter jr| 200
Prof. Snape| 100
```
What I tried (PySpark):

```
df = spark.read.format("csv") \
          .option("header", True) \
          .option("multiLine", True) \
          .option("escape", "'") \
          .load("sample.csv")
```
Issue: the above code worked fine where the multiline value had both start and end double quotes (e.g. the row starting with Ronald), but it didn't work for rows that only have end quotes and no start quotes (like Harry and Prof).
Even a fix that just adds the missing start quotes to the Harry and Prof rows would solve the issue.
Any ideas using PySpark, Python, shell, etc. are welcome!
Answer
Based solely on the small sample provided:
- remove all double quotes
- there are two comma-delimited fields; 1st field is a string, 2nd field is a number
- the 1st field may contain commas and may be broken across multiple lines
- replace the comma delimiter with a pipe (`|`)
- OP's expected output is inconsistent with regard to spacing before the newly inserted pipe (`|`); sometimes a space is removed, sometimes a space is inserted; for now we won't worry about spacing
One awk idea:

```
awk -F, '
{ gsub(/"/,"") }                       # remove double quotes

FNR==1 ||                              # if 1st line or last field is a number then ...
($NF+0)==$NF {
    print prev gensub(FS,"|",(NF-1))   # print any previous line(s) data plus current line, replacing last comma with a pipe
    prev=""                            # clear previous line(s) data
    next                               # skip to next line of input
}

{ prev= prev $0 " " }                  # if we get here then this is a broken line so save contents for later printing
' sample.csv
```

Note that `gensub()` is a GNU awk extension.
This generates:

```
col1 | col2
David| 100
Ronald Sr, Ron , Ram | 200
Harry potter jr | 200
Prof. Snape | 100
```
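Since OP also welcomed Python, the same join-and-replace logic can be sketched in plain Python. The `fix_csv` helper name and the inline `sample` data are illustrative assumptions, not part of the original post:

```python
import re

def fix_csv(lines):
    """Sketch of the awk logic in plain Python: strip double quotes,
    glue broken lines onto the record they belong to, and swap the
    final comma delimiter for a pipe."""
    out, prev = [], ""
    for i, raw in enumerate(lines):
        line = raw.rstrip("\n").replace('"', "")   # remove double quotes
        last = line.split(",")[-1].strip()
        if i == 0 or re.fullmatch(r"\d+", last):   # header, or record ends in a number
            head, _, tail = line.rpartition(",")   # replace the last comma with a pipe
            out.append(prev + head + "|" + tail)
            prev = ""                              # clear saved broken-line data
        else:
            prev += line + " "                     # broken line: save for later
    return out

sample = [
    'col1 , col2',
    'David, 100',
    '"Ronald Sr,',
    'Ron ,',
    'Ram" , 200',
    'Harry potter',
    'jr" , 200',
    'Prof.',
    'Snape" , 100',
]
print("\n".join(fix_csv(sample)))
```

Running it prints the same five pipe-delimited records as the awk version.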