Input (2 columns):

```
col1 , col2
David, 100
"Ronald Sr,
Ron ,
Ram" , 200
Harry potter
jr" , 200
Prof.
Snape" , 100
```
Note: the Harry and Prof. rows do not have starting quotes.
Output (2 columns):

```
col1 | col2
David | 100
Ronald Sr , Ron , Ram| 200
Harry potter jr| 200
Prof. Snape| 100
```
What I tried (PySpark):

```
df = spark.read.format("csv") \
          .option("header", True) \
          .option("multiLine", True) \
          .option("escape", "'") \
          .load("sample.csv")
```
Issue: the above code worked fine where the multiline value had both start and end double quotes (e.g. the row starting with Ronald), but it didn't work for rows that only have end quotes and no start quotes (like Harry and Prof).
Even a fix that just adds the missing start quotes to the Harry and Prof rows would solve the issue.
Any ideas using PySpark, Python, shell, etc. are welcome!
Answer
Based solely on the small sample provided:
- remove all double quotes
- there are two comma-delimited fields; 1st field is a string, 2nd field is a number
- the 1st field may contain commas and may be broken across multiple lines
- replace the comma delimiter with a pipe (`|`)
- OP's expected output is inconsistent with regard to spacing before the newly inserted pipe (`|`); sometimes a space is removed, sometimes a space is inserted; for now we won't worry about spacing
One awk idea:

```
awk -F, '
{ gsub(/"/,"") }                       # remove double quotes

FNR==1 ||                              # if 1st line or last field is a number then ...
($NF+0)==$NF {
    print prev gensub(FS,"|",(NF-1))   # print any previous line(s) data plus current line, replacing last comma with a pipe
    prev=""                            # clear previous line(s) data
    next                               # skip to next line of input
}

{ prev= prev $0 " " }                  # if we get here then this is a broken line so save contents for later printing
' sample.csv
```

Note that `gensub()` is a GNU awk extension.
This generates:

```
col1 | col2
David| 100
Ronald Sr, Ron , Ram | 200
Harry potter jr | 200
Prof. Snape | 100
```
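Since OP also welcomed Python, the same join-and-replace logic can be sketched in plain Python. The `fix_csv` helper name and the inline `sample` data are illustrative assumptions, not part of the original post:

```python
import re

def fix_csv(lines):
    """Sketch of the awk logic in plain Python: strip double quotes,
    glue broken lines onto the record they belong to, and swap the
    final comma delimiter for a pipe."""
    out, prev = [], ""
    for i, raw in enumerate(lines):
        line = raw.rstrip("\n").replace('"', "")   # remove double quotes
        last = line.split(",")[-1].strip()
        if i == 0 or re.fullmatch(r"\d+", last):   # header, or record ends in a number
            head, _, tail = line.rpartition(",")   # replace the last comma with a pipe
            out.append(prev + head + "|" + tail)
            prev = ""                              # clear saved broken-line data
        else:
            prev += line + " "                     # broken line: save for later
    return out

sample = [
    'col1 , col2',
    'David, 100',
    '"Ronald Sr,',
    'Ron ,',
    'Ram" , 200',
    'Harry potter',
    'jr" , 200',
    'Prof.',
    'Snape" , 100',
]
print("\n".join(fix_csv(sample)))
```

Running it prints the same five pipe-delimited records as the awk version.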