
Comma-separated data in an RDD (PySpark): index out of bounds problem

I have a CSV file that is comma-separated. One of its columns contains data that is itself comma-separated, and each row has a different number of words in that column, hence a different number of commas. When I split the data and then access it or perform any operation such as filtering, PySpark throws errors. How should I handle this kind of data? For example, one of the columns is colors, and its entries differ per row: 1. red,blue 2. red,blue,orange. After splitting, the indices of the subsequent columns shift for every row.

The data is in tabular form.

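The original table did not survive the page conversion; a hypothetical sketch of what it might look like, using the colors example from the question (all values invented for illustration):

```
id  name   colors            price
1   shirt  red,blue          10
2   scarf  red,blue,orange   12
```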

Because the file is comma-separated, the commas inside the colors field are indistinguishable from column delimiters when the file is opened in a text editor.

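The raw snippet is also missing from this page; a file like the one described might look like this (invented values):

```
1,shirt,red,blue,10
2,scarf,red,blue,orange,12
```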

I tried a couple of operations, but neither works. How should such data be handled?

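The attempted code was not preserved either; a hypothetical reconstruction of the kind of attempt that fails this way splits each line on every comma and then indexes fixed positions (sample rows are invented):

```python
# Hypothetical sample rows: id, name, colors (variable length), price.
lines = [
    "1,shirt,red,blue,10",
    "2,scarf,red,blue,orange,12",
]

# Naive approach: split on every comma. In PySpark this would
# typically be rdd.map(lambda line: line.split(",")).
rows = [line.split(",") for line in lines]

# The rows now have different lengths, so a fixed index such as
# row[4] points at "10" (the price) in the first row but at
# "orange" (a color) in the second, and indexing past the end of
# the shorter rows raises IndexError.
print(rows[0][4])  # 10
print(rows[1][4])  # orange
```

This is exactly the index shift the question describes: every extra color pushes the remaining columns one position to the right.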


Answer

Something like this should work:

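The answer's code was not preserved on this page. One approach that works, assuming the variable-length colors column sits between a fixed number of leading and trailing columns (two before and one after in this sketch; both counts are assumptions), is to split a fixed number of fields off each end and keep whatever remains in the middle as a list:

```python
def parse_line(line, n_left=2, n_right=1):
    """Split a row whose middle column holds a variable number of
    comma-separated values.

    n_left / n_right are the counts of fixed columns before and
    after the variable column (hypothetical values for this sketch).
    """
    parts = line.split(",")
    left = parts[:n_left]
    right = parts[len(parts) - n_right:]
    colors = parts[n_left:len(parts) - n_right]
    return left + [colors] + right

# Plain-Python check; in PySpark you would apply the same function
# with rdd.map(parse_line).
row = parse_line("2,scarf,red,blue,orange,12")
print(row)  # ['2', 'scarf', ['red', 'blue', 'orange'], '12']
```

After this transformation every row has the same number of fields, so downstream indexing and filtering are stable: `row[2]` is always the list of colors and `row[3]` is always the price. If the multi-valued field is actually quoted in the file, it may be simpler to let Spark's DataFrame reader (`spark.read.csv`, which honors quoting by default) parse it instead of splitting manually.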
User contributions licensed under: CC BY-SA