Comma separated data in rdd (pyspark) indices out of bound problem

Question

I have a csv file which is comma separated. One of the columns has data which is again comma separated. Each row in that specific column has different no of words , hence different number of commas. When I access or perform any sort of operation like filtering (after splitting the data) it throws errors in py…

Accepted Answer

Something like this should work:# Read data and add a row indexrdd = sc.textFile("example.txt").zipWithIndex()# Get first row - columnscolumns = rdd.filter(lambda x: x[1] == 0).map(lambda x: x[0].split(",")).collect()[0]# Get actual data - all the other rowsdata = rdd.filter(lambda x: x[1] > 0).map(lambda x: x[0].split(","))# Split out data rows into fields and covert to a DFdata = data.map(lambda x: (x[0], x[1], ",".join(x[2:-1]), x[-1])).toDF(schema=columns)data.show()+---+--------+----------------+-----+| id|category|           color|price|+---+--------+----------------+-----+|  1|       a|        red,blue| 2000||  2|       b|           black| 5000||  3|       c|green,black,blue| 3000|+---+--------+----------------+-----+

Advertisement

Answer