Have a table where I want to go in range of two rows
id | col b | message 1 | abc | hello | 2 | abc | world | 3 | abc 1| morning| 4 | abc | night | ...|... | .... | 100| abc1 | Monday | 101| abc1 | Tuesday|
How to I create below table that goes in a range of two and shows the first id with the second col b and message in spark.
Final table will look like this.
id | full message 1 | 01:02,abc,world 3 | 03:04,abc,night .. |................ 100| 100:101,abc1,Tuesday
Advertisement
Answer
In pyspark you can use Window, example
window = Window.orderBy('id').rowsBetween(Window.currentRow, 1) (df .withColumn('ids', F.concat_ws(':', F.first('id').over(window), F.last('id').over(window))) .withColumn('messages', F.concat_ws(',', F.first('col b').over(window), F.last('message').over(window))) .withColumn('full_message', F.concat_ws(',', 'ids', 'messages')) # select only the first entries, regardless of the id .withColumn('seq_id', F.row_number().over(Window.orderBy('id'))) .filter(F.col('seq_id') % 2 != 0) .select('id', 'full_message') )
Output:
id full_message 1 1:2,abc,world 3 3:4,abc 1,night 100 100:101,abc1,Tuesday