Join two rows iteratively to create a new table in Spark, with one row for every two rows of the original

I have a table that I want to walk through in ranges of two rows:

id  | col b | message
1   | abc   | hello
2   | abc   | world
3   | abc 1 | morning
4   | abc   | night
... | ...   | ...
100 | abc1  | Monday
101 | abc1  | Tuesday

How do I create the table below, which goes in ranges of two rows and shows the first row's id together with the second row's col b and message, in Spark?

The final table will look like this:

id  | full message
1   | 01:02,abc,world
3   | 03:04,abc,night
... | ...
100 | 100:101,abc1,Tuesday


Answer

In PySpark you can use a Window, for example:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Frame covering the current row and the next row, ordered by id
window = Window.orderBy('id').rowsBetween(Window.currentRow, 1)

result = (df
    .withColumn('ids', F.concat_ws(':', F.first('id').over(window), F.last('id').over(window)))
    .withColumn('messages', F.concat_ws(',', F.first('col b').over(window), F.last('message').over(window)))
    .withColumn('full_message', F.concat_ws(',', 'ids', 'messages'))
    # keep only the first row of each pair, regardless of gaps in id
    .withColumn('seq_id', F.row_number().over(Window.orderBy('id')))
    .filter(F.col('seq_id') % 2 != 0)
    .select('id', 'full_message')
)

Output:

id  full_message
1   1:2,abc,world
3   3:4,abc 1,night
100 100:101,abc1,Tuesday
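
For reference, here is a minimal sketch of the setup needed to run the snippet above, assuming a local SparkSession; the DataFrame name df matches the answer, and the sample rows are taken from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').getOrCreate()

# Sample rows from the question; note that 'col b' has a space in its name
df = spark.createDataFrame(
    [(1, 'abc', 'hello'), (2, 'abc', 'world'),
     (3, 'abc 1', 'morning'), (4, 'abc', 'night'),
     (100, 'abc1', 'Monday'), (101, 'abc1', 'Tuesday')],
    ['id', 'col b', 'message'],
)

Calling result.show(truncate=False) after the snippet above then displays the result. Two caveats: Window.orderBy without a partitionBy moves all rows to a single partition, so this works for small tables but will not scale well; and the row_number-based seq_id is what keeps the pairing correct even when id has gaps (such as the jump from 4 to 100).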