I have a table that I want to walk through two rows at a time:
id  | col b | message
1   | abc   | hello
2   | abc   | world
3   | abc 1 | morning
4   | abc   | night
... | ...   | ...
100 | abc1  | Monday
101 | abc1  | Tuesday
How do I create the table below in Spark? It should take the rows in pairs and show the first id of each pair together with the second row's col b and message.
The final table should look like this:
id  | full message
1   | 01:02,abc,world
3   | 03:04,abc,night
... | ...
100 | 100:101,abc1,Tuesday
Answer
In PySpark you can use a Window whose frame covers the current row and the next one, then keep every other row. For example:
from pyspark.sql import Window, functions as F

# Frame = the current row plus the one after it, in id order.
# There is no partitionBy, so Spark moves all rows to a single partition.
window = Window.orderBy('id').rowsBetween(Window.currentRow, 1)

(df
    # first and last id in the frame, e.g. "1:2"
    .withColumn('ids', F.concat_ws(':', F.first('id').over(window), F.last('id').over(window)))
    # col b and message both come from the second row of the pair
    .withColumn('messages', F.concat_ws(',', F.last('col b').over(window), F.last('message').over(window)))
    .withColumn('full_message', F.concat_ws(',', 'ids', 'messages'))
    # select only the first entries, regardless of the id
    .withColumn('seq_id', F.row_number().over(Window.orderBy('id')))
    .filter(F.col('seq_id') % 2 != 0)
    .select('id', 'full_message')
)

Output:
id   full_message
1    1:2,abc,world
3    3:4,abc,night
100  100:101,abc1,Tuesday
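To sanity-check the expected result without a Spark session, the pairing rule from the question can be sketched in plain Python. This only illustrates the logic (first id of each pair, second row's col b and message, ids zero-padded to two digits as in the question's table) and is not a replacement for the Window approach. The function name `pair_rows` and the `rows` list are hypothetical, and integer ids are assumed.

```python
def pair_rows(rows):
    """Combine consecutive rows two at a time.

    rows: list of (id, col_b, message) tuples, already sorted by id.
    Returns a list of (id, full_message) tuples.
    """
    result = []
    # Step through the rows two at a time: indexes 0, 2, 4, ...
    for i in range(0, len(rows), 2):
        first = rows[i]
        # With an odd number of rows the last one has no partner,
        # so it is paired with itself (like a one-row window frame).
        second = rows[i + 1] if i + 1 < len(rows) else first
        # Zero-pad each id to at least two digits; longer ids are kept whole.
        ids = f"{first[0]:02d}:{second[0]:02d}"
        result.append((first[0], f"{ids},{second[1]},{second[2]}"))
    return result


# Sample rows from the question (the elided middle rows are left out).
rows = [
    (1, "abc", "hello"),
    (2, "abc", "world"),
    (3, "abc 1", "morning"),
    (4, "abc", "night"),
    (100, "abc1", "Monday"),
    (101, "abc1", "Tuesday"),
]

for row in pair_rows(rows):
    print(row)
```

Note that the Window version does not zero-pad the ids. If the padding matters and `id` is an integer column, something like `F.format_string('%02d:%02d', ...)` in place of `F.concat_ws(':', ...)` is one option to try; treat that as an untested suggestion.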