Let’s say I have the DataFrame below (note: the u2/u3/u4 rows are shown here as a single user u2, consistent with the "3 rows per user" design and the desired output that follows):

```
user, ticker, date
u1, AAPL, 2021-07-07
u1, MSFT, 2021-07-07
u1, GOOG, 2021-07-07
u2, TSLA, 2021-07-07
u2, NFLX, 2021-07-07
u2, AMZN, 2021-07-07
```
and by design each user has 3 rows. I want to turn my DataFrame into:
```
user, ticker_1, date_1, ticker_2, date_2, ticker_3, date_3
u1, AAPL, 2021-07-07, MSFT, 2021-07-07, GOOG, 2021-07-07
u2, TSLA, 2021-07-07, NFLX, 2021-07-07, AMZN, 2021-07-07
```
I tried `groupBy(col('user'))` followed by `pivot` on `ticker`, but that returns one column per distinct ticker, whereas I want a fixed number of columns. Is there another Spark operator I could use for this?
Using PySpark and Azure Databricks.
Answer
If the order doesn’t matter, you can use `row_number` to number the rows within each user, then pivot on that rank:
```python
import pyspark.sql.functions as F
from pyspark.sql import Window

# Number each user's rows 1..3 (ordered by ticker here; pick any deterministic order)
df = df.withColumn('rank', F.row_number().over(Window.partitionBy('user').orderBy('ticker')))

# Pivot on the rank so each user collapses to one row.
# With multiple aggregations, Spark names the columns 1_ticker, 1_date, 2_ticker, ...;
# rename them afterwards if you need exactly ticker_1, date_1, etc.
df = df.groupBy('user').pivot('rank').agg(F.first('ticker').alias('ticker'), F.first('date').alias('date'))
```
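For readers without a Spark cluster at hand, the same rank-then-pivot idea can be sketched in pandas. This is an illustration of the technique, not the PySpark answer itself; the sample data mirrors the question, and the rank comes from `cumcount` (original row order) rather than ordering by ticker:

```python
import pandas as pd

# Sample data mirroring the question (assumed, for illustration only)
df = pd.DataFrame({
    'user':   ['u1', 'u1', 'u1', 'u2', 'u2', 'u2'],
    'ticker': ['AAPL', 'MSFT', 'GOOG', 'TSLA', 'NFLX', 'AMZN'],
    'date':   ['2021-07-07'] * 6,
})

# Number rows within each user: the pandas analogue of row_number over a window
df['rank'] = df.groupby('user').cumcount() + 1

# Pivot so each rank becomes a (ticker, date) pair of columns
wide = df.pivot(index='user', columns='rank', values=['ticker', 'date'])

# Flatten the MultiIndex columns into ticker_1, date_1, ticker_2, ...
wide.columns = [f'{col}_{rank}' for col, rank in wide.columns]
wide = wide.reset_index()
```

The result has one row per user with a fixed set of six value columns, matching the desired shape regardless of which tickers appear.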