Let’s say I have the DataFrame below (note: the u2/u3/u4 rows are shown here as a single user u2, consistent with the "3 rows per user" design and the desired output that follows):

```
user, ticker, date
u1, AAPL, 2021-07-07
u1, MSFT, 2021-07-07
u1, GOOG, 2021-07-07
u2, TSLA, 2021-07-07
u2, NFLX, 2021-07-07
u2, AMZN, 2021-07-07
```
and by design each user has 3 rows. I want to turn my DataFrame into:
```
user, ticker_1, date_1, ticker_2, date_2, ticker_3, date_3
u1, AAPL, 2021-07-07, MSFT, 2021-07-07, GOOG, 2021-07-07
u2, TSLA, 2021-07-07, NFLX, 2021-07-07, AMZN, 2021-07-07
```
I tried `groupBy(col('user'))` followed by `pivot` on `ticker`, but that returns one column per distinct ticker, whereas I want a fixed number of columns. Is there another Spark operator I could use for this?
Using PySpark and Azure Databricks.
Answer
If the order doesn’t matter, you can use `row_number` to number the rows within each user, then pivot on that rank:
```python
import pyspark.sql.functions as F
from pyspark.sql import Window

# Number each user's rows 1..3 (ordered by ticker here; pick any deterministic order)
df = df.withColumn('rank', F.row_number().over(Window.partitionBy('user').orderBy('ticker')))

# Pivot on the rank so each user collapses to one row.
# With multiple aggregations, Spark names the columns 1_ticker, 1_date, 2_ticker, ...;
# rename them afterwards if you need exactly ticker_1, date_1, etc.
df = df.groupBy('user').pivot('rank').agg(F.first('ticker').alias('ticker'), F.first('date').alias('date'))
```
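For readers without a Spark cluster at hand, the same rank-then-pivot idea can be sketched in pandas. This is an illustration of the technique, not the PySpark answer itself; the sample data mirrors the question, and the rank comes from `cumcount` (original row order) rather than ordering by ticker:

```python
import pandas as pd

# Sample data mirroring the question (assumed, for illustration only)
df = pd.DataFrame({
    'user':   ['u1', 'u1', 'u1', 'u2', 'u2', 'u2'],
    'ticker': ['AAPL', 'MSFT', 'GOOG', 'TSLA', 'NFLX', 'AMZN'],
    'date':   ['2021-07-07'] * 6,
})

# Number rows within each user: the pandas analogue of row_number over a window
df['rank'] = df.groupby('user').cumcount() + 1

# Pivot so each rank becomes a (ticker, date) pair of columns
wide = df.pivot(index='user', columns='rank', values=['ticker', 'date'])

# Flatten the MultiIndex columns into ticker_1, date_1, ticker_2, ...
wide.columns = [f'{col}_{rank}' for col, rank in wide.columns]
wide = wide.reset_index()
```

The result has one row per user with a fixed set of six value columns, matching the desired shape regardless of which tickers appear.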