Let’s say I have the following DataFrame:
user, ticker, date
u1, AAPL, 2021-07-07
u1, MSFT, 2021-07-07
u1, GOOG, 2021-07-07
u2, TSLA, 2021-07-07
u2, NFLX, 2021-07-07
u2, AMZN, 2021-07-07
and by design each user has 3 rows. I want to turn my DataFrame into:
user, ticker_1, date_1, ticker_2, date_2, ticker_3, date_3
u1, AAPL, 2021-07-07, MSFT, 2021-07-07, GOOG, 2021-07-07
u2, TSLA, 2021-07-07, NFLX, 2021-07-07, AMZN, 2021-07-07
I was trying groupBy(col('user')) and then pivot by ticker, but that returns as many columns as there are distinct tickers, whereas I want a fixed number of columns. Is there another Spark operator I could use for that?
Using PySpark and Azure Databricks.
Answer
If the order doesn’t matter, you can use row_number to number each user’s rows and then pivot on that rank:
import pyspark.sql.functions as F
from pyspark.sql import Window

# Number each user's rows 1, 2, 3 (here ordered alphabetically by ticker)
df = df.withColumn(
    'rank',
    F.row_number().over(Window.partitionBy('user').orderBy('ticker'))
)

# Pivot on the rank so each user collapses into a single row
df = df.groupBy('user').pivot('rank').agg(
    F.first('ticker').alias('ticker'),
    F.first('date').alias('date')
)
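After the pivot, Spark typically names the output columns like 1_ticker, 1_date, 2_ticker, and so on (pivot value plus aggregation alias). If you want the exact ticker_1, date_1 layout from the question, a short rename pass like the sketch below should do it; it assumes that default naming and the fixed 3 rows per user.

# Minimal rename sketch, assuming columns came out as '1_ticker', '1_date', ...
for i in range(1, 4):  # 3 rows per user by design
    df = (df.withColumnRenamed(f'{i}_ticker', f'ticker_{i}')
            .withColumnRenamed(f'{i}_date', f'date_{i}'))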