I am trying to convert a SAS PROC TRANSPOSE statement to PySpark in Databricks. With the following data as a sample:
```python
data = [{"duns": 1234, "finc stress": 100, "ver": 6.0},
        {"duns": 1234, "finc stress": 125, "ver": 7.0},
        {"duns": 1234, "finc stress": 135, "ver": 7.1},
        {"duns": 12345, "finc stress": 125, "ver": 7.6}]
```
I would expect the result to look like this (one row per duns, one column per ver):

```
duns   ver6.0  ver7.0  ver7.1  ver7.6
1234   100.0   125.0   135.0   null
12345  null    null    null    125.0
```
I tried using the pandas pivot_table() function with the following code; however, I ran into performance issues given the size of the data:
```python
tst = (df.pivot_table(index=['duns'], columns=['ver'], values='finc stress')
         .add_prefix('ver')
         .reset_index())
```
Is there a way to translate the SAS PROC TRANSPOSE logic to PySpark instead of using pandas?
I tried something like this but got an error:
```python
tst = sparkdf.groupBy('duns').pivot('ver').agg('finc_stress').withColumn('ver')
```

```
AssertionError: all exprs should be Column
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<command-2507760044487307> in <module>
      4 df = pd.DataFrame(data)  # pandas
      5
----> 6 tst = sparkdf.groupBy('duns').pivot('ver').agg('finc_stress').withColumn('ver')
      7
      8

/databricks/spark/python/pyspark/sql/group.py in agg(self, *exprs)
    115         else:
    116             # Columns
--> 117             assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"
    118             jdf = self._jgd.agg(exprs[0]._jc,
    119                 _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))

AssertionError: all exprs should be Column
```
If you could help me out I would so appreciate it! Thank you so much.
Answer
I don’t know how you create df from data, but here is what I did:
```python
import pyspark.pandas as ps

df = ps.DataFrame(data)
df['ver'] = df['ver'].astype('str')
```
Then your pandas code worked.
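For comparison, here is the same pivot on the sample data in plain pandas (a small-data sketch; the pyspark.pandas API above mirrors it), with ver cast to string so the pivoted columns get clean labels:

```python
import pandas as pd

data = [{"duns": 1234, "finc stress": 100, "ver": 6.0},
        {"duns": 1234, "finc stress": 125, "ver": 7.0},
        {"duns": 1234, "finc stress": 135, "ver": 7.1},
        {"duns": 12345, "finc stress": 125, "ver": 7.6}]

df = pd.DataFrame(data)
df["ver"] = df["ver"].astype(str)  # string versions -> column labels like "6.0"

# One row per duns, one "ver"-prefixed column per version
tst = (df.pivot_table(index=["duns"], columns=["ver"], values="finc stress")
         .add_prefix("ver")
         .reset_index())
```

Versions missing for a given duns come out as NaN, matching the expected wide layout.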
To use PySpark method, here is what I did:
```python
import pyspark.sql.functions as F

sparkdf.groupBy('duns').pivot('ver').agg(F.first('finc stress'))
```