Is it possible to iterate through a PySpark groupBy DataFrame without using aggregation or count?
For example, in pandas:

    for i, d in df2:
        mycode .

^^ if using pandas ^^
Is there a different way to iterate over a groupBy in PySpark, or do I have to use aggregation and count?
Answer
At best you can use .first and .last to get the respective values from the groupBy, but you cannot get all the rows of each group the way you can in pandas.
For example:

    from pyspark.sql import functions as f
    df.groupBy(df['some_col']).agg(f.first(df['col1']), f.first(df['col2'])).show()
Since there is a basic difference between the way data is handled in pandas and in Spark, not all functionality can be used in the same way.
There are a few workarounds to get what you want, such as the following.
For the diamonds DataFrame:

    +---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
    |_c0|carat|      cut|color|clarity|depth|table|price|   x|   y|   z|
    +---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
    |  1| 0.23|    Ideal|    E|    SI2| 61.5| 55.0|  326|3.95|3.98|2.43|
    |  2| 0.21|  Premium|    E|    SI1| 59.8| 61.0|  326|3.89|3.84|2.31|
    |  3| 0.23|     Good|    E|    VS1| 56.9| 65.0|  327|4.05|4.07|2.31|
    |  4| 0.29|  Premium|    I|    VS2| 62.4| 58.0|  334| 4.2|4.23|2.63|
    |  5| 0.31|     Good|    J|    SI2| 63.3| 58.0|  335|4.34|4.35|2.75|
    +---+-----+---------+-----+-------+-----+-----+-----+----+----+----+

You can use:

    import pyspark.sql.functions as f

    # Collect the distinct values of the grouping column to the driver
    l = [x.cut for x in diamonds.select("cut").distinct().collect()]

    def groups(df, i):
        # Keep only the rows that belong to the group with cut == i
        return df.filter(f.col("cut") == i)

    # for multi grouping
    def groups_multi(df, i):
        # use | for or
        return df.filter((f.col("cut") == i) & (f.col("color") == 'E'))

    for i in l:
        groups(diamonds, i).show(2)

Output:

    +---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
    |_c0|carat|    cut|color|clarity|depth|table|price|   x|   y|   z|
    +---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
    |  2| 0.21|Premium|    E|    SI1| 59.8| 61.0|  326|3.89|3.84|2.31|
    |  4| 0.29|Premium|    I|    VS2| 62.4| 58.0|  334| 4.2|4.23|2.63|
    +---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
    only showing top 2 rows

    +---+-----+-----+-----+-------+-----+-----+-----+----+----+----+
    |_c0|carat|  cut|color|clarity|depth|table|price|   x|   y|   z|
    +---+-----+-----+-----+-------+-----+-----+-----+----+----+----+
    |  1| 0.23|Ideal|    E|    SI2| 61.5| 55.0|  326|3.95|3.98|2.43|
    | 12| 0.23|Ideal|    J|    VS1| 62.8| 56.0|  340|3.93| 3.9|2.46|
    +---+-----+-----+-----+-------+-----+-----+-----+----+----+----+

In the groups function you can decide what kind of grouping you want for the data. It is a simple filter condition, but it gives you each group as a separate DataFrame.