Is it possible to iterate through a PySpark groupBy DataFrame without aggregation or count?
For example, this is the code in pandas:
    for i, d in df2:
        mycode ....
Is there a different way to iterate over a groupBy in PySpark, or do I have to use aggregation and count?
Answer
At best, you can use f.first and f.last to get the respective values from each group of the groupBy, but you cannot get all the rows of a group the way you can in pandas.
For example:
    from pyspark.sql import functions as f

    df.groupBy(df['some_col']).agg(f.first(df['col1']), f.first(df['col2'])).show()
Since there is a basic difference between the way data is handled in pandas and in Spark, not all functionality can be used in the same way.
There are a few workarounds to get what you want, such as the following.
For the diamonds DataFrame:
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat|      cut|color|clarity|depth|table|price|   x|   y|   z|
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|  1| 0.23|    Ideal|    E|    SI2| 61.5| 55.0|  326|3.95|3.98|2.43|
|  2| 0.21|  Premium|    E|    SI1| 59.8| 61.0|  326|3.89|3.84|2.31|
|  3| 0.23|     Good|    E|    VS1| 56.9| 65.0|  327|4.05|4.07|2.31|
|  4| 0.29|  Premium|    I|    VS2| 62.4| 58.0|  334| 4.2|4.23|2.63|
|  5| 0.31|     Good|    J|    SI2| 63.3| 58.0|  335|4.34|4.35|2.75|
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
You can use:
    import pyspark.sql.functions as f

    # Collect the distinct values of the grouping column to the driver
    l = [x.cut for x in diamonds.select("cut").distinct().collect()]

    def groups(df, i):
        # Keep only the rows that belong to group i
        return df.filter(f.col("cut") == i)

    # for multi grouping
    def groups_multi(df, i):
        # use | for or
        return df.filter((f.col("cut") == i) & (f.col("color") == 'E'))

    for i in l:
        groups(diamonds, i).show(2)
Output:
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat|    cut|color|clarity|depth|table|price|   x|   y|   z|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|  2| 0.21|Premium|    E|    SI1| 59.8| 61.0|  326|3.89|3.84|2.31|
|  4| 0.29|Premium|    I|    VS2| 62.4| 58.0|  334| 4.2|4.23|2.63|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
only showing top 2 rows

+---+-----+-----+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat|  cut|color|clarity|depth|table|price|   x|   y|   z|
+---+-----+-----+-----+-------+-----+-----+-----+----+----+----+
|  1| 0.23|Ideal|    E|    SI2| 61.5| 55.0|  326|3.95|3.98|2.43|
| 12| 0.23|Ideal|    J|    VS1| 62.8| 56.0|  340|3.93| 3.9|2.46|
+---+-----+-----+-----+-------+-----+-----+-----+----+----+----+
...
In the groups function you can decide what kind of grouping you want for the data. It is a simple filter condition, but it will get you all the groups separately.
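If you want the loop itself to look like the pandas one, the same distinct-then-filter idea can be wrapped in a generator. This is only a rough sketch: iter_groups is a made-up helper name, not part of the PySpark API, and it assumes df is a regular PySpark DataFrame and that the number of distinct keys is small enough to collect to the driver.

```python
def iter_groups(df, column):
    """Yield (key, sub_dataframe) pairs, one per distinct value of `column`.

    Hypothetical helper (not a PySpark API); `df` is assumed to be a
    pyspark.sql.DataFrame.
    """
    # Bring only the distinct key values back to the driver.
    keys = [row[0] for row in df.select(column).distinct().collect()]
    for key in keys:
        # `== None` never matches in Spark SQL, so null keys need isNull().
        if key is None:
            condition = df[column].isNull()
        else:
            condition = df[column] == key
        # Each group is still a lazy Spark DataFrame built from a plain filter.
        yield key, df.filter(condition)
```

With the diamonds DataFrame above, for cut, group in iter_groups(diamonds, "cut"): group.show(2) should print the same per-cut tables. Keep in mind that every group is a separate filter over the full data, so with many groups it may be worth caching the DataFrame first.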