Pyspark groupBy DataFrame without aggregation or count

Question

Is it possible to iterate through a PySpark groupBy DataFrame without aggregation or count, the way the groups of a groupby can be iterated over in pandas?

Accepted Answer

At best you can use .first and .last to get the respective values from the groupBy, but not all of the rows in the way you can in pandas. For example:

    from pyspark.sql import functions as f

    df.groupBy(df['some_col']).agg(f.first(df['col1']), f.first(df['col2'])).show()

Since there is a basic difference between the way data is handled in pandas and in Spark, not all functionality can be used in the same way.

There are a few workarounds to get what you want. For example, for the diamonds DataFrame:

    +---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
    |_c0|carat|      cut|color|clarity|depth|table|price|   x|   y|   z|
    +---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
    |  1| 0.23|    Ideal|    E|    SI2| 61.5| 55.0|  326|3.95|3.98|2.43|
    |  2| 0.21|  Premium|    E|    SI1| 59.8| 61.0|  326|3.89|3.84|2.31|
    |  3| 0.23|     Good|    E|    VS1| 56.9| 65.0|  327|4.05|4.07|2.31|
    |  4| 0.29|  Premium|    I|    VS2| 62.4| 58.0|  334| 4.2|4.23|2.63|
    |  5| 0.31|     Good|    J|    SI2| 63.3| 58.0|  335|4.34|4.35|2.75|
    +---+-----+---------+-----+-------+-----+-----+-----+----+----+----+

You can use:

    import pyspark.sql.functions as f

    # Collect the distinct values of the grouping column on the driver.
    l = [x.cut for x in diamonds.select("cut").distinct().rdd.collect()]

    def groups(df, i):
        # Return only the rows that belong to group i.
        return df.filter(f.col("cut") == i)

    # For multi grouping
    def groups_multi(df, i):
        return df.filter((f.col("cut") == i) & (f.col("color") == 'E'))
    # Use | for or.

    for i in l:
        groups(diamonds, i).show(2)

Output:

    +---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
    |_c0|carat|    cut|color|clarity|depth|table|price|   x|   y|   z|
    +---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
    |  2| 0.21|Premium|    E|    SI1| 59.8| 61.0|  326|3.89|3.84|2.31|
    |  4| 0.29|Premium|    I|    VS2| 62.4| 58.0|  334| 4.2|4.23|2.63|
    +---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
    only showing top 2 rows

    +---+-----+-----+-----+-------+-----+-----+-----+----+----+----+
    |_c0|carat|  cut|color|clarity|depth|table|price|   x|   y|   z|
    +---+-----+-----+-----+-------+-----+-----+-----+----+----+----+
    |  1| 0.23|Ideal|    E|    SI2| 61.5| 55.0|  326|3.95|3.98|2.43|
    | 12| 0.23|Ideal|    J|    VS1| 62.8| 56.0|  340|3.93| 3.9|2.46|
    +---+-----+-----+-----+-------+-----+-----+-----+----+----+----+
    ...

In the function groups you can decide what kind of grouping you want for the data. It is a simple filter condition, but it will get you all the groups separately.

