Tag: pyspark

Pyspark groupBy DataFrame without aggregation or count

Is it possible to iterate through a PySpark groupBy DataFrame without aggregation or count? For example, code in Pandas: Answer: At best you can use .first and .last to get the respective values from the groupBy, but you cannot get all of them the way you can in pandas. ex: Since there is a basic difference between the way the data is han…

Read avro files in pyspark with PyCharm

I’m quite new to Spark. I’ve imported the pyspark library into a PyCharm venv and written the code below. Everything seems to be okay, but when I try to read an Avro file I get the message: pyspark.sql.utils.AnalysisException: ‘Failed to find data source: avro. Avro is built-in but external data source module …

join two partition dataframe pyspark

I have two DataFrames with partition level 2. The DataFrames are small, probably around 100 rows each. df1: df2: My final df will be the join of df1 and df2 based on columnindex. But when I join the two DataFrames as shown below, it looks like it is shuffling and giving me incorrect results. Is there any way I can