
Converting pandas dataframe to PySpark dataframe drops index

I’ve got a pandas dataframe called data_clean (shown in a screenshot in the original question).

I want to convert it to a Spark dataframe, so I use the createDataFrame() method: sparkDF = spark.createDataFrame(data_clean)

However, that seems to drop the index column (the one that has the names ali, anthony, bill, etc.) from the original dataframe. When I inspect sparkDF, the index column is gone: the resulting Spark dataframe only contains the regular columns.

The docs say createDataFrame() can take a pandas.DataFrame as an input. I’m using Spark version 3.0.1.

Other questions on SO related to this don’t mention this problem of the index column disappearing.

I’m probably missing something obvious, but how do I get to keep the index column when I convert from a pandas dataframe to a PySpark dataframe?


Answer

A Spark DataFrame has no concept of an index, so if you want to preserve it, you have to turn it into a regular column first by calling reset_index on the pandas dataframe.

You can also pass inplace=True to avoid the extra memory overhead of copying the dataframe while resetting the index.

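A minimal sketch of the fix, using a made-up stand-in for data_clean (the real data is only visible in the question’s screenshot, so the column names and values here are assumptions):

```python
import pandas as pd

# Hypothetical stand-in for data_clean from the question: the names
# (ali, anthony, bill, ...) live in the index, not in a column.
data_clean = pd.DataFrame(
    {"score": [10, 20, 30]},
    index=pd.Index(["ali", "anthony", "bill"], name="name"),
)

# reset_index(inplace=True) moves the index into an ordinary "name" column,
# modifying data_clean in place instead of allocating a copy.
data_clean.reset_index(inplace=True)
print(data_clean.columns.tolist())  # ['name', 'score']

# With the index now a real column, createDataFrame() keeps it
# (assumes an active SparkSession bound to the name `spark`):
# sparkDF = spark.createDataFrame(data_clean)
# sparkDF.show()
```

If you don’t want to mutate the original dataframe, spark.createDataFrame(data_clean.reset_index()) works just as well, at the cost of one temporary copy.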
User contributions licensed under: CC BY-SA