Converting pandas dataframe to PySpark dataframe drops index

I’ve got a pandas dataframe called data_clean, with a single transcript column and an index containing names (ali, anthony, bill, etc.).

I want to convert it to a Spark dataframe, so I use the createDataFrame() method: sparkDF = spark.createDataFrame(data_clean)

However, that seems to drop the index column (the one that has the names ali, anthony, bill, etc) from the original dataframe. The output of

sparkDF.printSchema()
sparkDF.show()

is

root
 |-- transcript: string (nullable = true)

+--------------------+
|          transcript|
+--------------------+
|ladies and gentle...|
|thank you thank y...|
| all right thank ...|
|                    |
|this is dave he t...|
|                    |
|   ladies and gen...|
|   ladies and gen...|
|armed with boyish...|
|introfade the mus...|
|wow hey thank you...|
|hello hello how y...|
+--------------------+

The docs say createDataFrame() can take a pandas.DataFrame as an input. I’m using Spark version ‘3.0.1’.

Other questions on SO related to this don’t mention this problem of the index column disappearing.

I’m probably missing something obvious, but how do I get to keep the index column when I convert from a pandas dataframe to a PySpark dataframe?


Answer

A Spark DataFrame has no concept of an index, so if you want to preserve it, you first have to assign it to a regular column using reset_index on the pandas dataframe.

You can also pass inplace=True to avoid additional memory overhead while resetting the index:

df.reset_index(drop=False, inplace=True)

sparkDF = spark.createDataFrame(df)
User contributions licensed under: CC BY-SA