
Converting pandas dataframe to PySpark dataframe drops index

I’ve got a pandas dataframe called data_clean (shown in a screenshot in the original question).

I want to convert it to a Spark dataframe, so I use the createDataFrame() method: sparkDF = spark.createDataFrame(data_clean)

However, that seems to drop the index column (the one that has the names ali, anthony, bill, etc.) from the original dataframe. When I inspect sparkDF, the index column is gone: the resulting Spark dataframe only contains the regular columns.

The docs say createDataFrame() can take a pandas.DataFrame as an input. I’m using Spark version 3.0.1.

Other questions on SO related to this don’t mention this problem of the index column disappearing.

I’m probably missing something obvious, but how do I get to keep the index column when I convert from a pandas dataframe to a PySpark dataframe?


Answer

A Spark DataFrame has no concept of an index, so if you want to preserve it, you have to turn it into a regular column first by calling reset_index on the pandas dataframe.

You can also pass inplace=True to avoid the extra memory overhead of copying the dataframe while resetting the index.

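A minimal sketch of the fix, using a made-up stand-in for data_clean (the real data is only visible in the question’s screenshot, so the column names and values here are assumptions):

```python
import pandas as pd

# Hypothetical stand-in for data_clean from the question: the names
# (ali, anthony, bill, ...) live in the index, not in a column.
data_clean = pd.DataFrame(
    {"score": [10, 20, 30]},
    index=pd.Index(["ali", "anthony", "bill"], name="name"),
)

# reset_index(inplace=True) moves the index into an ordinary "name" column,
# modifying data_clean in place instead of allocating a copy.
data_clean.reset_index(inplace=True)
print(data_clean.columns.tolist())  # ['name', 'score']

# With the index now a real column, createDataFrame() keeps it
# (assumes an active SparkSession bound to the name `spark`):
# sparkDF = spark.createDataFrame(data_clean)
# sparkDF.show()
```

If you don’t want to mutate the original dataframe, spark.createDataFrame(data_clean.reset_index()) works just as well, at the cost of one temporary copy.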
User contributions licensed under: CC BY-SA