I’ve got a pandas dataframe called data_clean. It’s indexed by name (ali, anthony, bill, etc.) and has a single transcript column.
I want to convert it to a Spark dataframe, so I use the createDataFrame() method:
sparkDF = spark.createDataFrame(data_clean)
However, that seems to drop the index column (the one with the names) from the original dataframe. The output of

sparkDF.printSchema()
sparkDF.show()

is:
root
 |-- transcript: string (nullable = true)

+--------------------+
|          transcript|
+--------------------+
|ladies and gentle...|
|thank you thank y...|
| all right thank ...|
|                    |
|this is dave he t...|
|                    |
|   ladies and gen...|
|   ladies and gen...|
|armed with boyish...|
|introfade the mus...|
|wow hey thank you...|
|hello hello how y...|
+--------------------+
The docs say createDataFrame() can take a pandas.DataFrame as an input. I’m using Spark version 3.0.1.
Other questions on SO related to this don’t mention the index column disappearing:
- This one about converting Pandas to PySpark doesn’t cover it.
- Same with this one.
- And this one relates to data dropping during conversion, but is more about window functions.
I’m probably missing something obvious, but how do I keep the index column when I convert from a pandas dataframe to a PySpark dataframe?
Answer
A Spark DataFrame has no concept of an index, so if you want to preserve it, you have to assign it to a column first using reset_index on the pandas dataframe.
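For example, a minimal self-contained sketch (the two-row frame and the index name "name" are just stand-ins for the question’s data_clean):

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()

# toy stand-in for data_clean, indexed by name like the original
data_clean = pd.DataFrame(
    {"transcript": ["ladies and gentle...", "thank you thank y..."]},
    index=pd.Index(["ali", "anthony"], name="name"),
)

# reset_index copies the index into an ordinary "name" column,
# which createDataFrame then keeps as a regular Spark column
sparkDF = spark.createDataFrame(data_clean.reset_index())
sparkDF.show()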
You can also use inplace to avoid additional memory overhead while resetting the index:
# drop=False (the default) keeps the index values as a new column
data_clean.reset_index(drop=False, inplace=True)
sparkDF = spark.createDataFrame(data_clean)
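If the index has no name, reset_index labels the new column index, so you may want to rename it before converting. A small sketch, where "name" is just an illustrative label:

data_clean.rename(columns={"index": "name"}, inplace=True)
sparkDF = spark.createDataFrame(data_clean)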