Convert a date-month-year time string to date format in PySpark

I have a file with a timestamp column. When I read the file with a schema I defined myself, the datetime column is populated with null.

The source file has data as below:

created_date
31-AUG-2016 02:48:38
31-AUG-2016 10:37:59
31-AUG-2016 23:37:51

I am using the code snippet below:

from pyspark.sql.types import StructType, StructField, DateType

# Schema with a single nullable date column
Raw_schema = StructType([StructField("created_date", DateType(), True)])

DF = spark.read.format("csv").option("header", "true").schema(Raw_schema).load("path")
DF.display()

created_date
null
null
null

In the above, DF.display() shows null for every input. However, my expected output is as below:

Created_Date
31-08-2016 
31-08-2016 
31-08-2016 

Answer

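You need to provide the date format explicitly, because the values in the CSV file do not match the default pattern (yyyy-MM-dd) that Spark's CSV reader uses for a DateType column. Values that fail to parse are read as null.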

df = (spark.read
    .format("csv")
    .option("header","true")
    .option("dateFormat", "dd-MMM-yyyy HH:mm:ss")
    .schema(Raw_schema)
    .load("filepath")
)

df.show()
+------------+
|created_date|
+------------+
|  2016-08-31|
|  2016-08-31|
|  2016-08-31|
+------------+
User contributions licensed under: CC BY-SA