I have a file with a timestamp column. When I read the file with a schema I defined myself, the datetime column is populated with null.
Source file has data as below
```
created_date
31-AUG-2016 02:48:38
31-AUG-2016 10:37:59
31-AUG-2016 23:37:51
```
and I am using the below code snippet:

```python
from pyspark.sql.types import *

Raw_schema = StructType([StructField("created_date", DateType(), True)])

DF = (spark.read
      .format("csv")
      .option("header", "true")
      .schema(Raw_schema)
      .load("path"))
DF.display()
```

```
created_date
null
null
null
```
In the above, DF.display() shows null for all the inputs. However, my expected output is as below:
```
Created_Date
31-08-2016
31-08-2016
31-08-2016
```
Answer
You need to provide the date format explicitly, because the format used in the CSV file is non-standard. Since the schema declares the column as DateType, only the date part of each parsed value is kept.
```python
df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("dateFormat", "dd-MMM-yyyy HH:mm:ss")
      .schema(Raw_schema)
      .load("filepath"))
df.show()
```

```
+------------+
|created_date|
+------------+
|  2016-08-31|
|  2016-08-31|
|  2016-08-31|
+------------+
```
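If you would rather keep the time component instead of truncating it, a variation is to declare the column as TimestampType and set `timestampFormat`, then derive a date column afterwards. The sketch below is not from the original post; it assumes the same file layout and uses a hypothetical path `data.csv` and column name `created_day`.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date
from pyspark.sql.types import StructType, StructField, TimestampType

spark = SparkSession.builder.getOrCreate()

# Read created_date as a full timestamp instead of a date.
ts_schema = StructType([StructField("created_date", TimestampType(), True)])

df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("timestampFormat", "dd-MMM-yyyy HH:mm:ss")  # same pattern as above
      .schema(ts_schema)
      .load("data.csv"))  # hypothetical path

# Add a date-only column while keeping the original timestamp.
df = df.withColumn("created_day", to_date("created_date"))
df.show(truncate=False)
```

This keeps both granularities available, so downstream code can choose between the timestamp and the date.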