I have a file with a timestamp column. When I read the file with a schema I defined myself, the datetime column is populated with null.
The source file has data as below:

created_date
31-AUG-2016 02:48:38
31-AUG-2016 10:37:59
31-AUG-2016 23:37:51
I am using the below code snippet:

from pyspark.sql.types import *

Raw_schema = StructType([StructField("created_date", DateType(), True)])

DF = spark.read.format("csv").option("header", "true").schema(Raw_schema).load("path")
DF.display()

created_date
null
null
null
In the above, DF.display() shows null for all the inputs. However, my expected output is as below:
Created_Date
31-08-2016
31-08-2016
31-08-2016
Answer
You need to provide the date format via the dateFormat option, because the format in the CSV file is not Spark's default (yyyy-MM-dd).
df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("dateFormat", "dd-MMM-yyyy HH:mm:ss")
      .schema(Raw_schema)
      .load("filepath")
      )

df.show()
+------------+
|created_date|
+------------+
| 2016-08-31|
| 2016-08-31|
| 2016-08-31|
+------------+
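As a quick sanity check of the pattern outside Spark, the equivalent Python strptime format for dd-MMM-yyyy HH:mm:ss is %d-%b-%Y %H:%M:%S; a minimal sketch parsing one of the sample values from the question:

```python
from datetime import datetime

# Spark's pattern dd-MMM-yyyy HH:mm:ss maps to strptime's %d-%b-%Y %H:%M:%S.
# Python's %b matches abbreviated month names case-insensitively,
# so the uppercase "AUG" in the source data parses without issue.
sample = "31-AUG-2016 02:48:38"
parsed = datetime.strptime(sample, "%d-%b-%Y %H:%M:%S")
print(parsed.date())  # → 2016-08-31
```

One caveat: Spark 3's newer datetime parser is stricter than the old SimpleDateFormat-based one, so if uppercase month abbreviations like AUG fail to parse on your cluster, setting spark.sql.legacy.timeParserPolicy to LEGACY restores the older, case-insensitive behaviour.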