I want to know if there is a better way of transforming a date column into a datetime (timestamp) column shifted forward by one hour than the method I am currently using.
Here is my dataframe:
from pyspark.sql.functions import col

df = sc.parallelize([
    ['2019-08-29'],
    ['2019-08-30'],
    ['2019-09-1'],
    ['2019-09-2'],
    ['2019-09-4'],
    ['2019-09-10']
]).toDF(['DATE']).withColumn('DATE', col('DATE').cast('date'))
My code:
df1 = df.withColumn(
    'DATETIME',
    # date -> timestamp -> epoch seconds, add 3600 (1 hour), then back to timestamp
    (col('DATE').cast('timestamp').cast('long') + 3600).cast('timestamp')
)
Which gives the output:
+----------+-------------------+
|      DATE|           DATETIME|
+----------+-------------------+
|2019-08-29|2019-08-29 01:00:00|
|2019-08-30|2019-08-30 01:00:00|
|2019-09-01|2019-09-01 01:00:00|
|2019-09-02|2019-09-02 01:00:00|
|2019-09-04|2019-09-04 01:00:00|
|2019-09-10|2019-09-10 01:00:00|
+----------+-------------------+
Does anyone know a more efficient way of doing this? Casting to a timestamp twice seems a bit clumsy.
Many thanks.
Answer
You can use something like this:
from pyspark.sql.functions import col, expr

df1 = df.withColumn(
    'DATETIME',
    # stay in timestamp types: add an interval literal instead of epoch arithmetic
    col('DATE').cast('timestamp') + expr('INTERVAL 1 HOURS')
)
You can read more about the syntax for intervals, for example, in the following blog post from Databricks.
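For completeness, here is a self-contained sketch of that approach. The spark and sc variables are assumptions for a local run (the original post relies on an existing SparkContext sc):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr

# Assumed local session; adjust to your environment.
spark = SparkSession.builder.master('local[*]').getOrCreate()
sc = spark.sparkContext

df = sc.parallelize([
    ['2019-08-29'], ['2019-08-30'], ['2019-09-1'],
    ['2019-09-2'], ['2019-09-4'], ['2019-09-10']
]).toDF(['DATE']).withColumn('DATE', col('DATE').cast('date'))

# Shift the midnight timestamp forward by one hour with an interval literal.
df1 = df.withColumn('DATETIME', col('DATE').cast('timestamp') + expr('INTERVAL 1 HOURS'))
df1.show()

Because the interval addition never leaves the timestamp type, there is no round trip through long as in the original approach.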