I want to know if there is a better way of transforming a date column into a datetime column + 1 hour than the method I am currently using.
Here is my dataframe:
from pyspark.sql.functions import col

df = sc.parallelize([
    ['2019-08-29'],
    ['2019-08-30'],
    ['2019-09-1'],
    ['2019-09-2'],
    ['2019-09-4'],
    ['2019-09-10']
]).toDF(['DATE']).withColumn('DATE', col('DATE').cast('date'))
My code:
df1 = df.withColumn(
    'DATETIME',
    (col('DATE').cast('timestamp').cast('long') + 3600).cast('timestamp')
)
Which gives the output:
+----------+-------------------+
|      DATE|           DATETIME|
+----------+-------------------+
|2019-08-29|2019-08-29 01:00:00|
|2019-08-30|2019-08-30 01:00:00|
|2019-09-01|2019-09-01 01:00:00|
|2019-09-02|2019-09-02 01:00:00|
|2019-09-04|2019-09-04 01:00:00|
|2019-09-10|2019-09-10 01:00:00|
+----------+-------------------+
Does anyone know a more efficient way of doing this? Casting to a timestamp twice seems a bit clumsy.
Many thanks.
Answer
You can use something like this:
from pyspark.sql.functions import expr

df1 = df.withColumn(
    'DATETIME',
    col('DATE').cast('timestamp') + expr('INTERVAL 1 HOURS')
)
You can read more about the syntax for intervals in, for example, this blog post from Databricks.
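For completeness, here is a minimal, self-contained sketch of that approach. It assumes a local SparkSession called spark (instead of the sc.parallelize setup above) and a shortened sample of the same dates; the single cast plus interval addition gives the same result as the double cast in the question.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr

# Assumes a local session; adjust to your own environment.
spark = SparkSession.builder.master('local[*]').getOrCreate()

df = spark.createDataFrame(
    [['2019-08-29'], ['2019-08-30'], ['2019-09-10']], ['DATE']
).withColumn('DATE', col('DATE').cast('date'))

# Cast to timestamp once, then add an interval literal.
df1 = df.withColumn(
    'DATETIME',
    col('DATE').cast('timestamp') + expr('INTERVAL 1 HOURS')
)

df1.show()
# +----------+-------------------+
# |      DATE|           DATETIME|
# +----------+-------------------+
# |2019-08-29|2019-08-29 01:00:00|
# |2019-08-30|2019-08-30 01:00:00|
# |2019-09-10|2019-09-10 01:00:00|
# +----------+-------------------+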