I have a dataframe with the following timestamps, 5,600,000 rows in total:
time                                SD
2016-01-02 11:23:04.299000+00:00    1
2016-02-02 11:23:04.299000+00:00    2
2016-03-02 11:23:04.299000+00:00    2
2016-04-01 11:23:04.299000+00:00    3
The following operation throws an error:
(pd.to_datetime("2016-01-02 11:23:04.299000+00:00") - pd.Timestamp("1970-01-01"))
TypeError: Timestamp subtraction must have the same timezones or no timezones
I know what the problem is: the first timestamp is timezone-aware and the second is not. It can be solved by dropping the offset:
(pd.to_datetime("2016-01-02 11:23:04") - pd.Timestamp("1970-01-01"))
I am looking for an optimized solution, because I don't think iterating over every row of "time" is a good approach; it takes a lot of time.
There is also a chance that I will get a different format from the one currently in the dataframe.
Is there anything I can do with pandas? Thanks in advance!
Answer
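IIUC, this uses pandas and NumPy to get the date difference between the two columns, even though the second column has no timezone. Subtracting the underlying NumPy arrays (.values) instead of the pandas columns sidesteps the timezone check: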
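import pandas as pd

# Parse the whole column at once; utc=True yields timezone-aware UTC values.
df['time'] = pd.to_datetime(df['time'], utc=True)
# Timezone-naive reference date, broadcast to every row.
df['date_to_subtract'] = pd.to_datetime('1970-01-01')
# .values returns NumPy datetime64 arrays (in UTC, with no timezone attached),
# so the subtraction is vectorized and does not raise the timezone TypeError.
df['difference'] = df['time'].values - df['date_to_subtract'].values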
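Another way to avoid the error, sketched here under a few assumptions, is to make the epoch itself timezone-aware so that both sides of the subtraction are in UTC. The small sample frame and the `epoch_seconds` column name are hypothetical and only there for illustration; the `time` strings are assumed to carry a UTC offset, as in the question.

```python
import pandas as pd

# Hypothetical sample mirroring the data in the question.
df = pd.DataFrame({
    "time": [
        "2016-01-02 11:23:04.299000+00:00",
        "2016-02-02 11:23:04.299000+00:00",
    ],
    "SD": [1, 2],
})

# utc=True parses every value as a timezone-aware UTC timestamp.
df["time"] = pd.to_datetime(df["time"], utc=True)

# A timezone-aware epoch, so both operands share the same timezone.
epoch = pd.Timestamp("1970-01-01", tz="UTC")

# Vectorized subtraction over the whole column, no row iteration.
df["difference"] = df["time"] - epoch

# Optional: seconds since the epoch as floats (hypothetical column name).
df["epoch_seconds"] = df["difference"].dt.total_seconds()
```

This keeps the timezone information on the `time` column instead of discarding it, which may matter if the data later arrives with offsets other than +00:00.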