A common source of errors in my Python codebase is dates: specifically, the different implementations of dates and datetimes, and how comparisons between them are handled.
These are the date types in my codebase:
    import datetime

    import pandas as pd
    import pendulum
    import polars as pl

    x1 = pd.to_datetime('2020-10-01')
    x2 = datetime.datetime(2020, 10, 1)
    x3 = pl.DataFrame({'i': [x2]}).select(pl.col('i').cast(pl.Date)).to_numpy()[0, 0]
    x4 = pl.DataFrame({'i': [x2]}).select(pl.col('i').cast(pl.Datetime)).to_numpy()[0, 0]
    x5 = pendulum.parse('2020-10-01')
    x6 = x5.date()
    x7 = x1.date()
You can print them to see:
    x1=2020-10-01 00:00:00 , type(x1)=<class 'pandas._libs.tslibs.timestamps.Timestamp'>
    x2=2020-10-01 00:00:00 , type(x2)=<class 'datetime.datetime'>
    x3=2020-10-01 , type(x3)=<class 'numpy.datetime64'>
    x4=2020-10-01T00:00:00.000000 , type(x4)=<class 'numpy.datetime64'>
    x5=2020-10-01T00:00:00+00:00 , type(x5)=<class 'pendulum.datetime.DateTime'>
    x6=2020-10-01 , type(x6)=<class 'pendulum.date.Date'>
    x7=2020-10-01 , type(x7)=<class 'datetime.date'>
Is there a canonical date representation in Python? I suppose x7 (datetime.date) is probably the closest…
Comparisons are also a nightmare. Here is a table of the results of xi == xj (row xi, column xj):
| | x1 | x2 | x3 | x4 | x5 | x6 | x7 |
|---|---|---|---|---|---|---|---|
| x1: `<class 'pandas._libs.tslibs.timestamps.Timestamp'>` | True | True | ERROR: Only resolutions 's', 'ms', 'us', 'ns' are supported. | True | False | True | True |
| x2: `<class 'datetime.datetime'>` | True | True | False | True | False | False | False |
| x3: `<class 'numpy.datetime64'>` | True | False | True | True | False | True | True |
| x4: `<class 'numpy.datetime64'>` | True | True | True | True | False | False | False |
| x5: `<class 'pendulum.datetime.DateTime'>` | False | False | False | False | True | False | False |
| x6: `<class 'pendulum.date.Date'>` | True | True | True | False | False | True | True |
| x7: `<class 'datetime.date'>` | True | False | True | False | False | True | True |
Also note that it's not even symmetric: in the table above, x1 == x3 raises an error, while x3 == x1 returns True.
Ordering comparisons are even stranger. Here is xi >= xj:
[Figure: grid of xi >= xj results, with ERROR cells shown in red]
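Part of this behavior can be reproduced with the standard library alone. The following is a minimal sketch rather than part of the original snippet: the names `as_datetime` and `as_date` are illustrative, and it only covers the x2 / x7 pair (`datetime.datetime` against `datetime.date`).

```python
import datetime

as_datetime = datetime.datetime(2020, 10, 1)  # same kind of value as x2
as_date = datetime.date(2020, 10, 1)          # same kind of value as x7

# Equality between a datetime and a date never raises; it reports "not equal".
equal = as_datetime == as_date

# Ordering comparisons between the two types raise TypeError instead.
try:
    as_datetime >= as_date
    ordering_error = None
except TypeError as exc:
    ordering_error = exc

# Converting both sides to the same type first makes the comparison well defined.
same_day = as_datetime.date() == as_date

print(equal, type(ordering_error).__name__, same_day)  # False TypeError True
```

So even without any third-party types, `==` silently returns False while `>=` fails loudly, which is why both sides need to be brought to one type before comparing.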
As you can imagine, there is an ever-growing amount of glue code to keep this under control. Is there any advice on how to handle date and datetime types in Python?
For simplicity:
- I never need timezone data, everything should always be UTC
- Sometimes dates are passed around as strings for convenience (e.g. parsed from JSON)
- I need at most second resolution, but 99% of my work uses only dates.
Answer
All listed types can be converted to numpy datetime64. If you don't need more than second resolution, you can optionally set the unit to 's'. For example:
    import numpy as np

    # Python datetime.datetime
    x2_np = np.datetime64(x2.replace(tzinfo=None), 's')
    print(x2_np, repr(x2_np))
    # 2020-10-01T00:00:00 numpy.datetime64('2020-10-01T00:00:00')

    # datetime.date (x6 is a pendulum Date, which subclasses datetime.date)
    x6_np = np.datetime64(x6, 's')
    print(x6_np, repr(x6_np))
    # 2020-10-01T00:00:00 numpy.datetime64('2020-10-01T00:00:00')

    # pendulum datetime
    x5_np = np.datetime64(x5.replace(tzinfo=None), 's')
    print(x5_np, repr(x5_np))
    # 2020-10-01T00:00:00 numpy.datetime64('2020-10-01T00:00:00')

    # pd.Timestamp
    x1_np = x1.to_numpy().astype('datetime64[s]')
    print(x1_np, repr(x1_np))
    # 2020-10-01T00:00:00 numpy.datetime64('2020-10-01T00:00:00')
Since numpy tries to avoid time zones (it defaults to UTC), make sure to strip the tzinfo from datetime.datetime and pendulum.DateTime objects, should it be set there.
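One caveat is that `replace(tzinfo=None)` only drops the offset; it does not shift the clock time. A minimal stdlib sketch, assuming an input that carries a non-UTC offset (the `+02:00` value and the helper name `to_naive_utc` are made up for illustration), shows why converting to UTC first can matter:

```python
import datetime

def to_naive_utc(dt: datetime.datetime) -> datetime.datetime:
    """Return dt as a naive datetime expressed in UTC (hypothetical helper)."""
    if dt.tzinfo is None:
        # Assumption: naive input is already in UTC.
        return dt
    return dt.astimezone(datetime.timezone.utc).replace(tzinfo=None)

plus_two = datetime.timezone(datetime.timedelta(hours=2))
aware = datetime.datetime(2020, 10, 1, 1, 30, tzinfo=plus_two)

dropped = aware.replace(tzinfo=None)  # keeps the local wall-clock time
converted = to_naive_utc(aware)       # shifts to UTC, then drops the offset

print(dropped)    # 2020-10-01 01:30:00
print(converted)  # 2020-09-30 23:30:00
```

If everything really is UTC already, as stated in the question, both approaches give the same result; the difference only shows up when a value with another offset slips in.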
You could now put all of this into one converter function, which is essentially a big switch-case. Use it with caution on big datasets, however: convenience rarely comes for free. For example:
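    def convert_dt_to_numpy(dt, unit='s'):
        # pd.Timestamp subclasses datetime.datetime, so it has to be checked first;
        # otherwise this branch would never be reached.
        if isinstance(dt, pd.Timestamp):
            return dt.to_numpy().astype(f'datetime64[{unit}]')
        # datetime.datetime subclasses datetime.date, so it comes before the date check.
        if isinstance(dt, (datetime.datetime, pendulum.DateTime)):
            return np.datetime64(dt.replace(tzinfo=None), unit)
        if isinstance(dt, (datetime.date, pendulum.Date)):
            return np.datetime64(dt, unit)
        raise ValueError(f"conversion for '{dt}' of {type(dt)} unknown")

    # The trailing 7 is not a date type, so the last iteration raises ValueError.
    for dt in (x1, x2, x6, x5, 7):
        print(convert_dt_to_numpy(dt))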
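Since the question says 99% of the work uses plain dates and some values arrive as strings, an alternative is to normalize everything to `datetime.date` at the boundary. The following is a sketch under stated assumptions, not part of the answer above: `to_date` is a hypothetical helper, strings are assumed to be ISO 8601 (`YYYY-MM-DD`, optionally followed by a time), and aware datetimes are converted to UTC before the date is taken. `pd.Timestamp` and `pendulum.DateTime` subclass `datetime.datetime`, so they should fall into the `datetime.datetime` branch, but that is not exercised here; `numpy.datetime64` values are not handled and would raise.

```python
import datetime

def to_date(value) -> datetime.date:
    """Normalize a date-like value to a plain datetime.date (hypothetical helper)."""
    if isinstance(value, str):
        # Assumes ISO 8601 input such as "2020-10-01" or "2020-10-01T12:00:00".
        value = datetime.datetime.fromisoformat(value)
    # Check datetime before date: datetime.datetime is a subclass of datetime.date.
    if isinstance(value, datetime.datetime):
        if value.tzinfo is not None:
            value = value.astimezone(datetime.timezone.utc)
        return datetime.date(value.year, value.month, value.day)
    if isinstance(value, datetime.date):
        # Rebuild the value so that subclasses collapse to a plain datetime.date.
        return datetime.date(value.year, value.month, value.day)
    raise TypeError(f"cannot convert {type(value).__name__} to datetime.date")

samples = ["2020-10-01", datetime.datetime(2020, 10, 1), datetime.date(2020, 10, 1)]
normalized = [to_date(v) for v in samples]

print(all(type(d) is datetime.date for d in normalized))  # True
print(len(set(normalized)))                               # 1
```

Once every value has exactly the same type, both `==` and `>=` behave predictably, at the cost of losing any time-of-day information.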