Why are Python dates such a mess and what can I do about it?

Dates are a common source of errors in my Python codebase. Specifically, the problem is the number of different date and datetime implementations, and how comparisons between them are handled.

These are the date types in my codebase:

import datetime
import pandas as pd
import pendulum
import polars as pl

x1 = pd.to_datetime('2020-10-01')
x2 = datetime.datetime(2020, 10, 1)
x3 = pl.DataFrame({'i':[x2]}).select(pl.col('i').cast(pl.Date)).to_numpy()[0,0]
x4 = pl.DataFrame({'i':[x2]}).select(pl.col('i').cast(pl.Datetime)).to_numpy()[0,0]
x5 = pendulum.parse('2020-10-01')
x6 = x5.date()
x7 = x1.date()

You can print them to see:

x1=2020-10-01 00:00:00           , type(x1)=<class 'pandas._libs.tslibs.timestamps.Timestamp'>
x2=2020-10-01 00:00:00           , type(x2)=<class 'datetime.datetime'>
x3=2020-10-01                    , type(x3)=<class 'numpy.datetime64'>
x4=2020-10-01T00:00:00.000000    , type(x4)=<class 'numpy.datetime64'>
x5=2020-10-01T00:00:00+00:00     , type(x5)=<class 'pendulum.datetime.DateTime'>
x6=2020-10-01                    , type(x6)=<class 'pendulum.date.Date'>
x7=2020-10-01                    , type(x7)=<class 'datetime.date'>

Is there a canonical date representation in Python? I suppose x7: datetime.date is probably closest…

Comparisons are also a nightmare. Here is a table of the results of xi == xj (row xi, column xj):

                                                        x1     x2     x3     x4     x5     x6     x7
x1: <class 'pandas._libs.tslibs.timestamps.Timestamp'>  True   True   ERROR  True   False  True   True
x2: <class 'datetime.datetime'>                         True   True   False  True   False  False  False
x3: <class 'numpy.datetime64'>                          True   False  True   True   False  True   True
x4: <class 'numpy.datetime64'>                          True   True   True   True   False  False  False
x5: <class 'pendulum.datetime.DateTime'>                False  False  False  False  True   False  False
x6: <class 'pendulum.date.Date'>                        True   True   True   False  False  True   True
x7: <class 'datetime.date'>                             True   False  True   False  False  True   True

ERROR (x1 == x3): Only resolutions 's', 'ms', 'us', 'ns' are supported.

Also note that it's not even symmetric: in the table above, x2 == x6 is False while x6 == x2 is True.

Ordering comparisons are even stranger. Here is xi >= xj, where red represents an ERROR:

[Image: matrix of xi >= xj results, with ERROR cells in red]

As you can imagine, there is an ever-growing amount of glue code to keep this under control. Is there any advice on how to handle date and datetime types in Python?

For simplicity:

  • I never need timezone data; everything should always be UTC.
  • Sometimes dates are passed around as strings for convenience (e.g. parsed from JSON).
  • I need at most seconds resolution, but 99% of my work uses only dates.

Answer

All listed types can be converted to numpy datetime64. If you don't need more than seconds resolution, you can optionally set the unit to 's'. For example:

import numpy as np

# Python datetime.datetime
x2_np = np.datetime64(x2.replace(tzinfo=None), 's')
print(x2_np, repr(x2_np))
# 2020-10-01T00:00:00 numpy.datetime64('2020-10-01T00:00:00')

# Python datetime.date (x6 is a pendulum Date, which subclasses datetime.date)
x6_np = np.datetime64(x6, 's')
print(x6_np, repr(x6_np))
# 2020-10-01T00:00:00 numpy.datetime64('2020-10-01T00:00:00')

# pendulum datetime
x5_np = np.datetime64(x5.replace(tzinfo=None), 's')
print(x5_np, repr(x5_np))
# 2020-10-01T00:00:00 numpy.datetime64('2020-10-01T00:00:00')

# pd.Timestamp
x1_np = x1.to_numpy().astype('datetime64[s]')
print(x1_np, repr(x1_np))
# 2020-10-01T00:00:00 numpy.datetime64('2020-10-01T00:00:00')

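Since numpy's datetime64 carries no time zone information (values are treated as UTC), make sure to strip the tzinfo from datetime.datetime and pendulum DateTime values, should it be set there. Note that replace(tzinfo=None) only drops the offset without shifting the time, which is fine here because everything is assumed to be UTC already.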
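If a value might carry a non-UTC offset after all, dropping the tzinfo alone would keep the local wall-clock time. The following is a minimal sketch of shifting to UTC first, using only the standard library; `to_naive_utc` is a hypothetical helper name, not part of any library, and it assumes that naive inputs are already in UTC.

```python
import datetime

def to_naive_utc(dt):
    """Return dt as a naive datetime expressed in UTC (hypothetical helper)."""
    if dt.tzinfo is None:
        # Assumption: naive values are already in UTC.
        return dt
    # Shift to UTC first, then drop the tzinfo.
    return dt.astimezone(datetime.timezone.utc).replace(tzinfo=None)

# 02:00 at UTC+2 is midnight UTC.
plus_two = datetime.timezone(datetime.timedelta(hours=2))
aware = datetime.datetime(2020, 10, 1, 2, 0, tzinfo=plus_two)

print(aware.replace(tzinfo=None))  # 2020-10-01 02:00:00 (offset dropped, time unchanged)
print(to_naive_utc(aware))         # 2020-10-01 00:00:00
```

The result is a plain naive datetime.datetime, so it can be passed to np.datetime64 in the same way as x2 above.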

Now you could put this all in one converter function that is essentially a big switch case. Use it with caution on big datasets, however: it converts one value at a time, and convenience does not come for free most of the time. For example:

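def convert_dt_to_numpy(dt, unit='s'):
    # pd.Timestamp subclasses datetime.datetime, so check it first;
    # otherwise this branch would never be reached
    if isinstance(dt, pd.Timestamp):
        return dt.to_numpy().astype(f'datetime64[{unit}]')
    # datetime.datetime subclasses datetime.date, so check it before date
    if isinstance(dt, (datetime.datetime, pendulum.DateTime)):
        return np.datetime64(dt.replace(tzinfo=None), unit)
    if isinstance(dt, (datetime.date, pendulum.Date)):
        return np.datetime64(dt, unit)
    raise ValueError(f"conversion for '{dt}' of {type(dt)} unknown")

# the last item, 7, is not a date type and raises the ValueError
for dt in (x1, x2, x6, x5, 7):
    print(convert_dt_to_numpy(dt))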
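
The question also mentions dates arriving as strings. One option is to parse them once at the boundary, before any conversion. This is only a sketch: it assumes the strings are ISO 8601 and naive (UTC by convention), and `parse_iso_seconds` is a hypothetical helper name.

```python
import datetime

def parse_iso_seconds(text):
    """Parse an ISO 8601 date or datetime string, truncated to whole seconds."""
    # fromisoformat accepts 'YYYY-MM-DD' as well as 'YYYY-MM-DDTHH:MM:SS.ffffff';
    # a date-only string becomes midnight of that day.
    return datetime.datetime.fromisoformat(text).replace(microsecond=0)

print(parse_iso_seconds('2020-10-01'))                  # 2020-10-01 00:00:00
print(parse_iso_seconds('2020-10-01T12:34:56.789000'))  # 2020-10-01 12:34:56
```

The result is a plain datetime.datetime, so it would go through the datetime.datetime branch of a converter like the one above.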