Is pandas.read_spss misreading datetime into unix?

Question

I have a sav file with a datetime column in %m/%d/%Y string format. When I read it in with pd.read_spss(), which doesn't seem to have any datetime-related arguments, it ends up in what looks like unix time, except that the time would be a few centuries from now with unique values including 13778726400, 13841884800, etc. When I feed the read

Accepted Answer

Dates, times and datetimes are always stored in SPSS as a number and then you add a format for displaying. SPSS continuously adds new formats while removes others. New formats have to be added manually to the pyreadstat code, while old formats stay in the code for backward compatibility. So the problem is you have found a new Date/datetime/time format that is not registered in pyreadstat.Another workaround would have been to open the file in SPSS and store it as a date/datetime/time, but with a different format pyreadstat would recognise, for example DATE11, DATETIME20 etc (the current list that pyreadstat accepts is [https://github.com/Roche/pyreadstat/blob/master/pyreadstat/_readstat_parser.pyx#L52-L54])The best when this is found is to submit a github issue describing the new format found for it to be added. I just added a few I found in the most recent SPSS documentation, and hopefully your problem should be solved in the next release (already available on dev). If not, please submit an issue with a reproducible example.The numbers SPSS uses to store the dates are not unix time, but either the number of seconds (in the case of datetimes or time) or days (in the case of dates) since 1582-10-14 (the start of the Gregorian Calendar. So you would need something like this to calculate it manually:from datetime import datetime, date, timedeltaorigin = date(1582, 10, 14)myspssvalue = 13778726400delta = timedelta(seconds=myspssvalue) # or days=myspssvalue if datepython_date = origin + deltaprint(python_date)#datetime.date(2019, 6, 1)Also, if you would assume this number is number of seconds since 1970:>>> datetime.fromtimestamp(13778726400)datetime.datetime(2406, 8, 19, 2, 0)What pandas is doing is thinking that the number you are giving is the number of nanoseconds since 1970-01-01 (it transforms it to datetime64[ns]) and that is why you get a date very close to 1970

Advertisement

Answer