Getting min and max datetime for each date in a CSV

I’m kind of new to data science and Python.

First of all, do you suggest using any library other than pandas when dealing with a huge dataset (100K+ rows)?

Second of all, let me describe my current problem.

I have a dataset with a datetime column; to make it easy to understand, let’s say I only have one datetime column, named dt_col.

Here’s what my dt_col values look like:

import pandas as pd

df = pd.DataFrame({'dt_col': ["2019-03-13 08:12:23", "2019-03-13 07:10:18",
                              "2019-03-13 08:12:23", "2019-03-15 10:35:53",
                              "2019-03-20 11:12:23", "2019-03-20 08:12:23"]})

                dt_col
0  2019-03-13 08:12:23
1  2019-03-13 07:10:18
2  2019-03-13 08:12:23
3  2019-03-15 10:35:53
4  2019-03-20 11:12:23
5  2019-03-20 08:12:23

For each day, I want to extract the minimum and the maximum datetime; for example, for 2019-03-13, I want to extract 2019-03-13 07:10:18 and 2019-03-13 08:12:23.

I thought about:

  1. Getting the distinct dates (without the time) from my DataFrame
  2. For each of these dates, getting the min and max corresponding datetime from my DataFrame

I’m kind of stuck at step 2, as I don’t really know how to achieve this in Python. I mean, I can do it the “old way” with some loops, but I don’t think that will do the job with a large dataset.

Btw, here’s what I’ve done for step 1:

dates=pd.to_datetime(df.dt_col)
distinct_dates=dates.dt.strftime("%Y-%m-%d").unique()
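
As a side note, I guess step 1 could also be done without going through strings; dt.normalize() zeroes out the time but keeps the datetime64 dtype, which I assume is cheaper on a big frame:

dates=pd.to_datetime(df.dt_col)
# normalize() keeps datetime64 values (time set to midnight) instead of strings
distinct_dates=dates.dt.normalize().unique()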

Once I have those min and max values, I want to generate datetime rows between each min and max; for example, between 2019-03-13 07:10:18 and 2019-03-13 08:12:23, I want to get 2019-03-13 07:10:18, 2019-03-13 07:10:19, 2019-03-13 07:10:20, 2019-03-13 07:10:21, 2019-03-13 07:10:22, … until 2019-03-13 08:12:23.

I think this can be achieved using pd.date_range. So once I have my min and max, I’m thinking of using pd.date_range to do something like this:

dates=[]
# assuming a frame that already has 'min' and 'max' columns per day
for index, row in df.iterrows():
    dates.append(pd.date_range(start=row['min'], end=row['max'], freq='1S'))
print(dates)

But I know that iterrows is painfully slow, so I’m asking you guys for the best way to achieve this with a huge dataset.


Answer

In case dt_col is not of datetime dtype, you need to convert it to datetime first:

df.dt_col = pd.to_datetime(df.dt_col)
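
As a side note, on a large file parsing is usually faster if you pass an explicit format, assuming all your timestamps follow the "%Y-%m-%d %H:%M:%S" pattern shown in the question:

# explicit format skips per-row format inference
df.dt_col = pd.to_datetime(df.dt_col, format="%Y-%m-%d %H:%M:%S")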

Next, try this:

df1 = df.groupby(df.dt_col.dt.date).dt_col.agg(['min', 'max'])

Out[555]:
                           min                 max
dt_col
2019-03-13 2019-03-13 07:10:18 2019-03-13 08:12:23
2019-03-15 2019-03-15 10:35:53 2019-03-15 10:35:53
2019-03-20 2019-03-20 08:12:23 2019-03-20 11:12:23
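
A small variation if grouping speed matters on 100K+ rows: dt.date yields Python date objects (object dtype), while dt.normalize() keeps datetime64 keys, which usually group faster. The same aggregation, sketched with normalize:

# group key stays datetime64 (midnight timestamps) instead of object dtype
df1 = df.groupby(df.dt_col.dt.normalize()).dt_col.agg(['min', 'max'])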

After you have the min and max, you may create a range in seconds with pd.date_range or by resampling. I think pd.date_range with a list comprehension may be faster than resampling in your case. Here it is:

time_arr = [pd.date_range(df1.loc[ix,'min'], df1.loc[ix,'max'], freq='S') 
                       for ix in df1.index]

Or

time_arr = [pd.date_range(x[0], x[1], freq='S') for x in df1.values]

Out[640]:
[DatetimeIndex(['2019-03-13 07:10:18', '2019-03-13 07:10:19',
                '2019-03-13 07:10:20', '2019-03-13 07:10:21',
                '2019-03-13 07:10:22', '2019-03-13 07:10:23',
                '2019-03-13 07:10:24', '2019-03-13 07:10:25',
                '2019-03-13 07:10:26', '2019-03-13 07:10:27',
                ...
                '2019-03-13 08:12:14', '2019-03-13 08:12:15',
                '2019-03-13 08:12:16', '2019-03-13 08:12:17',
                '2019-03-13 08:12:18', '2019-03-13 08:12:19',
                '2019-03-13 08:12:20', '2019-03-13 08:12:21',
                '2019-03-13 08:12:22', '2019-03-13 08:12:23'],
               dtype='datetime64[ns]', length=3726, freq='S'),
 DatetimeIndex(['2019-03-15 10:35:53'], dtype='datetime64[ns]', freq='S'),
 DatetimeIndex(['2019-03-20 08:12:23', '2019-03-20 08:12:24',
                '2019-03-20 08:12:25', '2019-03-20 08:12:26',
                '2019-03-20 08:12:27', '2019-03-20 08:12:28',
                '2019-03-20 08:12:29', '2019-03-20 08:12:30',
                '2019-03-20 08:12:31', '2019-03-20 08:12:32',
                ...
                '2019-03-20 11:12:14', '2019-03-20 11:12:15',
                '2019-03-20 11:12:16', '2019-03-20 11:12:17',
                '2019-03-20 11:12:18', '2019-03-20 11:12:19',
                '2019-03-20 11:12:20', '2019-03-20 11:12:21',
                '2019-03-20 11:12:22', '2019-03-20 11:12:23'],
               dtype='datetime64[ns]', length=10801, freq='S')]
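
If you want all the generated timestamps as rows of one Series instead of a list of indexes, you can stitch the ranges together. A minimal sketch, continuing from the time_arr above:

# concatenate the per-day ranges into a single long datetime Series
expanded = pd.concat([pd.Series(rng) for rng in time_arr], ignore_index=True)
print(len(expanded))   # 14528 for the sample data (3726 + 1 + 10801)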

Note: if your dataset is too big and you create ranges at one-second frequency, you may run out of memory and crash.
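
You can estimate that cost up front without materializing anything: each 1-second range yields its span in seconds plus 1 timestamps, and every datetime64[ns] value takes 8 bytes. A sketch using df1 from above:

# total number of timestamps the ranges would produce, computed from the spans
n_rows = int((df1['max'] - df1['min']).dt.total_seconds().sum()) + len(df1)
print(n_rows, n_rows * 8)   # row count and approximate bytes for the values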
