I’m kind of new to data science and Python.
First of all, do you suggest using any library other than pandas when dealing with a huge dataset (100K+ rows)?
Second, let me explain my current problem.
I have a dataset with a datetime column; to make it easy to understand, let’s say I only have one datetime column, named dt_col.
Here’s what my dt_col values look like:
df = pd.DataFrame({'dt_col': ["2019-03-13 08:12:23", "2019-03-13 07:10:18",
                              "2019-03-13 08:12:23", "2019-03-15 10:35:53",
                              "2019-03-20 11:12:23", "2019-03-20 08:12:23"]})

                dt_col
0  2019-03-13 08:12:23
1  2019-03-13 07:10:18
2  2019-03-13 08:12:23
3  2019-03-15 10:35:53
4  2019-03-20 11:12:23
5  2019-03-20 08:12:23
For each day, I want to extract the minimum and the maximum datetime; for example, for 2019-03-13 I want to extract 2019-03-13 07:10:18 and 2019-03-13 08:12:23.
I thought about:
- Getting the distinct dates (without the time) from my DataFrame
- For each of these dates, getting the min and max corresponding datetime from my DataFrame
I’m kind of stuck at step 2, as I don’t know how to really achieve this in Python. I mean, I could do it the “old way” with some loops, but I don’t think that would do the job with a large dataset.
Btw, here’s what I’ve done for step 1:
dates = pd.to_datetime(df.dt_col)
distinct_dates = dates.dt.strftime("%Y-%m-%d").unique()
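A slightly simpler variant of that step (just a sketch, using the same df as above) skips the string formatting round-trip:

# Variant sketch of step 1: dt.date yields datetime.date objects directly,
# so no strftime/re-parse round-trip is needed.
import pandas as pd

dates = pd.to_datetime(df.dt_col)
distinct_dates = dates.dt.date.unique()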
Once I’ve got those min and max values, I want to generate datetime rows between each min and max; for example, between 2019-03-13 07:10:18 and 2019-03-13 08:12:23 I want to get 2019-03-13 07:10:18, 2019-03-13 07:10:19, 2019-03-13 07:10:20, 2019-03-13 07:10:21, 2019-03-13 07:10:22, … until 2019-03-13 08:12:23.
I think this can be achieved using pd.date_range. So once I’ve got my min and max, I’m thinking of using pd.date_range to do something like this:
dates = []
for index, row in df.iterrows():
    dates.append(pd.date_range(start=row['min'], end=row['max'], freq='1S'))
print(dates)
But I know that iterrows is really slow, so I’m asking you guys for the best way to achieve this with a huge dataset.
Answer
In case dt_col is not of dtype datetime, you need to convert it to datetime first:
df.dt_col = pd.to_datetime(df.dt_col)
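If some of the strings might be malformed, one optional variant (not in the original answer) is to coerce bad values to NaT instead of raising an error:

# Optional variant: invalid strings become NaT instead of raising an exception.
df.dt_col = pd.to_datetime(df.dt_col, errors='coerce')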
Next, try this
df1 = df.groupby(df.dt_col.dt.date).dt_col.agg(['min', 'max'])

Out[555]:
                            min                 max
dt_col
2019-03-13  2019-03-13 07:10:18 2019-03-13 08:12:23
2019-03-15  2019-03-15 10:35:53 2019-03-15 10:35:53
2019-03-20  2019-03-20 08:12:23 2019-03-20 11:12:23
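An equivalent spelling, if you prefer named aggregation (a purely stylistic variant, available since pandas 0.25):

# Same result as above, with the output columns named explicitly.
df1 = df.groupby(df.dt_col.dt.date).agg(min=('dt_col', 'min'),
                                        max=('dt_col', 'max'))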
After you have the min and max, you may create the per-second ranges with pd.date_range or by resampling. I think pd.date_range with a list comprehension may be faster than resampling in your case. Here it is:
time_arr = [pd.date_range(df1.loc[ix,'min'], df1.loc[ix,'max'], freq='S') for ix in df1.index]
Or
time_arr = [pd.date_range(x[0], x[1], freq='S') for x in df1.values]

Out[640]:
[DatetimeIndex(['2019-03-13 07:10:18', '2019-03-13 07:10:19',
                '2019-03-13 07:10:20', '2019-03-13 07:10:21',
                '2019-03-13 07:10:22', '2019-03-13 07:10:23',
                '2019-03-13 07:10:24', '2019-03-13 07:10:25',
                '2019-03-13 07:10:26', '2019-03-13 07:10:27',
                ...
                '2019-03-13 08:12:14', '2019-03-13 08:12:15',
                '2019-03-13 08:12:16', '2019-03-13 08:12:17',
                '2019-03-13 08:12:18', '2019-03-13 08:12:19',
                '2019-03-13 08:12:20', '2019-03-13 08:12:21',
                '2019-03-13 08:12:22', '2019-03-13 08:12:23'],
               dtype='datetime64[ns]', length=3726, freq='S'),
 DatetimeIndex(['2019-03-15 10:35:53'], dtype='datetime64[ns]', freq='S'),
 DatetimeIndex(['2019-03-20 08:12:23', '2019-03-20 08:12:24',
                '2019-03-20 08:12:25', '2019-03-20 08:12:26',
                '2019-03-20 08:12:27', '2019-03-20 08:12:28',
                '2019-03-20 08:12:29', '2019-03-20 08:12:30',
                '2019-03-20 08:12:31', '2019-03-20 08:12:32',
                ...
                '2019-03-20 11:12:14', '2019-03-20 11:12:15',
                '2019-03-20 11:12:16', '2019-03-20 11:12:17',
                '2019-03-20 11:12:18', '2019-03-20 11:12:19',
                '2019-03-20 11:12:20', '2019-03-20 11:12:21',
                '2019-03-20 11:12:22', '2019-03-20 11:12:23'],
               dtype='datetime64[ns]', length=10801, freq='S')]
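If you then want all the generated timestamps as one flat column rather than a list of indexes, a minimal sketch (assuming the time_arr from above):

import numpy as np
import pandas as pd

# Concatenate the per-day second ranges into a single datetime64[ns] Series.
all_times = pd.Series(np.concatenate(time_arr))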
Note: if your dataset is too big and you create ranges by seconds, you may run out of memory and crash.
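A cheap sanity check before materializing anything, sketched assuming the df1 with min/max columns from above, is to count how many per-second timestamps would be generated:

# date_range is inclusive of both endpoints, hence the +1 per day.
n_rows = int(((df1['max'] - df1['min']).dt.total_seconds() + 1).sum())
print(n_rows)  # each datetime64[ns] value alone takes 8 bytes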