I want to create a date range for each customer in a dataset. Each customer has its own range. How can this be done without a for loop?
Sample data:
    import pandas as pd
    dates = ['2018-01', '2018-04', '2018-10', '2018-11', '2018-12', '2018-01', '2018-04']
    customers = ['A', 'A', 'A', 'A', 'A', 'B', 'B']
    df = pd.DataFrame({'customers': customers, 'date': dates})
    df.head(10)
Now I want one row per month for each customer, covering every month from that customer's min date to its max date, to get:
    import pandas as pd
    dates = ['2018-01', '2018-02', '2018-03', '2018-04', '2018-05', '2018-06', '2018-07', '2018-08', '2018-09', '2018-10', '2018-11', '2018-12', '2018-01', '2018-02', '2018-03', '2018-04']
    customers = ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']
    df1 = pd.DataFrame({'customers': customers, 'date': dates})
    df1.head(16)
My attempt uses a for loop that iterates over each customer, but it is too slow. How can I make it faster?
    def get_date_frame(start_date, end_date):
        date_frame = pd.date_range(start=start_date, end=end_date, freq='MS')
        date_frame = pd.DataFrame(pd.DataFrame(date_frame.astype(str))[0].str[:7])
        date_frame.columns = ['date']
        return date_frame

    for idx, jk in enumerate(['A', 'B']):
        guy = df[df['customers'] == jk]['date']  # get the data for that customer
        guy.reset_index(drop=True, inplace=True)  # reset

        start = guy[0]  # first date
        end = guy[len(guy) - 1]  # last date

        dframe = get_date_frame(start, end)  # get range of dates
        dframe['customer'] = jk  # add customer id

        if idx == 0:
            out = dframe.copy()
        else:
            out = pd.concat((out, dframe.copy()), axis=0)  # concat outputs
Answer
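    # Note: '%Y-%d' reads the two digits after the dash as a day, so '2018-04'
    # becomes 2018-01-04. The daily range below then yields one row per month
    # number, which only works while a customer's dates stay within one year.
    df['date'] = pd.to_datetime(df['date'], format='%Y-%d')

    df2 = df.groupby(['customers']).apply(
        lambda x: x.set_index('date')
                   # one row per day between the customer's first and last date
                   .reindex(pd.date_range(start=x['date'].min(), end=x['date'].max()))
                   .ffill()  # fill 'customers' on the newly added rows
                   .rename_axis('date')
                   .reset_index())

    print(df2)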
                       date customers
    customers
    A         0  2018-01-01         A
              1  2018-01-02         A
              2  2018-01-03         A
              3  2018-01-04         A
              4  2018-01-05         A
              5  2018-01-06         A
              6  2018-01-07         A
              7  2018-01-08         A
              8  2018-01-09         A
              9  2018-01-10         A
              10 2018-01-11         A
              11 2018-01-12         A
    B         0  2018-01-01         B
              1  2018-01-02         B
              2  2018-01-03         B
              3  2018-01-04         B
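The listing above yields one row per month only because '%Y-%d' stores the month number in the day field. As a minimal sketch of the same groupby idea with real month parsing, assuming the sample data from the question, the version below uses format='%Y-%m' and freq='MS' (month start). The names fill_months and monthly are illustrative and not part of the original answer.

```python
import pandas as pd

dates = ['2018-01', '2018-04', '2018-10', '2018-11', '2018-12', '2018-01', '2018-04']
customers = ['A', 'A', 'A', 'A', 'A', 'B', 'B']
df = pd.DataFrame({'customers': customers, 'date': dates})

# Parse as real year-month values; the day defaults to the 1st.
df['date'] = pd.to_datetime(df['date'], format='%Y-%m')


def fill_months(group_dates: pd.Series) -> pd.Series:
    """Every month start between one customer's first and last date."""
    return pd.Series(pd.date_range(group_dates.min(), group_dates.max(), freq='MS'))


monthly = (
    df.groupby('customers')['date']
      .apply(fill_months)              # Series indexed by (customers, position)
      .rename('date')
      .reset_index(level='customers')  # bring 'customers' back as a column
      .reset_index(drop=True)
)

# Back to the 'YYYY-MM' strings used in the question.
monthly['date'] = monthly['date'].dt.strftime('%Y-%m')
print(monthly)
```

Because the dates are true month starts here, the same sketch should also cope with a customer whose range crosses a year boundary, which the day-based trick cannot.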
Further, if you want to convert the date column back to a string:
    df2 = df2.droplevel('customers')  # drop the 'customers' index level

    # The month number sits in the day field (see '%Y-%d' above), hence .dt.day
    df2['date'] = df2['date'].dt.year.astype(str) + '-' + df2['date'].dt.day.astype(str)
           date customers
    0    2018-1         A
    1    2018-2         A
    2    2018-3         A
    3    2018-4         A
    4    2018-5         A
    5    2018-6         A
    6    2018-7         A
    7    2018-8         A
    8    2018-9         A
    9   2018-10         A
    10  2018-11         A
    11  2018-12         A
    0    2018-1         B
    1    2018-2         B
    2    2018-3         B
    3    2018-4         B
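Since groupby.apply still calls a Python function once per customer, it may help to see a variant that avoids per-group calls altogether. The sketch below is not the answer's method. It assumes the 'YYYY-MM' strings from the question, converts them to running month numbers, and expands each customer's range with NumPy repeats. All intermediate names (month_no, bounds, counts, out) are illustrative.

```python
import numpy as np
import pandas as pd

dates = ['2018-01', '2018-04', '2018-10', '2018-11', '2018-12', '2018-01', '2018-04']
customers = ['A', 'A', 'A', 'A', 'A', 'B', 'B']
df = pd.DataFrame({'customers': customers, 'date': dates})

# Turn 'YYYY-MM' into a running month number so ranges are plain integers.
ym = df['date'].str.split('-', expand=True).astype(int)
month_no = ym[0] * 12 + (ym[1] - 1)

# First and last month per customer, and how many months each range spans.
bounds = month_no.groupby(df['customers']).agg(['min', 'max'])
counts = (bounds['max'] - bounds['min'] + 1).to_numpy()

# Repeat each customer once per month and add a 0..n-1 offset within each block.
block_start = np.repeat(counts.cumsum() - counts, counts)
offsets = np.arange(counts.sum()) - block_start
months = pd.Series(np.repeat(bounds['min'].to_numpy(), counts) + offsets)

out = pd.DataFrame({
    'customers': np.repeat(bounds.index.to_numpy(), counts),
    # Convert the month number back to a zero-padded 'YYYY-MM' string.
    'date': (months // 12).astype(str) + '-' + (months % 12 + 1).astype(str).str.zfill(2),
})
print(out)
```

The output keeps the zero-padded format of the input ('2018-01' rather than '2018-1'). Whether this is actually faster than groupby.apply depends on the number of customers and is worth timing on the real data.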