New Python user here, so I appreciate any ideas for best practices in addition to the issue I’m seeking advice on.
I have code that works for a small number of records, but it takes too long when I run it on the large dataframe I need to process. I’ve done a lot of reading on this issue, and there are several similar posts on SO, but none have given me enough clues to resolve it. First, let me show the input data:
ID | bh_start_dt_list | bh_$_amt_list | bh_cnt | orig_mth_$ | group_rank | cal_yr | dur_mth | bh_start_index |
---|---|---|---|---|---|---|---|---|
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 1 | 2005 | 8 | -1 |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 2 | 2005 | 9 | NaN |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 3 | 2005 | 10 | NaN |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 4 | 2005 | 11 | NaN |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 5 | 2005 | 12 | NaN |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 6 | 2006 | 1 | NaN |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 7 | 2006 | 2 | NaN |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 8 | 2006 | 3 | NaN |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 9 | 2006 | 4 | NaN |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 10 | 2006 | 5 | NaN |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 11 | 2006 | 6 | NaN |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 12 | 2006 | 7 | 0 |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 13 | 2006 | 8 | NaN |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 14 | 2006 | 9 | NaN |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 15 | 2006 | 10 | NaN |
bh_cnt is the count of items in the bh_start_dt_list and bh_$_amt_list lists
Each date in bh_start_dt_list corresponds positionally to the amount in bh_$_amt_list
group_rank was assigned using this code:
df['group_rank'] = df.groupby(['ID'])['ID'].rank(method="first", ascending=True)
Thus, when the ID changes, the ranking starts over at 1.
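For example, with two hypothetical IDs the rank resets like this:

import pandas as pd

demo = pd.DataFrame({'ID': [123, 123, 123, 456, 456]})
demo['group_rank'] = demo.groupby(['ID'])['ID'].rank(method="first", ascending=True)
# ID 123 gets ranks 1, 2, 3; ID 456 starts over at 1, 2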
bh_start_index was assigned using this code:
def bh_idx_df(x):
    return next((i for i, y in enumerate(x['bh_start_dt_list'])
                 if int(str(y)[0:4]) == x['cal_yr'] and int(str(y)[5:7]) == x['dur_mth']),
                None)

df['bh_start_index'] = df.apply(bh_idx_df, axis=1)
So, essentially if the dur_mth month and cal_yr year match a date in the bh_start_dt_list list, I ultimately want to populate a new column with the corresponding indexed value from the bh_$_amt_list. I thought that identifying the actual index first was prudent. As shown in the data above, the first date match was found in July 2006 (bh_start_index == 0). I have separate code to assign bh_start_index = -1 (where group_rank == 1) for the first rank row so that I can populate it with the default $ amount in the orig_mth_$ column.
This code all works fine so far, I think. I haven’t tested the specific loop in the function [next((i for i, y in enumerate(x['bh_start_dt_list'])] against all the data yet. It’s been my experience that Python does not do well with loops on large datasets, and I’ll be running millions of records through this code. I’m trying to write code that works, and then refine it. Please let me know if you think this loop is part of the inefficiency problem.
Finally, the NaN values in the bh_start_index column need to be populated too. Once each of those has a value, it will be easy to reference that index to populate a new column with a $ amount from the bh_$_amt_list. Here is the code I was using to do that:
def bh_adj_idx_df(x):
    for i in range(-1, int(x['bh_cnt'])):
        # 1st BH record
        if i == -1:
            if x['group_rank'] <= list(DF.loc[DF['bh_start_index'] == 0]['group_rank'])[0]:
                return i
        # Last BH record
        elif i == x['bh_cnt'] - 1:
            if x['group_rank'] > list(DF.loc[DF['bh_start_index'] == i]['group_rank'])[0]:
                return i
        # All other BH records
        elif (list(DF.loc[DF['bh_start_index'] == i]['group_rank'])[0]
              < x['group_rank']
              <= list(DF.loc[DF['bh_start_index'] == i + 1]['group_rank'])[0]):
            return i

df['bh_adj_index'] = df.apply(bh_adj_idx_df, axis=1)
I wrote the dataframe name DF in caps to highlight what I believe is a problem. I would expect the function bh_adj_idx_df(x) to reference x throughout its body, but when I use only that alias I get this error:
KeyError: 'False: boolean label can not be used without a boolean index'
What does this mean, and why does using the literal DF dataframe name avoid the error? I assume that mixing the two dataframe references inside the function contributes to the inefficiency.
Here is the desired output from that code (correct, but produced inefficiently):
ID | bh_start_dt_list | bh_$_amt_list | bh_cnt | orig_mth_$ | group_rank | cal_yr | dur_mth | bh_start_index | bh_adj_index |
---|---|---|---|---|---|---|---|---|---|
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 1 | 2005 | 8 | -1 | -1 |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 2 | 2005 | 9 | NaN | -1 |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 3 | 2005 | 10 | NaN | -1 |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 4 | 2005 | 11 | NaN | -1 |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 5 | 2005 | 12 | NaN | -1 |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 6 | 2006 | 1 | NaN | -1 |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 7 | 2006 | 2 | NaN | -1 |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 8 | 2006 | 3 | NaN | -1 |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 9 | 2006 | 4 | NaN | -1 |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 10 | 2006 | 5 | NaN | -1 |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 11 | 2006 | 6 | NaN | -1 |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 12 | 2006 | 7 | 0 | -1 |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 13 | 2006 | 8 | NaN | 0 |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 14 | 2006 | 9 | NaN | 0 |
123 | [2006-07-26, 2016-01-12, 2016-02-22] | [16.72, 15.48, 16.72] | 3 | 15.48 | 15 | 2006 | 10 | NaN | 0 |
I want the default (-1) to populate up through the row with the first identified date from bh_start_dt_list. Then the following rows would get the corresponding index value into bh_$_amt_list, until a new date is identified, and so on. Here is the code to assign the dollar amount based on the index value:
def ep_be_df(x):
    if x['bh_adj_index'] == -1:
        return x['orig_mth_$']
    else:
        return x['bh_$_amt_list'][int(x['bh_adj_index'])]

df['the_correct_amt'] = df.apply(ep_be_df, axis=1)
If you read this far, congratulations and I thank you.
Can anyone suggest any numpy techniques for what I’m trying to do here? I understand numpy.where is much more efficient.
Can anyone see a way to do this without looping through the date list?
Regards,
Tim
Answer
This is a long question; it might be easier to answer if it were simplified into a more concise example. For now I’ll discuss it more broadly, hoping to help you to the next step on the way to solving this.
Pandas and numpy work well on tabular data, which your example doesn’t quite look like to me. The goal is to avoid nested looping over the data in Python syntax, which is slow. I’d reformat the input as tidy data (one observation per row, one variable per column, a single value per cell), after which the efficient vectorized numpy operations become much easier to apply. The looping through the lists can then be done by pandas/numpy instead of Python.
In your example we’d denormalize the data into something more tabular. Specifically, I’d unnest the lists inside the cells into separate observations/rows, which results in the other cell values being repeated.
Once we have this denormalized input format without the lists, I believe it will be easier to see how numpy can solve this. That feels very doable to me; it’s just obscured by the current input data format.
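Here is a sketch of that reshaping, assuming the list columns hold actual Python lists (if they are stored as strings they’d need parsing first). DataFrame.explode accepts multiple columns as of pandas 1.3:

import pandas as pd

# One row per (original row, list item); the parallel lists must be equal length.
# The original index repeats after explode, which we can use later to recover
# each item's position within its list.
tidy = df.explode(['bh_start_dt_list', 'bh_$_amt_list'])
tidy['bh_start_dt_list'] = pd.to_datetime(tidy['bh_start_dt_list'])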
Example of a tidy version of the input data:
ID | bh_start_dt_list | bh_$_amt_list | bh_cnt | orig_mth_$ | group_rank | cal_yr | dur_mth | bh_start_index |
---|---|---|---|---|---|---|---|---|
123 | 26/07/2006 | 16.72 | 3 | 15.48 | 1 | 2005 | 8 | -1 |
123 | 12/01/2016 | 15.48 | 3 | 15.48 | 1 | 2005 | 8 | -1 |
123 | 22/02/2016 | 16.72 | 3 | 15.48 | 1 | 2005 | 8 | -1 |
123 | 26/07/2006 | 16.72 | 3 | 15.48 | 2 | 2005 | 9 | NaN |
123 | 12/01/2016 | 15.48 | 3 | 15.48 | 2 | 2005 | 9 | NaN |
123 | 22/02/2016 | 16.72 | 3 | 15.48 | 2 | 2005 | 9 | NaN |
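Once the data is in that shape, the year/month match becomes a vectorized comparison rather than a per-row Python loop. A rough sketch, reusing the column names from your question (the downstream steps would still need to be worked out and verified):

import numpy as np

# Position of each exploded item within its original list
# (the original index repeats after explode, so cumcount recovers it).
tidy['list_pos'] = tidy.groupby(level=0).cumcount()

# Vectorized year/month comparison; no Python-level loop over the dates.
dt = tidy['bh_start_dt_list']
is_match = (dt.dt.year == tidy['cal_yr']) & (dt.dt.month == tidy['dur_mth'])

# bh_start_index: the item's list position where it matches, else NaN.
tidy['bh_start_index'] = np.where(is_match, tidy['list_pos'], np.nan)

From there, the carry-forward behavior you describe for bh_adj_index looks like a per-ID forward fill (e.g. tidy.groupby('ID')['bh_start_index'].ffill() after seeding the -1 default), again without a row-by-row apply, but that is easier to confirm once the data is actually in this shape.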