I’ve just completed a beginner’s course in python, so please bear with me if the code below doesn’t make sense or my issue is because of some rookie mistake.
I’ve been trying to put the learning to use by working with college production of NFL players, with a view to understanding which statistics can be predictive or at least correlate to NFL production. It turns out that there’s a lot of data out there so I have about 200 columns of data for 600 odd prospects from the last 20 years (just for running backs so far). However, one of the problems with this data is that each stat is only provided by the age the prospect was in that season giving me something like this:
GP 18 GP 19 GP 20 GP 21 GP 22 GP 23 50 14.0 13.0 14.0 NaN NaN NaN 51 14.0 14.0 14.0 NaN NaN NaN 53 13.0 12.0 11.0 NaN NaN NaN 56 10.0 13.0 9.0 13.0 NaN NaN 59 10.0 13.0 15.0 NaN NaN NaN 61 NaN NaN 11.0 11.0 NaN NaN 66 NaN 12.0 13.0 12.0 2.0 13.0
What I want to do at the moment is to be able to take the last year of college production and put it into a new column (for 17 different statistics). I’ve therefore defined the following function:
def grab_latest(row): if row[f'{column} 23'] != 'NaN': return row[f'{column} 23'] elif row[f'{column} 22'] != 'NaN': return row[f'{column} 22'] elif row[f'{column} 21'] != 'NaN': return row[f'{column} 21'] elif row[f'{column} 20'] != 'NaN': return row[f'{column} 20'] elif row[f'{column} 19'] != 'NaN': return row[f'{column} 19'] elif row[f'{column} 18'] != 'NaN': return row[f'{column} 18']
Which I think should go backwards through the columns until I find a value which isn’t NaN, and then take that value as the output. I’ve then defined the columns via a list:
stats_list = ['% Yd Rec', 'PPR PPG', 'PPR', 'GP', 'Rush Att', 'Rush Yd', 'Rush TD', 'MS Rush Att', 'MS Rush Yd', 'MS Rush TD', 'REC', 'REC Yd', 'REC TD', 'MS REC', 'MS Rec Yd', 'MS REC TD', 'TOT DOM + TD MS']
and have then run the function through a for loop based on this list:
for column in stats_list: RB_df[f'{column} Last'] = RB_df.apply(lambda row: grab_latest(row), axis=1)
The result I’m getting back is a slightly bizarre one – the for loop appears to work, as all the new columns I’m expecting are created, however they are only populated with data where the player had an age 23 season. The remainder of indexes are filled with ‘NaN’:
GP Last 50 NaN 51 NaN 53 NaN 56 NaN 59 NaN 61 NaN 66 13.0
This suggests to me that the first ‘if’ statement in my function is working fine, but that all of the ‘elif’ statements aren’t triggering and I can’t work out why. I’m wondering whether it’s because I need to be more explicit about why they would trigger, rather than just relying on a logical test of ‘if the column is not, not equal to NaN, go to the next one’, or if I’m misunderstanding the elif aspect all together. I’ve put the whole segment of code in also, just because when I’ve run into issues so far the problem has often not been where I originally thought.
By all means tell me if you think I’ve gone about this in a weird way – this just seemed like a logical approach to the problem but open to other ways of getting the desired result.
Thanks in advance!
Advertisement
Answer
Use DataFrame.filter
with regex parameter with ^
for start of strings, then forward filling missing values and last select last value by position:
#for test one value in list stats_list = ['GP'] for column in stats_list: df[f'{column} Last'] = df.filter(regex=rf'^{column}').ffill(axis=1).iloc[:, -1] print (df) GP 18 GP 19 GP 20 GP 21 GP 22 GP 23 GP Last 50 14.0 13.0 14.0 NaN NaN NaN 14.0 51 14.0 14.0 14.0 NaN NaN NaN 14.0 53 13.0 12.0 11.0 NaN NaN NaN 11.0 56 10.0 13.0 9.0 13.0 NaN NaN 13.0 59 10.0 13.0 15.0 NaN NaN NaN 15.0 61 NaN NaN 11.0 11.0 NaN NaN 11.0 66 NaN 12.0 13.0 12.0 2.0 13.0 13.0
Another alternative solution with Series.str.startswith
for select columns:
stats_list = ['GP'] for column in stats_list: mask = df.columns.str.startswith(column) df[f'{column} Last'] = df.loc[:, mask].ffill(axis=1).iloc[:, -1]