Creating new columns within a dataframe, based on the latest value from previous columns

I’ve just completed a beginner’s course in python, so please bear with me if the code below doesn’t make sense or my issue is because of some rookie mistake.

I’ve been trying to put the learning to use by working with college production of NFL players, with a view to understanding which statistics can be predictive or at least correlate to NFL production. It turns out that there’s a lot of data out there so I have about 200 columns of data for 600 odd prospects from the last 20 years (just for running backs so far). However, one of the problems with this data is that each stat is only provided by the age the prospect was in that season giving me something like this:

    GP 18  GP 19  GP 20  GP 21  GP 22  GP 23
50   14.0   13.0   14.0    NaN    NaN    NaN
51   14.0   14.0   14.0    NaN    NaN    NaN
53   13.0   12.0   11.0    NaN    NaN    NaN
56   10.0   13.0    9.0   13.0    NaN    NaN
59   10.0   13.0   15.0    NaN    NaN    NaN
61    NaN    NaN   11.0   11.0    NaN    NaN
66    NaN   12.0   13.0   12.0    2.0   13.0

What I want to do at the moment is to be able to take the last year of college production and put it into a new column (for 17 different statistics). I’ve therefore defined the following function:

def grab_latest(row):
    if row[f'{column} 23'] != 'NaN':
        return row[f'{column} 23']
    elif row[f'{column} 22'] != 'NaN':
        return row[f'{column} 22']
    elif row[f'{column} 21'] != 'NaN':
        return row[f'{column} 21']
    elif row[f'{column} 20'] != 'NaN':
        return row[f'{column} 20']
    elif row[f'{column} 19'] != 'NaN':
        return row[f'{column} 19']
    elif row[f'{column} 18'] != 'NaN':
        return row[f'{column} 18']

Which I think should go backwards through the columns until I find a value which isn’t NaN, and then take that value as the output. I’ve then defined the columns via a list:

stats_list = ['% Yd Rec', 'PPR PPG', 'PPR', 'GP', 'Rush Att', 'Rush Yd', 'Rush TD', 'MS Rush Att', 'MS Rush Yd',
              'MS Rush TD', 'REC', 'REC Yd', 'REC TD', 'MS REC', 'MS Rec Yd', 'MS REC TD', 'TOT DOM + TD MS']

and have then run the function through a for loop based on this list:

for column in stats_list:
    RB_df[f'{column} Last'] = RB_df.apply(lambda row: grab_latest(row), axis=1)

The result I’m getting back is a slightly bizarre one – the for loop appears to work, as all the new columns I’m expecting are created, however they are only populated with data where the player had an age 23 season. The remainder of indexes are filled with ‘NaN’:

    GP Last
50      NaN
51      NaN
53      NaN
56      NaN
59      NaN
61      NaN
66     13.0

This suggests to me that the first ‘if’ statement in my function is working fine, but that all of the ‘elif’ statements aren’t triggering and I can’t work out why. I’m wondering whether it’s because I need to be more explicit about why they would trigger, rather than just relying on a logical test of ‘if the column is not, not equal to NaN, go to the next one’, or if I’m misunderstanding the elif aspect all together. I’ve put the whole segment of code in also, just because when I’ve run into issues so far the problem has often not been where I originally thought.

By all means tell me if you think I’ve gone about this in a weird way – this just seemed like a logical approach to the problem but open to other ways of getting the desired result.

Thanks in advance!

Answer

Use DataFrame.filter with regex parameter with ^ for start of strings, then forward filling missing values and last select last value by position:

#for test one value in list
stats_list = ['GP']
for column in stats_list:
    df[f'{column} Last'] = df.filter(regex=rf'^{column}').ffill(axis=1).iloc[:, -1]

print (df)
    GP 18  GP 19  GP 20  GP 21  GP 22  GP 23  GP Last
50   14.0   13.0   14.0    NaN    NaN    NaN     14.0
51   14.0   14.0   14.0    NaN    NaN    NaN     14.0
53   13.0   12.0   11.0    NaN    NaN    NaN     11.0
56   10.0   13.0    9.0   13.0    NaN    NaN     13.0
59   10.0   13.0   15.0    NaN    NaN    NaN     15.0
61    NaN    NaN   11.0   11.0    NaN    NaN     11.0
66    NaN   12.0   13.0   12.0    2.0   13.0     13.0

Another alternative solution with Series.str.startswith for select columns:

stats_list = ['GP']
for column in stats_list:
    mask = df.columns.str.startswith(column)
    df[f'{column} Last'] = df.loc[:, mask].ffill(axis=1).iloc[:, -1]

Advertisement

Answer