What is the most pandastic way to create running total columns at various levels (without iterating over the rows)?
input:
import pandas as pd
import numpy as np

df = pd.DataFrame()
df['test'] = np.nan, np.nan, 'X', 'X', 'X', 'X', np.nan, 'X', 'X', 'X', 'X', 'X', 'X', np.nan, np.nan, 'X', 'X'
df['desired_output_level_1'] = np.nan, np.nan, '1', '1', '1', '1', np.nan, '2', '2', '2', '2', '2', '2', np.nan, np.nan, '3', '3'
df['desired_output_level_2'] = np.nan, np.nan, '1', '2', '3', '4', np.nan, '1', '2', '3', '4', '5', '6', np.nan, np.nan, '1', '2'
output:
   test desired_output_level_1 desired_output_level_2
0   NaN                    NaN                    NaN
1   NaN                    NaN                    NaN
2     X                      1                      1
3     X                      1                      2
4     X                      1                      3
5     X                      1                      4
6   NaN                    NaN                    NaN
7     X                      2                      1
8     X                      2                      2
9     X                      2                      3
10    X                      2                      4
11    X                      2                      5
12    X                      2                      6
13  NaN                    NaN                    NaN
14  NaN                    NaN                    NaN
15    X                      3                      1
16    X                      3                      2
The test column can only contain X’s or NaNs. The number of consecutive X’s is random.
In the ‘desired_output_level_1’ column, I am trying to count the series of X’s, giving every row in a series the number of that series.
In the ‘desired_output_level_2’ column, I am trying to find the duration of each series, i.e. a running count within the series.
Can anyone help? Thanks in advance.
Answer
Perhaps not the most pandastic way, but seems to yield what you are after.
Three key points:
- We operate only on the rows that are not NaN, so let’s create a mask:
mask = df['test'].notna()
- For the level 1 computation, we can detect where each series of X’s begins by shifting the column one row and comparing: a new series starts right after a row that is NaN and is followed by a non-NaN row.

df.loc[mask, "level_1"] = (df["test"].isna() & df["test"].shift(-1).notna()).cumsum()
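One caveat worth noting (my addition, not part of the original answer): shift(-1) tests the NaN row *before* each series, so a series that starts on the very first row would never be counted. A sketch of an equivalent start-of-series test that compares each row with the previous one and handles that edge case:

```python
import pandas as pd
import numpy as np

# toy frame (my own example) where a series of X's starts on row 0
df = pd.DataFrame({"test": ["X", "X", np.nan, "X"]})
mask = df["test"].notna()

# a series starts on any non-NaN row whose previous row is NaN (or does not exist)
starts = df["test"].notna() & df["test"].shift().isna()
df.loc[mask, "level_1"] = starts.cumsum()
```

Because shift() on row 0 yields NaN, the first row correctly counts as the start of series 1 even without a preceding NaN row.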
- For the level 2 computation, it’s a bit trickier. One way to do it is to run the computation for each level_1 group and use .transform to preserve the indexing:

df.loc[mask, "level_2"] = (
    df.loc[mask, ["level_1"]]
    .assign(level_2=1)
    .groupby("level_1")["level_2"]
    .transform("cumsum")
)
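For comparison, a shorter variant of the same idea (my own addition, not from the answer): groupby(...).cumcount() numbers the rows within each group starting at 0, so adding 1 gives the same within-series counter without the assign/transform step:

```python
import pandas as pd
import numpy as np

# small toy frame (my own example) with two series of X's
df = pd.DataFrame({"test": [np.nan, "X", "X", np.nan, "X"]})
mask = df["test"].notna()

# level 1 as in the answer: count series starts via the shifted comparison
df.loc[mask, "level_1"] = (df["test"].isna() & df["test"].shift(-1).notna()).cumsum()

# cumcount numbers rows within each group from 0, so add 1 for a 1-based counter
df.loc[mask, "level_2"] = df.loc[mask].groupby("level_1").cumcount() + 1
```

The .loc[mask] on the left-hand side keeps the NaN gap rows untouched, exactly as in the transform version.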
The last step (if needed) is to transform the columns to strings:

df['level_1'] = df['level_1'].astype('Int64').astype('str')
df['level_2'] = df['level_2'].astype('Int64').astype('str')
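Putting it all together, a self-contained sketch that reproduces the steps above on the question’s data (checking the numeric results before the optional string conversion):

```python
import pandas as pd
import numpy as np

# build the example frame from the question
df = pd.DataFrame()
df["test"] = [np.nan, np.nan, "X", "X", "X", "X", np.nan,
              "X", "X", "X", "X", "X", "X", np.nan, np.nan, "X", "X"]

mask = df["test"].notna()

# level 1: number each series of X's via the shifted NaN-to-X comparison
df.loc[mask, "level_1"] = (df["test"].isna() & df["test"].shift(-1).notna()).cumsum()

# level 2: running count within each series, index-aligned via transform
df.loc[mask, "level_2"] = (
    df.loc[mask, ["level_1"]]
    .assign(level_2=1)
    .groupby("level_1")["level_2"]
    .transform("cumsum")
)
```

The masked rows now match the desired_output columns from the question, and the gap rows remain NaN.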