How do I reduce a set of columns along another set of columns, holding all other columns?

I think this is a simple operation, but for some reason I’m not finding immediate indicators in my quick perusal of the Pandas docs.

I have prototype working code below, but it seems kinda dumb IMO. I’m sure that there are much better ways to do this, and concepts to describe it.

Is there a better way? If not, is there at least a better way to describe it?

Abstract Problem

Basically, I have columns p0, p1, y0, y1, and some others (...). The ... columns are just things I’d like held constant (kept as separate columns in the table). p0, p1 are the columns I’d like to reduce along. y0, y1 are the columns I’d like to be reduced.

DataFrame.groupby didn’t seem like what I wanted. When perusing the code, I wasn’t sure if anything else was what I wanted. Multi-indexing also seemed like a possible context, but I didn’t immediately see an example of what I desired.

Here’s the code that does what I want:

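from collections import defaultdict

import pandas as pd

def merge_into(*, from_, to_):
    for k, v in from_.items():
        to_[k] = v

def reduce_along(df, along_cols, reduce_cols, df_reduce=pd.DataFrame.mean):
    # Ordered list rather than a set: recent pandas versions reject set indexers
    hold_cols = [c for c in df.columns if c not in along_cols and c not in reduce_cols]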
    # dumb way to remember Dict[HeldValues, ValuesToReduce]
    to_reduce_map = defaultdict(list)
    for i in range(len(df)):
        row = df.iloc[i]
        # can I instead use a series? is that hashable?
        key = tuple(row[hold_cols])
        to_reduce = row[reduce_cols]
        to_reduce_map[key].append(to_reduce)
    rows = []
    for key, to_reduce_list in to_reduce_map.items():
        # ... yuck?
        row = pd.Series({k: v for k, v in zip(hold_cols, key)})
        reduced = df_reduce(pd.DataFrame(to_reduce_list))
        merge_into(from_=reduced, to_=row)
        rows.append(row)
    return pd.DataFrame(rows)

reducto = reduce_along(summary, ["p0", "p1"], ["y0", "y1"])
display(reducto)

Background

I am running some sweeps for ML stuff; in it, I sweep on some model architecture param, as well as dataset size and the seed that controls random initialization of the model parameters.

I’d like to reduce along the seed to get a “feel” for what architectures are possibly more robust to initialization; for now, I’d like to see what dataset size helps the most. In the future, I’d like to do (heuristic) reduction along dataset size as well.

Answer

Actually, looks like DataFrame.groupby(hold_cols).agg({k: ["mean"] for k in reduce_cols}) is what I want. Source: https://jamesrledoux.com/code/group-by-aggregate-pandas
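
As a minimal sketch of that one-liner, here it is on a toy table. The column names (arch, size, seed) and the values are made up for illustration and are not from my actual sweep; seed plays the role of the column being reduced along, so it is simply left out of both lists.

```python
import pandas as pd

# Hypothetical sweep results: arch and size are held, seed is reduced along
summary = pd.DataFrame({
    "arch": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "size": [10, 10, 20, 20, 10, 10, 20, 20],
    "seed": [0, 1, 0, 1, 0, 1, 0, 1],
    "y0": [1.0, 3.0, 2.0, 4.0, 5.0, 7.0, 6.0, 8.0],
    "y1": [0.0, 2.0, 1.0, 1.0, 4.0, 4.0, 3.0, 5.0],
})

hold_cols = ["arch", "size"]
reduce_cols = ["y0", "y1"]

# One row per (arch, size) pair, with y0 and y1 averaged over the seeds
reduced = summary.groupby(hold_cols).agg({k: ["mean"] for k in reduce_cols})
reduced = reduced.reset_index()
print(reduced)
```

The result has one row per distinct combination of the held columns, and the reduced columns carry a second header level naming the aggregation.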

import functools

import numpy as np

# See: https://stackoverflow.com/a/47699378/7829525
std = functools.partial(np.std)

def reduce_along(df, along_cols, reduce_cols, agg=[np.mean, std]):
    hold_cols = list(set(df.columns) - set(along_cols) - set(reduce_cols))
    hold_cols = [x for x in df.columns if x in hold_cols]  # Preserve order
    # From: https://jamesrledoux.com/code/group-by-aggregate-pandas
    # Apply every function in agg to each of the reduced columns
    df = df.groupby(hold_cols).agg({k: agg for k in reduce_cols})
    df = df.reset_index()
    return df
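
One thing to be aware of: aggregating with a list of functions yields two-level column labels such as ("y0", "mean"). If flat names are easier to work with, one possible way to collapse them is sketched below. The toy data and the "_" naming scheme are my own assumptions, and the string "std" here is the pandas built-in (sample standard deviation), which is not the same as the np.std wrapper above.

```python
import pandas as pd

# Hypothetical data: arch is held, seed is reduced along
df = pd.DataFrame({
    "arch": ["a", "a", "b", "b"],
    "seed": [0, 1, 0, 1],
    "y0": [1.0, 3.0, 5.0, 7.0],
})

out = df.groupby(["arch"]).agg({"y0": ["mean", "std"]}).reset_index()

# After reset_index the labels are tuples like ("arch", "") and ("y0", "mean");
# join the non-empty parts to get flat names such as "y0_mean"
out.columns = ["_".join(part for part in col if part) for col in out.columns]
print(out)
```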