How to efficiently combine multiple pandas columns into one array-like column?

Question

It is easy to create (or load) a DataFrame with something like an object-typed column. I currently have, as separate columns, values that I need to return as a single column, and I need to do so efficiently. Is there a fast and efficient way to combine columns into a single

Accepted Answer

Using numpy on large data is much faster than the other approaches.

Update: numpy with a list comprehension is much faster; it takes only 0.77 s.

```python
pdf['combined'] = [x for x in pdf[['a', 'b', 'c']].to_numpy()]
# pdf['combined'] = pdf[['a', 'b', 'c']].to_numpy().tolist()
```

Comparison of speed

```python
import pandas as pd
import sys
import time


def f1():
    # DataFrame.agg(list, axis=1): builds a list per row
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000,  "b": [4, 5, 6]*1000000,  "c": [7, 8, 9]*1000000})
    s0 = time.time()
    pdf.assign(combined=pdf.agg(list, axis=1))
    print(time.time() - s0)


def f2():
    # numpy array plus a list comprehension over its rows
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000,  "b": [4, 5, 6]*1000000,  "c": [7, 8, 9]*1000000})
    s0 = time.time()
    pdf['combined'] = [x for x in pdf[['a', 'b', 'c']].to_numpy()]
    # pdf['combined'] = pdf[['a', 'b', 'c']].to_numpy().tolist()
    print(time.time() - s0)


def f3():
    # row-wise apply with a lambda
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000,  "b": [4, 5, 6]*1000000,  "c": [7, 8, 9]*1000000})
    s0 = time.time()
    cols = ['a', 'b', 'c']
    pdf['combined'] = pdf[cols].apply(lambda row: list(row.values), axis=1)
    print(time.time() - s0)


def f4():
    # row-wise apply with pd.Series.tolist
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000,  "b": [4, 5, 6]*1000000,  "c": [7, 8, 9]*1000000})
    s0 = time.time()
    pdf["combined"] = pdf.apply(pd.Series.tolist, axis=1)
    print(time.time() - s0)


if __name__ == '__main__':
    # Run the function named on the command line, e.g. "python test.py f2"
    eval(f'{sys.argv[1]}()')
```

```
➜ python test.py f1
17.766116857528687
➜ python test.py f2
0.7762737274169922
➜ python test.py f3
14.403311252593994
➜ python test.py f4
12.631694078445435
```
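The two numpy one-liners in the answer above are not quite interchangeable in what each cell of the new column ends up holding. A minimal sketch on a small, made-up three-row frame (the column names `as_arrays` and `as_lists` are illustrative and not part of the answer) suggests the difference: the list comprehension stores one numpy array per row, while `.tolist()` stores plain Python lists.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the answer's benchmark uses 3,000,000 rows.
pdf = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
cols = ["a", "b", "c"]

# Variant 1: iterating over the 2-D array yields one 1-D numpy array per row.
pdf["as_arrays"] = [row for row in pdf[cols].to_numpy()]

# Variant 2: tolist() converts the 2-D array into nested Python lists.
pdf["as_lists"] = pdf[cols].to_numpy().tolist()

first_array = pdf["as_arrays"].iloc[0]  # a numpy.ndarray
first_list = pdf["as_lists"].iloc[0]    # a plain Python list

print(type(first_array).__name__, type(first_list).__name__)
```

Which variant fits better probably depends on what consumes the column afterwards: arrays are convenient for further numeric work, while plain lists tend to serialize more predictably (for example to JSON).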

Answer