
How to efficiently combine multiple pandas columns into one array-like column?

It is easy to create (or load) a DataFrame that has an object-typed column holding lists, like so:

[In]: pdf = pd.DataFrame({
                     "a": [1, 2, 3], 
                     "b": [4, 5, 6], 
                     "c": [7, 8, 9], 
                     "combined": [[1, 4, 7], [2, 5, 8], [3, 6, 9]]}
      )

[Out]
   a  b  c   combined
0  1  4  7  [1, 4, 7]
1  2  5  8  [2, 5, 8]
2  3  6  9  [3, 6, 9]

I have values in separate columns that I need to return as a single column, and I need to do so efficiently. Is there a fast way to combine several columns into a single object-typed column?

In the example above, that means I already have columns a, b, and c and want to create combined.

I could not find a similar question online; feel free to link one if this is a duplicate.


Answer

On large data, going through NumPy is much faster than the other approaches.

Update: NumPy with a list comprehension is much faster still; it takes only about 0.77 s on the 3,000,000-row benchmark below.

pdf['combined'] = [x for x in pdf[['a', 'b', 'c']].to_numpy()]
# pdf['combined'] = pdf[['a', 'b', 'c']].to_numpy().tolist()
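
The two variants above do not store the same kind of object in each cell, which may matter depending on what consumes the column. The following is a minimal sketch on the question's three-row frame; the names as_arrays and as_lists are only illustrative. Note also that to_numpy() converts the selected columns to one common dtype, so columns of mixed types would be upcast before being combined.

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
cols = ["a", "b", "c"]

# Variant 1: iterating over the 2-D array yields one NumPy array per row.
as_arrays = pdf.copy()
as_arrays["combined"] = [row for row in pdf[cols].to_numpy()]

# Variant 2: tolist() turns the 2-D array into nested plain Python lists.
as_lists = pdf.copy()
as_lists["combined"] = pdf[cols].to_numpy().tolist()

# Inspect the type stored in the first cell of each variant.
print(type(as_arrays["combined"].iloc[0]))
print(type(as_lists["combined"].iloc[0]))
```

If the result has to match the list-valued column from the question exactly, the tolist() variant is probably the closer fit.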

Comparison of speed

import pandas as pd
import sys
import time

def f1():
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000,  "b": [4, 5, 6]*1000000,  "c": [7, 8, 9]*1000000})
    s0 = time.time()
    pdf.assign(combined=pdf.agg(list, axis=1))
    print(time.time() - s0)

def f2():
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000,  "b": [4, 5, 6]*1000000,  "c": [7, 8, 9]*1000000})
    s0 = time.time()
    pdf['combined'] = [x for x in pdf[['a', 'b', 'c']].to_numpy()]
    # pdf['combined'] = pdf[['a', 'b', 'c']].to_numpy().tolist()
    print(time.time() - s0)

def f3():
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000,  "b": [4, 5, 6]*1000000,  "c": [7, 8, 9]*1000000})
    s0 = time.time()
    cols = ['a', 'b', 'c']
    pdf['combined'] = pdf[cols].apply(lambda row: list(row.values), axis=1)
    print(time.time() - s0)

def f4():
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000,  "b": [4, 5, 6]*1000000,  "c": [7, 8, 9]*1000000})
    s0 = time.time()
    pdf["combined"] = pdf.apply(pd.Series.tolist,axis=1)
    print(time.time() - s0)

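if __name__ == '__main__':
    # Look the benchmark up by name instead of eval-ing command-line input.
    {'f1': f1, 'f2': f2, 'f3': f3, 'f4': f4}[sys.argv[1]]()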
➜   python test.py f1
17.766116857528687
➜   python test.py f2
0.7762737274169922
➜   python test.py f3
14.403311252593994
➜   python test.py f4
12.631694078445435
User contributions licensed under: CC BY-SA