It is easy to create (or load) a DataFrame with an object-typed column, like so:
[In]: pdf = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "c": [7, 8, 9],
    "combined": [[1, 4, 7], [2, 5, 8], [3, 6, 9]]}
)

[Out]
   a  b  c   combined
0  1  4  7  [1, 4, 7]
1  2  5  8  [2, 5, 8]
2  3  6  9  [3, 6, 9]
I am currently in the position where I have, as separate columns, values that I am required to return as a single column, and need to do so quite efficiently. Is there a fast and efficient way to combine columns into a single object-type column?
In the example above, this would mean already having columns a, b, and c, and I wish to create combined.
I could not find a similar question online; feel free to link one if this is a duplicate.
Answer
On large data, going through NumPy is much faster than the other approaches.
Update: NumPy with a list comprehension is much faster still; it takes only about 0.77 s in the comparison below.
pdf['combined'] = [x for x in pdf[['a', 'b', 'c']].to_numpy()]
# pdf['combined'] = pdf[['a', 'b', 'c']].to_numpy().tolist()
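The two lines above do not store quite the same thing: iterating over the 2-D array yields one NumPy array per row, while `.tolist()` yields plain Python lists, as in the `combined` column from the question. Here is a minimal sketch of that difference, assuming pandas and NumPy are installed; the helper name `combine_columns` is hypothetical and not part of either library.

```python
import numpy as np
import pandas as pd


def combine_columns(df, cols, as_list=True):
    """Return one entry per row holding that row's values from `cols`.

    as_list=True  -> each entry is a plain Python list (ndarray.tolist()).
    as_list=False -> each entry is a 1-D NumPy array (iterating the 2-D array).
    """
    values = df[cols].to_numpy()  # one 2-D array covering all selected columns
    if as_list:
        return values.tolist()
    return [row for row in values]


pdf = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# Python lists, matching the "combined" column shown in the question
pdf["combined"] = combine_columns(pdf, ["a", "b", "c"])

# Same values, but each row is kept as a NumPy array instead of a list
row_arrays = combine_columns(pdf, ["a", "b", "c"], as_list=False)
```

Note that `to_numpy()` on several columns produces a single array with a common dtype, so columns of mixed types (for example integers and strings) would be upcast before being combined.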
Comparison of speed
import pandas as pd
import sys
import time

def f1():
    # DataFrame.agg with list, applied row by row
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000, "b": [4, 5, 6]*1000000, "c": [7, 8, 9]*1000000})
    s0 = time.time()
    pdf.assign(combined=pdf.agg(list, axis=1))
    print(time.time() - s0)

def f2():
    # NumPy array plus a list comprehension (or .tolist())
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000, "b": [4, 5, 6]*1000000, "c": [7, 8, 9]*1000000})
    s0 = time.time()
    pdf['combined'] = [x for x in pdf[['a', 'b', 'c']].to_numpy()]
    # pdf['combined'] = pdf[['a', 'b', 'c']].to_numpy().tolist()
    print(time.time() - s0)

def f3():
    # DataFrame.apply with a lambda over each row
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000, "b": [4, 5, 6]*1000000, "c": [7, 8, 9]*1000000})
    s0 = time.time()
    cols = ['a', 'b', 'c']
    pdf['combined'] = pdf[cols].apply(lambda row: list(row.values), axis=1)
    print(time.time() - s0)

def f4():
    # DataFrame.apply with pd.Series.tolist over each row
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000, "b": [4, 5, 6]*1000000, "c": [7, 8, 9]*1000000})
    s0 = time.time()
    pdf["combined"] = pdf.apply(pd.Series.tolist, axis=1)
    print(time.time() - s0)

if __name__ == '__main__':
    # Run the function named on the command line, e.g. `python test.py f2`
    {'f1': f1, 'f2': f2, 'f3': f3, 'f4': f4}[sys.argv[1]]()
➜ python test.py f1
17.766116857528687
➜ python test.py f2
0.7762737274169922
➜ python test.py f3
14.403311252593994
➜ python test.py f4
12.631694078445435
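Each figure above comes from a single run, so some noise is to be expected. Below is a rough sketch of how the measurement could be repeated with the standard-library `timeit` module, keeping the best of several runs. The function names (`make_frame`, `via_numpy`, `via_apply`, `best_times`) are hypothetical, only two of the four approaches are included, and the default frame is much smaller than the 3-million-row one used above, so the absolute numbers will not match.

```python
import timeit

import pandas as pd


def make_frame(n_repeats=1_000):
    """Build a frame shaped like the benchmark's, but much smaller by default."""
    return pd.DataFrame({
        "a": [1, 2, 3] * n_repeats,
        "b": [4, 5, 6] * n_repeats,
        "c": [7, 8, 9] * n_repeats,
    })


def via_numpy(df):
    # Convert the selected columns to one 2-D array, then to a list of lists
    return df[["a", "b", "c"]].to_numpy().tolist()


def via_apply(df):
    # Row-wise apply; returns a Series whose elements are lists
    return df.apply(pd.Series.tolist, axis=1)


def best_times(df, funcs, repeat=3):
    """Return {function name: best wall-clock seconds over `repeat` single runs}."""
    return {
        f.__name__: min(timeit.repeat(lambda: f(df), number=1, repeat=repeat))
        for f in funcs
    }


if __name__ == "__main__":
    frame = make_frame()
    for name, seconds in best_times(frame, [via_numpy, via_apply]).items():
        print(f"{name}: {seconds:.4f} s")
```

Taking the minimum over a few runs reduces the influence of unrelated background load on the machine; passing a larger `n_repeats` to `make_frame` moves the test closer to the size used in the comparison above.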