I wrote a function that accepts a DataFrame as input:
    import pandas as pd

    a = {"string": ['xxx', 'yyy'], "array": [[1,2,3,4,5,6,1,2,3,6,6,2,2,3,5,6], [2,6,6]]}
    df = pd.DataFrame(a)

      string                                             array
    0    xxx  [1, 2, 3, 4, 5, 6, 1, 2, 3, 6, 6, 2, 2, 3, 5, 6]
    1    yyy                                         [2, 6, 6]
It returns a DataFrame in which each array is split after every occurrence of a delimiter number, which is passed as a parameter (6 in this example):
      string               array
    0    xxx  [1, 2, 3, 4, 5, 6]
    1    xxx        [1, 2, 3, 6]
    2    xxx                 [6]
    3    xxx     [2, 2, 3, 5, 6]
    4    yyy              [2, 6]
    5    yyy                 [6]
Here’s what I got:
    def df_conversion(df, sep=None):
        data = {}
        idx = []

        for i in range(df.shape[0]):
            key = df['string'].iloc[i]
            value = df['array'].iloc[i]

            spl = [[]]
            for item in value:
                if item == sep:
                    spl[-1].append(item)
                    idx.append(key)
                    spl.append([])
                else:
                    spl[-1].append(item)

            del spl[-1]
            if i == 0: spl_0 = spl
            if i == 1: spl_0.extend(spl)

        data['string'] = idx
        data['array'] = spl_0

        return pd.DataFrame(data)

    df_conversion(df, 6)
How can I simplify the function and make it more versatile? How do I make the function faster? Thanks.
Answer
You can do this concisely with np.split() and df.explode():
    sep = 6
    df.array = df.array.apply(lambda a:
        np.split(a, 1 + np.where(np.array(a) == sep)[0][:-1]))

    df = df.set_index('string').explode('array').reset_index()

    #   string               array
    # 0    xxx  [1, 2, 3, 4, 5, 6]
    # 1    xxx        [1, 2, 3, 6]
    # 2    xxx                 [6]
    # 3    xxx     [2, 2, 3, 5, 6]
    # 4    yyy              [2, 6]
    # 5    yyy                 [6]
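To address the "more versatile" part of the question, the same idea could be wrapped in a function that takes the column name and the delimiter as parameters. This is only a sketch: the name split_rows and its parameters are made up for illustration, np.flatnonzero stands in for np.where(...)[0], and it assumes pandas 1.1 or later for the ignore_index argument of explode().

```python
import numpy as np
import pandas as pd


def split_rows(df, column, sep):
    """Return a new frame where each list in `column` is split after every `sep`.

    Hypothetical helper; like the snippet above, it assumes every list
    ends with `sep`.
    """
    def split_one(values):
        arr = np.asarray(values)
        # Positions of sep, shifted by one so each piece ends with sep;
        # the last shifted position is dropped, as in the snippet above.
        cuts = 1 + np.flatnonzero(arr == sep)[:-1]
        return np.split(arr, cuts)

    out = df.copy()  # leave the caller's frame untouched
    out[column] = out[column].apply(split_one)
    # One row per piece; the other columns are repeated automatically.
    return out.explode(column, ignore_index=True)


df = pd.DataFrame({
    "string": ["xxx", "yyy"],
    "array": [[1, 2, 3, 4, 5, 6, 1, 2, 3, 6, 6, 2, 2, 3, 5, 6], [2, 6, 6]],
})
result = split_rows(df, "array", 6)
```

Because explode() repeats every other column, this works for any number of rows and extra columns, which the original loop (with its i == 0 and i == 1 special cases) does not.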
Explanation for np.split() and np.where()
We use np.where() to find the indexes of sep:
    a = [1, 2, 3, 4, 5, 6, 1, 2, 3, 6, 6, 2, 2, 3, 5, 6]
    sep = 6
    np.where(np.array(a) == sep)[0]

    # array([ 5,  9, 10, 15])
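The trailing [0] is needed because np.where() called with only a condition returns a tuple holding one index array per dimension. The short sketch below unpacks that step by step; the variable names are illustrative only, and np.flatnonzero is shown as an equivalent shortcut for the one-dimensional case.

```python
import numpy as np

a = [1, 2, 3, 4, 5, 6, 1, 2, 3, 6, 6, 2, 2, 3, 5, 6]
sep = 6

mask = np.array(a) == sep      # boolean array, True where the element equals sep
positions = np.where(mask)     # tuple with one index array (the input is 1-D)
idx = positions[0]             # the index array itself
same = np.flatnonzero(mask)    # same indexes without the tuple
```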
However, np.split() starts a new sub-array at each index it is given, which puts sep at the beginning of each split:
    np.split(a, np.where(np.array(a) == sep)[0])

    # [array([1, 2, 3, 4, 5]),
    #  array([6, 1, 2, 3]),
    #  array([6]),
    #  array([6, 2, 2, 3, 5]),
    #  array([6])]
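Instead, OP wants to split right after each sep so that sep stays at the end of each split. So we shift the splitting indexes by one (1 +) and drop the last one ([:-1]): since the array ends with sep, the last shifted index would equal the length of the array and only produce an empty trailing split: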
    np.split(a, 1 + np.where(np.array(a) == sep)[0][:-1])

    # [array([1, 2, 3, 4, 5, 6]),
    #  array([1, 2, 3, 6]),
    #  array([6]),
    #  array([2, 2, 3, 5, 6])]
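Note that the [:-1] step relies on the array ending with sep. If it does not (for example [1, 6, 2]), the last real split point is dropped and the array comes back in one piece. A possible variant, sketched here with a hypothetical helper named split_after, filters out only the split points that fall at the end of the array. Unlike the loop in the question, which discards anything after the last sep, this sketch keeps the trailing remainder as its own piece.

```python
import numpy as np


def split_after(values, sep):
    """Split `values` after every `sep`, whether or not it ends with `sep`.

    Hypothetical helper for illustration only.
    """
    arr = np.asarray(values)
    cuts = np.flatnonzero(arr == sep) + 1   # split right after each sep
    cuts = cuts[cuts < len(arr)]            # a cut at the very end would add an empty piece
    return np.split(arr, cuts)


ends_with_sep = split_after([2, 6, 6], 6)   # pieces: [2, 6] and [6]
trailing_rest = split_after([1, 6, 2], 6)   # pieces: [1, 6] and [2]
no_sep = split_after([1, 2], 6)             # a single piece: [1, 2]
```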