I wrote a function that accepts a DataFrame as input:
import pandas as pd

a = {"string": ['xxx', 'yyy'], "array": [[1, 2, 3, 4, 5, 6, 1, 2, 3, 6, 6, 2, 2, 3, 5, 6], [2, 6, 6]]}
df = pd.DataFrame(a)
  string  array
0 xxx     [1, 2, 3, 4, 5, 6, 1, 2, 3, 6, 6, 2, 2, 3, 5, 6]
1 yyy     [2, 6, 6]
It returns a DataFrame in which each array is split after every occurrence of a delimiter number passed as a parameter (6 in this example):
  string  array
0 xxx     [1, 2, 3, 4, 5, 6]
1 xxx     [1, 2, 3, 6]
2 xxx     [6]
3 xxx     [2, 2, 3, 5, 6]
4 yyy     [2, 6]
5 yyy     [6]
Here’s what I got:
def df_conversion(df, sep=None):
    data = {}
    idx = []
    for i in range(df.shape[0]):
        key = df['string'].iloc[i]
        value = df['array'].iloc[i]
        spl = [[]]
        for item in value:
            if item == sep:
                spl[-1].append(item)
                idx.append(key)
                spl.append([])
            else:
                spl[-1].append(item)
        del spl[-1]
        if i == 0: spl_0 = spl
        if i == 1: spl_0.extend(spl)
    data['string'] = idx
    data['array'] = spl_0
    return pd.DataFrame(data)

df_conversion(df, 6)
How can I simplify the function and make it more versatile? How do I make the function faster? Thanks.
Answer
You can do this concisely with np.split() and df.explode():
import numpy as np

sep = 6
df['array'] = df['array'].apply(lambda a:
    np.split(a, 1 + np.where(np.array(a) == sep)[0][:-1]))
df = df.set_index('string').explode('array').reset_index()
#   string  array
# 0 xxx     [1, 2, 3, 4, 5, 6]
# 1 xxx     [1, 2, 3, 6]
# 2 xxx     [6]
# 3 xxx     [2, 2, 3, 5, 6]
# 4 yyy     [2, 6]
# 5 yyy     [6]
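For reuse, the same idea can be wrapped in a function. This is only a sketch: the names split_rows_on_sep and split_after_sep and the col parameter are made up for illustration, and it keeps the assumption that every list ends with sep. It also converts the chunks back to plain lists and leaves the input DataFrame untouched.

```python
import numpy as np
import pandas as pd


def split_rows_on_sep(df, sep, col="array"):
    """Split each list in `col` after every `sep`, giving one chunk per row.

    Hypothetical helper; assumes every list in `col` ends with `sep`.
    """
    def split_after_sep(values):
        arr = np.asarray(values)
        # Cut right after each separator; the last cut point is dropped
        # because it would fall at the very end of the list.
        cut_points = 1 + np.where(arr == sep)[0][:-1]
        return [chunk.tolist() for chunk in np.split(arr, cut_points)]

    out = df.copy()
    out[col] = out[col].apply(split_after_sep)
    # explode() repeats the other columns for every chunk of a row.
    return out.explode(col).reset_index(drop=True)


a = {"string": ["xxx", "yyy"],
     "array": [[1, 2, 3, 4, 5, 6, 1, 2, 3, 6, 6, 2, 2, 3, 5, 6], [2, 6, 6]]}
result = split_rows_on_sep(pd.DataFrame(a), sep=6)
print(result)
```

Because explode() carries every other column along, this version works for any number of rows and does not need set_index()/reset_index() on the string column.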
Explanation for np.split() and np.where()
We use np.where() to find the indexes of sep:
a = [1, 2, 3, 4, 5, 6, 1, 2, 3, 6, 6, 2, 2, 3, 5, 6]
sep = 6
np.where(np.array(a) == sep)[0]
# array([ 5,  9, 10, 15])
However, np.split() cuts immediately before each index it is given, so the element at that index (here, sep) ends up at the beginning of the next chunk:
np.split(a, np.where(np.array(a) == sep)[0])
# [array([1, 2, 3, 4, 5]),
#  array([6, 1, 2, 3]),
#  array([6]),
#  array([6, 2, 2, 3, 5]),
#  array([6])]
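Instead, OP wants to cut after each sep so that sep stays at the end of each chunk. We therefore shift the splitting indexes by one (1 +) and drop the last one ([:-1]): after the shift it equals the length of the array and would only produce an empty trailing chunk. This relies on every array ending with sep, as in the example: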
np.split(a, 1 + np.where(np.array(a) == sep)[0][:-1])
# [array([1, 2, 3, 4, 5, 6]),
#  array([1, 2, 3, 6]),
#  array([6]),
#  array([2, 2, 3, 5, 6])]
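If an array might not end with sep, dropping the last index with [:-1] would remove a real cut point. One possible variation, sketched here with a hypothetical helper split_after_sep, is to discard only a cut point that falls at the very end of the array. Keeping the trailing remainder as its own chunk is an assumption: the function in the question drops it instead.

```python
import numpy as np


def split_after_sep(values, sep):
    """Split `values` after every `sep` (hypothetical helper, not from the answer)."""
    arr = np.asarray(values)
    cut_points = 1 + np.flatnonzero(arr == sep)
    # A cut point equal to len(arr) would only create an empty trailing chunk.
    cut_points = cut_points[cut_points < len(arr)]
    return [chunk.tolist() for chunk in np.split(arr, cut_points)]


ends_with_sep = split_after_sep([1, 2, 6, 6, 3, 6], sep=6)
no_trailing_sep = split_after_sep([1, 2, 6, 3, 4], sep=6)
print(ends_with_sep)    # [[1, 2, 6], [6], [3, 6]]
print(no_trailing_sep)  # [[1, 2, 6], [3, 4]]
```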