Converting a dataframe with a line separator

Question

I wrote a function that accepts a dataframe as input:

And returns a dataframe, where a certain delimiter number (in the example, it is 6) is passed as a parameter:

Here's what I got:

How can I simplify the function and make it more versatile? How do I make the function faster? Thanks.

Accepted Answer

You can do this concisely with `np.split()` and `df.explode()`:

```python
import numpy as np

sep = 6
# Split each array into pieces that end with sep
df['array'] = df['array'].apply(lambda a:
    np.split(a, 1 + np.where(np.array(a) == sep)[0][:-1]))
# Give each piece its own row
df = df.set_index('string').explode('array').reset_index()

#   string               array
# 0    xxx  [1, 2, 3, 4, 5, 6]
# 1    xxx        [1, 2, 3, 6]
# 2    xxx                 [6]
# 3    xxx     [2, 2, 3, 5, 6]
# 4    yyy              [2, 6]
# 5    yyy                 [6]
```

Explanation for `np.split()` and `np.where()`

We use `np.where()` to find the indexes of `sep`:

```python
a = [1, 2, 3, 4, 5, 6, 1, 2, 3, 6, 6, 2, 2, 3, 5, 6]
sep = 6
np.where(np.array(a) == sep)[0]
# array([ 5,  9, 10, 15])
```

However, `np.split()` splits at each index, so the element at that index starts the next piece, which puts `sep` at the beginning of each split:

```python
np.split(a, np.where(np.array(a) == sep)[0])
# [array([1, 2, 3, 4, 5]),
#  array([6, 1, 2, 3]),
#  array([6]),
#  array([6, 2, 2, 3, 5]),
#  array([6])]
```

Instead, OP wants to split just after each `sep` so that it stays at the end of each split. So we shift the splitting indexes by one (`1 +`) and drop the last one (`[:-1]`), which, because the array ends with `sep`, would now equal the length of the array and only produce an empty trailing piece:

```python
np.split(a, 1 + np.where(np.array(a) == sep)[0][:-1])
# [array([1, 2, 3, 4, 5, 6]),
#  array([1, 2, 3, 6]),
#  array([6]),
#  array([2, 2, 3, 5, 6])]
```
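For reuse, the approach above could be wrapped in a function along these lines. This is only a rough sketch: the name `split_on_sep`, the `col` parameter, and the sample dataframe are assumptions of mine (the input is reconstructed from the output shown in the answer), not OP's original code. Unlike the `[:-1]` version, it discards a cut only when it lands at the very end of the array, so an array that does not end with `sep` keeps its trailing piece.

```python
import numpy as np
import pandas as pd


def split_on_sep(df, sep, col="array"):
    """Split each array in `col` after every `sep`, one piece per row.

    Hypothetical helper: the name and signature are illustrative only.
    """

    def split_one(values):
        arr = np.asarray(values)
        # Cut positions sit just after each separator.
        cuts = np.flatnonzero(arr == sep) + 1
        # A cut at the very end would only yield an empty trailing piece.
        cuts = cuts[cuts < len(arr)]
        return [piece.tolist() for piece in np.split(arr, cuts)]

    out = df.copy()  # leave the caller's dataframe untouched
    out[col] = out[col].apply(split_one)
    # explode() turns each piece into its own row and repeats the other columns.
    return out.explode(col, ignore_index=True)


# Assumed sample input, reconstructed from the answer's output.
df = pd.DataFrame({
    "string": ["xxx", "yyy"],
    "array": [
        [1, 2, 3, 4, 5, 6, 1, 2, 3, 6, 6, 2, 2, 3, 5, 6],
        [2, 6, 6],
    ],
})

result = split_on_sep(df, sep=6)
print(result)
```

Passing `sep` and `col` as arguments keeps the function independent of the column names used in the example, and `ignore_index=True` (available in pandas 1.1 and later) replaces the `set_index`/`reset_index` round trip.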
