I need to group the data in such a way that if the difference between the adjacent values from column a1 was equal to the same pre-specified value, then they belong to the same group. If the value between two adjacent elements is different, then all subsequent data belong to a different group. For example, I have such a data table
import pandas as pd import numpy as np data = [ [5, 2], [100, 23], [101, -2], [303, 9], [304, 4], [709, 14], [710, 3], [711, 3], [988, 21] ] columns = ['a1', 'a2'] df = pd.DataFrame(data=data, columns=columns)
If the difference between the elements of column a1 is equal to one, then they belong to the same group and the answer in this example will be the following:
[[0], [1, 2], [3, 4], [5, 6, 7], [8]]
The output list stores indexes that correspond to rows from df.
It may also be useful that column a1 is ordered. Thank you for your help!
Advertisement
Answer
Assuming that your data frame is sorted by a1
and that I understood your problem correctly, I think you could do something like this:
import pandas as pd import numpy as np from numba import njit data = [ [5, 2], [100, 23], [101, -2], [303, 9], [304, 4], [709, 14], [710, 3], [711, 3], [988, 21] ] columns = ['a1', 'a2'] df = pd.DataFrame(data=data, columns=columns) @njit def get_groups(vals): counter = 0 group = [] for i in range(len(vals)-1): if vals[i+1]-vals[i] == 1: group.append(counter) else: group.append(counter) counter += 1 if vals[-1] - vals[-2] == 1: group.append(group[-1]) else: group.append(counter + 1) return group groups = get_groups(df['a1'].values) assert len(groups) == len(df) df['group'] = groups final_ls = df.reset_index().groupby(['group']).agg({'index': list})['index'].to_list() final_ls ------------------------------------------------------------ [[0], [1, 2], [3, 4], [5, 6, 7], [8]] ------------------------------------------------------------
The njit
decorator from numba
makes the looping approach efficient.