
Divide into groups according to the specified attribute

I need to group the rows so that adjacent values in column a1 belong to the same group when the difference between them equals a pre-specified value. If the difference between two adjacent elements is anything else, a new group starts at the second element. For example, I have the following data table:

import pandas as pd
import numpy as np

data = [
    [5, 2],
    [100, 23],
    [101, -2],
    [303, 9],
    [304, 4],
    [709, 14],
    [710, 3],
    [711, 3],
    [988, 21]
]
columns = ['a1', 'a2']
df = pd.DataFrame(data=data, columns=columns)

If the required difference between adjacent elements of column a1 is one, then the answer for this example is:

[[0], [1, 2], [3, 4], [5, 6, 7], [8]]

The output list stores indexes that correspond to rows from df.

It may also help that column a1 is sorted. Thank you for your help!


Answer

Assuming that your data frame is sorted by a1 and that I understood your problem correctly, I think you could do something like this:

import pandas as pd
import numpy as np
from numba import njit

data = [
    [5, 2],
    [100, 23],
    [101, -2],
    [303, 9],
    [304, 4],
    [709, 14],
    [710, 3],
    [711, 3],
    [988, 21]
]
columns = ['a1', 'a2']
df = pd.DataFrame(data=data, columns=columns)

@njit
def get_groups(vals):
    # Assign a group label to every row. The label increases by one
    # whenever the gap to the next value is not exactly 1.
    counter = 0
    group = []
    for i in range(len(vals) - 1):
        group.append(counter)
        if vals[i + 1] - vals[i] != 1:
            counter += 1
    # After the loop, counter already holds the label of the last row.
    if len(vals) > 0:
        group.append(counter)
    return group
    
groups = get_groups(df['a1'].values)
assert len(groups) == len(df)

df['group'] = groups
final_ls = df.reset_index().groupby(['group']).agg({'index': list})['index'].to_list()
final_ls

------------------------------------------------------------
[[0], [1, 2], [3, 4], [5, 6, 7], [8]]
------------------------------------------------------------

The njit decorator from numba compiles the function to machine code, so the explicit loop stays fast even on large arrays.
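
If you would rather avoid the numba dependency, the same grouping can probably be expressed with pandas alone. The following is only a sketch under the same assumption that a1 is sorted; the function name group_indexes and its column and step parameters are made up for illustration and are not part of the original answer. It marks every row whose gap to the previous row differs from step, and the running sum of those marks serves as the group label.

```python
import pandas as pd

def group_indexes(df, column="a1", step=1):
    """Return lists of row indexes; a new group starts where the gap is not `step`.

    Hypothetical helper for illustration, assuming `column` is sorted.
    """
    # diff() gives NaN for the first row; NaN != step, so the first row opens group 1.
    labels = df[column].diff().ne(step).cumsum()
    # Collect the index values of each group, in label order.
    return df.index.to_series().groupby(labels).agg(list).tolist()

df = pd.DataFrame(
    {
        "a1": [5, 100, 101, 303, 304, 709, 710, 711, 988],
        "a2": [2, 23, -2, 9, 4, 14, 3, 3, 21],
    }
)
print(group_indexes(df))  # [[0], [1, 2], [3, 4], [5, 6, 7], [8]]
```

The step argument covers the "pre-specified value" from the question, for example group_indexes(df, step=2) to group values that are two apart.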

User contributions licensed under: CC BY-SA