Skip to content
Advertisement

Merge lists in a dataframe column if they share a common value

What I need:

I have a dataframe where the elements of a column are lists. There are no duplications of elements in a list. For example, a dataframe like the following:

import pandas as pd

>>d = {'col1': [[1, 2, 4, 8], [15, 16, 17], [18, 3], [2, 19], [10, 4]]}
>>df = pd.DataFrame(data=d)

           col1
0  [1, 2, 4, 8]
1  [15, 16, 17]
2       [18, 3]
3       [2, 19]
4       [10, 4]

I would like to obtain a dataframe where, if at least a number contained in a list at row i is also contained in a list at row j, then the two list are merged (without duplication). But the values could also be shared by more than two lists, in that case I want all lists that share at least a value to be merged.

                   col1
0  [1, 2, 4, 8, 19, 10]
1          [15, 16, 17]
2               [18, 3]

The order of the rows of the output dataframe, nor the values inside a list is important.


What I tried:

I have found this answer, that shows how to tell if at least one item in list is contained in another list, e.g.

>>not set([1, 2, 4, 8]).isdisjoint([2, 19])
True

Returns True, since 2 is contained in both lists.

I have also found this useful answer that shows how to compare each row of a dataframe with each other. The answer applies a custom function to each row of the dataframe using a lambda.

df.apply(lambda row: func(row['col1']), axis=1)

However I’m not sure how to put this two things together, how to create the func method. Also I don’t know if this approach is even feasible since the resulting rows will probably be less than the ones of the original dataframe.

Thanks!

Advertisement

Answer

This is not straightforward. Merging lists has many pitfalls.

One solid approach is to use a specialized library, for example networkx to use a graph approach. You can generate successive edges and find the connected components.

Here is your graph:

networkx graph list merging

You can thus:

  • generate successive edges with add_edges_from
  • find the connected_components
  • craft a dictionary and map the first item of each list
  • groupby and merge the lists (you could use the connected components directly but I’m giving a pandas solution in case you have more columns to handle)
import networkx as nx

G = nx.Graph()
for l in df['col1']:
    G.add_edges_from(zip(l, l[1:]))

groups = {k:v for v,l in enumerate(nx.connected_components(G)) for k in l}
# {1: 0, 2: 0, 4: 0, 8: 0, 10: 0, 19: 0, 16: 1, 17: 1, 15: 1, 18: 2, 3: 2}

out = (df.groupby(df['col1'].str[0].map(groups), as_index=False)
         .agg(lambda x: sorted(set().union(*x)))
       )

output:

                   col1
0  [1, 2, 4, 8, 10, 19]
1          [15, 16, 17]
2               [3, 18]
User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement