I have a set of attributes A = {a1, a2, ..., an}, a set of clusters C = {c1, c2, ..., ck}, and a set of correspondences COR, which is a subset of A × C with |COR| << |A × C|. Here is a sample set of correspondences:

```
COR = {(a1, c1), (a1, c2), (a2, c1), (a3, c3), (a4, c4)}
```
Now, I want to generate all the subsets of COR such that each subset, viewed as a set of pairs, represents an injective function from A to C. Let's call each such subset a mapping. The valid mappings from the above COR would be:

```
m1 = {(a1, c1), (a3, c3), (a4, c4)}
m2 = {(a1, c2), (a2, c1), (a3, c3), (a4, c4)}
```
m1 is interesting here because adding any of the remaining elements of COR to m1 would violate either the definition of a function or the injectivity condition. For instance, if we add the pair (a1, c2) to m1, it would no longer be a function, and if we add (a2, c1), it would cease to be injective. So, I am interested in code snippets or an algorithm that I can use to generate all such mappings. Here is what I have tried so far in Python:
```python
import collections
import itertools

corr = set({('a1', 'c1'), ('a1', 'c2'), ('a2', 'c1'), ('a3', 'c3'), ('a4', 'c4')})
clusters = [c[1] for c in corr]
attribs = [a[0] for a in corr]
rep_clusters = [item for item, count in collections.Counter(clusters).items() if count > 1]
rep_attribs = [item for item, count in collections.Counter(attribs).items() if count > 1]
conflicting_sets = []
for c in rep_clusters:
    conflicting_sets.append([p for p in corr if p[1] == c])
for a in rep_attribs:
    conflicting_sets.append([p for p in corr if p[0] == a])
non_conflicting = corr
for s in conflicting_sets:
    non_conflicting = non_conflicting - set(s)
m = set()
for p in itertools.product(*conflicting_sets):
    print(p, 'product', len(p))
    p_attribs = set([k[0] for k in p])
    p_clusters = set([k[1] for k in p])
    print(len(p_attribs), len(p_clusters))
    if len(p) == len(p_attribs) and len(p) == len(p_clusters):
        m.add(frozenset(set(p).union(non_conflicting)))
print(m)
```
As expected, the code produces m2 but not m1, because m1 will never be generated by itertools.product. Can anyone guide me on this? I would also like some guidance on performance, because the actual sets will be larger than the COR set used here and may contain many more conflicting sets.
Answer
A simpler definition of your requirements is:
- You have a set of unique tuples.
- You want to generate all subsets for which:
    - all of the first elements of the tuples are unique (to ensure a function);
    - and all of the second elements are unique (to ensure injectivity).
- Your title suggests you only want the maximal subsets, i.e. it must be impossible to add any additional element from the original set without breaking the other requirements.
I'm also assuming any a<x> or c<y> is unique.
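The two uniqueness conditions in the requirements can be checked directly. As a minimal sketch (`is_valid_mapping` is a hypothetical helper name, not part of the solutions below):

```python
def is_valid_mapping(pairs):
    """Check that a set of (attribute, cluster) pairs is an injective function."""
    domain = [a for a, _ in pairs]  # first elements: must be unique for a function
    image = [c for _, c in pairs]   # second elements: must be unique for injectivity
    return len(set(domain)) == len(domain) and len(set(image)) == len(image)

print(is_valid_mapping({('a1', 'c1'), ('a3', 'c3')}))  # True
print(is_valid_mapping({('a1', 'c1'), ('a1', 'c2')}))  # False: a1 maps to two clusters
print(is_valid_mapping({('a1', 'c1'), ('a2', 'c1')}))  # False: c1 is hit twice
```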
Here’s a solution:
```python
def get_maximal_subsets(corr):
    def is_injective_function(f):
        if not f:
            return False
        f_domain, f_range = zip(*f)
        return len(set(f_domain)) - len(f_domain) + len(set(f_range)) - len(f_range) == 0

    def generate_from(f):
        if is_injective_function(f):
            for r in corr - f:
                if is_injective_function(f | {r}):
                    break
            else:
                yield f
        else:
            for c in f:
                yield from generate_from(f - {c})

    return list(map(set, set(map(frozenset, generate_from(corr)))))


# representing a's and c's as strings, as their actual value doesn't matter, as long as they are unique
print(get_maximal_subsets(corr={('a1', 'c1'), ('a1', 'c2'), ('a2', 'c1'), ('a3', 'c3'), ('a4', 'c4')}))
```
The test is_injective_function checks whether the provided set f represents a valid injective function, by collecting all the values from the domain and range of the function and checking that both contain only unique values.
The generator takes an f and, if it represents a valid injective function, checks that none of the elements removed from the original corr to reach f can be added back in while still having it represent a valid injective function. If that's the case, it yields f as a valid result.
If f isn't a valid injective function to begin with, it tries removing each of the elements of f in turn and generates any valid injective functions from each of those subsets.
Finally, the whole function removes duplicates from the resulting generator and returns it as a list of unique sets.
Output:
```
[{('a1', 'c1'), ('a3', 'c3'), ('a4', 'c4')}, {('a2', 'c1'), ('a3', 'c3'), ('a4', 'c4'), ('a1', 'c2')}]
```
Note: there are several approaches to de-duplicating a list of non-hashable values, but this one turns all the sets in the list into frozensets to make them hashable, turns the list into a set to remove duplicates, then turns the contents back into sets and returns the result as a list.
You can prevent removing duplicates at the end by keeping track of what removed subsets have already been tried, which may perform better depending on your actual data set:
```python
def get_maximal_subsets(corr):
    def is_injective_function(f):
        if not f:
            return False
        f_domain, f_range = zip(*f)
        return len(set(f_domain)) - len(f_domain) + len(set(f_range)) - len(f_range) == 0

    previously_removed = []

    def generate_from(f, removed: set = None):
        previously_removed.append(removed)
        if removed is None:
            removed = set()
        if is_injective_function(f):
            for r in removed:
                if is_injective_function(f | {r}):
                    break
            else:
                yield f
        else:
            for c in f:
                if removed | {c} not in previously_removed:
                    yield from generate_from(f - {c}, removed | {c})

    return list(generate_from(corr))
```
This is probably a generally better performing solution, but I liked the clean algorithm of the first one better for explanation.
I was annoyed by the slowness of the above solution after a comment asked whether it scales up to 100 elements with ~15 conflicts (it would run for many minutes on that), so here's a faster solution that runs in under 1 second for 100 elements with 15 conflicts. The execution time still grows exponentially, though, so it has its limits:
```python
from collections import defaultdict


def injective_function_conflicts(f):
    if not f:
        return {}
    conflicts = defaultdict(set)
    # loop over the product f x f
    for x in f:
        for y in f:
            # for each x and y that have a conflict in any position
            if x != y and any(a == b for a, b in zip(x, y)):
                # add x to y's entry and y to x's entry
                conflicts[y].add(x)
                conflicts[x].add(y)
    return conflicts


def get_maximal_partial_subsets(conflicts, off_limits: set = None):
    if off_limits is None:
        off_limits = set()
    while conflicts:
        # pop elements from the conflicts, using them now, or discarding them if off-limits
        k, vs = conflicts.popitem()
        if k not in off_limits:
            break
    else:
        # nothing left in conflicts that's not off-limits
        yield set()
        return
    # generate each possible result from the rest of the conflicts, adding the conflicts vs for k to off_limits
    for sub_result in get_maximal_partial_subsets(dict(conflicts), off_limits | vs):
        # these results can have k added to them, as all the conflicts with k were off-limits
        yield sub_result | {k}
    # also generate each possible result from the rest of the conflicts without k's conflicts
    for sub_result in get_maximal_partial_subsets(conflicts, off_limits):
        # but only yield as a result if adding k itself would actually cause a conflict, avoiding duplicates
        if sub_result and injective_function_conflicts(sub_result | {k}):
            yield sub_result


def efficient_get_maximal_subsets(corr):
    conflicts = injective_function_conflicts(corr)
    final_result = list((corr - set(conflicts.keys())) | result
                        for result in get_maximal_partial_subsets(dict(conflicts)))
    print(f'size of result and conflict: {len(final_result)}, {len(conflicts)}')
    return final_result


print(efficient_get_maximal_subsets(corr={('a1', 'c1'), ('a1', 'c2'), ('a2', 'c1'), ('a3', 'c3'), ('a4', 'c4')}))
```