I have a set of attributes A = {a1, a2, ..., an}, a set of clusters C = {c1, c2, ..., ck}, and a set of correspondences COR, which is a subset of A × C with |COR| << |A × C|. Here is a sample set of correspondences:

```
COR = {(a1, c1), (a1, c2), (a2, c1), (a3, c3), (a4, c4)}
```
Now, I want to generate all the subsets of COR such that each subset, viewed as a set of pairs, represents an injective function from A to C. Let's call each such subset a mapping. The valid mappings from the above COR would be:

```
m1 = {(a1, c1), (a3, c3), (a4, c4)}
m2 = {(a1, c2), (a2, c1), (a3, c3), (a4, c4)}
```
m1 is interesting here because adding any of the remaining elements of COR to m1 would violate either the definition of a function or the injectivity condition. For instance, if we add the pair (a1, c2) to m1, it would no longer be a function, and if we add (a2, c1), it would cease to be injective. So, I am interested in code snippets or an algorithm that I can use to generate all such mappings. Here is what I have tried so far in Python:
```python
import collections
import itertools

corr = set({('a1', 'c1'), ('a1', 'c2'), ('a2', 'c1'), ('a3', 'c3'), ('a4', 'c4')})
clusters = [c[1] for c in corr]
attribs = [a[0] for a in corr]
rep_clusters = [item for item, count in collections.Counter(clusters).items() if count > 1]
rep_attribs = [item for item, count in collections.Counter(attribs).items() if count > 1]
conflicting_sets = []
for c in rep_clusters:
    conflicting_sets.append([p for p in corr if p[1] == c])
for a in rep_attribs:
    conflicting_sets.append([p for p in corr if p[0] == a])
non_conflicting = corr
for s in conflicting_sets:
    non_conflicting = non_conflicting - set(s)
m = set()
for p in itertools.product(*conflicting_sets):
    print(p, 'product', len(p))
    p_attribs = set([k[0] for k in p])
    p_clusters = set([k[1] for k in p])
    print(len(p_attribs), len(p_clusters))
    if len(p) == len(p_attribs) and len(p) == len(p_clusters):
        m.add(frozenset(set(p).union(non_conflicting)))
print(m)
```
As expected, the code produces m2 but not m1, because m1 will never be generated by itertools.product. Can anyone guide me on this? I would also like some guidance on performance, because the actual sets will be larger than the COR set used here and may contain many more conflicting sets.
Answer
A simpler definition of your requirements is:
- You have a set of unique tuples.
- You want to generate all subsets for which:
    - all of the first elements of the tuples are unique (to ensure a function);
    - and all of the second elements are unique (to ensure injectivity).
- Your title suggests you only want the maximal subsets, i.e. it must be impossible to add any additional element from the original set without breaking the other requirements.
I'm also assuming any a<x> or c<y> is unique.
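The two uniqueness conditions in the requirements can be checked directly. As a minimal sketch (`is_valid_mapping` is a hypothetical helper name, not part of the solutions below):

```python
def is_valid_mapping(pairs):
    """Check that a set of (attribute, cluster) pairs is an injective function."""
    domain = [a for a, _ in pairs]  # first elements: must be unique for a function
    image = [c for _, c in pairs]   # second elements: must be unique for injectivity
    return len(set(domain)) == len(domain) and len(set(image)) == len(image)

print(is_valid_mapping({('a1', 'c1'), ('a3', 'c3')}))  # True
print(is_valid_mapping({('a1', 'c1'), ('a1', 'c2')}))  # False: a1 maps to two clusters
print(is_valid_mapping({('a1', 'c1'), ('a2', 'c1')}))  # False: c1 is hit twice
```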
Here’s a solution:
```python
def get_maximal_subsets(corr):
    def is_injective_function(f):
        if not f:
            return False
        f_domain, f_range = zip(*f)
        return len(set(f_domain)) - len(f_domain) + len(set(f_range)) - len(f_range) == 0

    def generate_from(f):
        if is_injective_function(f):
            for r in corr - f:
                if is_injective_function(f | {r}):
                    break
            else:
                yield f
        else:
            for c in f:
                yield from generate_from(f - {c})

    return list(map(set, set(map(frozenset, generate_from(corr)))))


# representing a's and c's as strings, as their actual value doesn't matter, as long as they are unique
print(get_maximal_subsets(corr={('a1', 'c1'), ('a1', 'c2'), ('a2', 'c1'), ('a3', 'c3'), ('a4', 'c4')}))
```
The test is_injective_function checks whether the provided set f represents a valid injective function, by collecting all the values from the domain and range of the function and checking that both contain only unique values.
The generator takes an f and, if it represents a valid injective function, checks that none of the elements removed from the original corr to reach f can be added back in while still having it represent a valid injective function. If that's the case, it yields f as a valid result.
If f isn't a valid injective function to begin with, it tries removing each of the elements of f in turn and generates any valid injective functions from each of those subsets.
Finally, the whole function removes duplicates from the resulting generator and returns it as a list of unique sets.
Output:
```
[{('a1', 'c1'), ('a3', 'c3'), ('a4', 'c4')}, {('a2', 'c1'), ('a3', 'c3'), ('a4', 'c4'), ('a1', 'c2')}]
```
Note: there are several approaches to de-duplicating a list of non-hashable values, but this one turns all the sets in the list into frozensets to make them hashable, turns the list into a set to remove duplicates, then turns the contents back into sets and returns the result as a list.
You can prevent removing duplicates at the end by keeping track of what removed subsets have already been tried, which may perform better depending on your actual data set:
```python
def get_maximal_subsets(corr):
    def is_injective_function(f):
        if not f:
            return False
        f_domain, f_range = zip(*f)
        return len(set(f_domain)) - len(f_domain) + len(set(f_range)) - len(f_range) == 0

    previously_removed = []

    def generate_from(f, removed: set = None):
        previously_removed.append(removed)
        if removed is None:
            removed = set()
        if is_injective_function(f):
            for r in removed:
                if is_injective_function(f | {r}):
                    break
            else:
                yield f
        else:
            for c in f:
                if removed | {c} not in previously_removed:
                    yield from generate_from(f - {c}, removed | {c})

    return list(generate_from(corr))
```
This is probably a generally better performing solution, but I liked the clean algorithm of the first one better for explanation.
I was annoyed by the slowness of the above solution after a comment asked whether it scales up to 100 elements with ~15 conflicts (it would run for many minutes on that), so here's a faster solution that runs in under 1 second for 100 elements with 15 conflicts. The execution time still grows exponentially, though, so it has its limits:
```python
from collections import defaultdict


def injective_function_conflicts(f):
    if not f:
        return {}
    conflicts = defaultdict(set)
    # loop over the product f x f
    for x in f:
        for y in f:
            # for each x and y that have a conflict in any position
            if x != y and any(a == b for a, b in zip(x, y)):
                # add x to y's entry and y to x's entry
                conflicts[y].add(x)
                conflicts[x].add(y)
    return conflicts


def get_maximal_partial_subsets(conflicts, off_limits: set = None):
    if off_limits is None:
        off_limits = set()
    while conflicts:
        # pop elements from the conflicts, using them now, or discarding them if off-limits
        k, vs = conflicts.popitem()
        if k not in off_limits:
            break
    else:
        # nothing left in conflicts that's not off-limits
        yield set()
        return
    # generate each possible result from the rest of the conflicts, adding the conflicts vs for k to off_limits
    for sub_result in get_maximal_partial_subsets(dict(conflicts), off_limits | vs):
        # these results can have k added to them, as all the conflicts with k were off-limits
        yield sub_result | {k}
    # also generate each possible result from the rest of the conflicts without k's conflicts
    for sub_result in get_maximal_partial_subsets(conflicts, off_limits):
        # but only yield as a result if adding k itself would actually cause a conflict, avoiding duplicates
        if sub_result and injective_function_conflicts(sub_result | {k}):
            yield sub_result


def efficient_get_maximal_subsets(corr):
    conflicts = injective_function_conflicts(corr)
    final_result = list((corr - set(conflicts.keys())) | result
                        for result in get_maximal_partial_subsets(dict(conflicts)))
    print(f'size of result and conflict: {len(final_result)}, {len(conflicts)}')
    return final_result


print(efficient_get_maximal_subsets(corr={('a1', 'c1'), ('a1', 'c2'), ('a2', 'c1'), ('a3', 'c3'), ('a4', 'c4')}))
```