I have some input data and transformation functions t1, t2, t3, t4, t5, t6. Each of them requires some columns as input and outputs some columns.

details = {
    't1': {'input_cols': ['C1', 'C2'], 'output_cols': ['C8', 'C9', 'C10']},
    't2': {'input_cols': ['C8', 'C10'], 'output_cols': ['C11', 'C12']},
    't3': {'input_cols': ['C11', 'C12'], 'output_cols': ['C13', 'C14', 'C15', 'C16']},
    't4': {'input_cols': ['C9', 'C3', 'C4', 'C5', 'C6'], 'output_cols': ['C17', 'C18', 'C19', 'C20']},
    't5': {'input_cols': ['C20', 'C16'], 'output_cols': ['C21', 'C22']},
    't6': {'input_cols': ['C7', 'C16'], 'output_cols': ['C23', 'C24', 'C25']},
}
The DAG associated with these transformations is the following (diagram omitted):
I get as input the columns I want to generate, and I should return the sub-DAG required to obtain them (I am not interested only in the DAG nodes, the order also matters). For example,
if the required_columns are ['C1', 'C2', ..., 'C25'], the output should be ['t1', 't2', 't3', 't4', 't5', 't6'] or another valid order such as ['t1', 't4', 't2', 't3', 't6', 't5'] (t5 must come after both t3 and t4, and t6 after t3, since they consume their outputs)
if the required_columns are ['C8', 'C9', 'C10', 'C11', 'C12', 'C13', 'C14', 'C15', 'C16', 'C20', 'C21', 'C22'], I should output ['t1', 't2', 't3', 't4', 't5'] (t4 is needed because it produces C20, which is both required itself and an input of t5)
if the required_columns are ['C1', 'C3', 'C5', 'C8', 'C9', 'C10', 'C11', 'C12', 'C17', 'C18', 'C19', 'C20'], I should output ['t1', 't2', 't4'] (t2 is needed because it produces the required C11 and C12)
Note that the presence / absence of 'C1' … 'C7' in my required_columns doesn't influence the desired output, as these columns are not generated by t1, t2, t3, t4, t5, t6.
My approach for solving this problem would be:
- Create a DAG D based on the details dictionary
- For each leaf in D: if none of the current leaf's output_cols are in required_columns, delete the node, move upwards to the nodes that point to it, and repeat the step
- Print the values of the topologically sorted trimmed DAG
I intuitively think my approach is far from optimal. Given the current input data format (details is a dictionary and required_columns is a list), how can I do better?
Answer
Build the transpose DAG and do depth-first search from the required columns?
import collections

Transformation = collections.namedtuple("Transformation", ["input_cols", "output_cols"])

details = {
    "t1": Transformation(input_cols=["C1", "C2"], output_cols=["C8", "C9", "C10"]),
    "t2": Transformation(input_cols=["C8", "C10"], output_cols=["C11", "C12"]),
    "t3": Transformation(
        input_cols=["C11", "C12"], output_cols=["C13", "C14", "C15", "C16"]
    ),
    "t4": Transformation(
        input_cols=["C9", "C3", "C4", "C5", "C6"],
        output_cols=["C17", "C18", "C19", "C20"],
    ),
    "t5": Transformation(input_cols=["C20", "C16"], output_cols=["C21", "C22"]),
    "t6": Transformation(input_cols=["C7", "C16"], output_cols=["C23", "C24", "C25"]),
}


def depth_first_search(graph, roots):
    # Iterative post-order DFS: a node is emitted only after everything
    # it depends on has been emitted.
    stack = [(root, True) for root in roots]
    visited = set()
    order = []
    while stack:
        v, entering = stack.pop()
        if entering:
            if v in visited:
                continue
            visited.add(v)
            stack.append((v, False))
            stack.extend((w, True) for w in graph[v])
        else:
            order.append(v)
    return order


def sub_dag(required_columns):
    # Transpose DAG: each transformation points at its input columns,
    # and each column points at the transformation that produces it.
    graph = collections.defaultdict(set)
    for k, v in details.items():
        for c in v.input_cols:
            graph[k].add(c)
        for c in v.output_cols:
            graph[c].add(k)
    order = depth_first_search(graph, required_columns)
    # Keep only the transformation nodes, in dependency order.
    return [t for t in order if t in details]


print(sub_dag([
    "C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8", "C9", "C10",
    "C11", "C12", "C13", "C14", "C15", "C16", "C17", "C18", "C19",
    "C20", "C21", "C22", "C23", "C24", "C25",
]))
print(sub_dag([
    "C8", "C9", "C10", "C11", "C12", "C13", "C14", "C15", "C16",
    "C20", "C21", "C22",
]))
print(sub_dag([
    "C1", "C3", "C5", "C8", "C9", "C10", "C11", "C12", "C17", "C18",
    "C19", "C20",
]))
print(sub_dag(["C1", "C3", "C5", "C25", "C22"]))
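One caveat: graph stores neighbours in sets, so the exact sequence printed can differ between runs (set iteration order depends on hash randomization); any result is only guaranteed to be some valid topological order. A small checker like the following (a hypothetical helper, not part of the answer, with details restated as plain dicts so it runs on its own) can verify a returned order:

```python
# Hypothetical validity checker; `details` is restated here so the
# snippet is self-contained.
details = {
    't1': {'input_cols': ['C1', 'C2'], 'output_cols': ['C8', 'C9', 'C10']},
    't2': {'input_cols': ['C8', 'C10'], 'output_cols': ['C11', 'C12']},
    't3': {'input_cols': ['C11', 'C12'], 'output_cols': ['C13', 'C14', 'C15', 'C16']},
    't4': {'input_cols': ['C9', 'C3', 'C4', 'C5', 'C6'], 'output_cols': ['C17', 'C18', 'C19', 'C20']},
    't5': {'input_cols': ['C20', 'C16'], 'output_cols': ['C21', 'C22']},
    't6': {'input_cols': ['C7', 'C16'], 'output_cols': ['C23', 'C24', 'C25']},
}

def is_valid_order(order, details):
    """True if every transformation appears after the producers of its inputs."""
    position = {t: i for i, t in enumerate(order)}
    producer = {c: t for t, d in details.items() for c in d['output_cols']}
    for t in order:
        for c in details[t]['input_cols']:
            p = producer.get(c)
            # A generated input column must be produced earlier in the order.
            if p is not None and (p not in position or position[p] >= position[t]):
                return False
    return True
```

For example, is_valid_order(['t1', 't2', 't3', 't5'], details) is False, because t5 consumes C20 but its producer t4 is missing from the order.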