Given a string, e.g. 'i am a string'.
I can generate the n-grams of this string like so, using the nltk package, where n is varied over a specified range:

```python
from nltk import ngrams

s = 'i am a string'
for n in range(1, 3):
    for grams in ngrams(s.split(), n):
        print(grams)
```
This gives the output:

```
('i',)
('am',)
('a',)
('string',)
('i', 'am')
('am', 'a')
('a', 'string')
```
Is there a way to 'reconstruct' the original string using combinations of the generated n-grams? Or, as a commenter below put it: is there a way to divide the sentence into consecutive word sequences, where each sequence has a maximum length of k (in this case k is 2)?
```
[('i',), ('am',), ('a',), ('string',)]
[('i', 'am'), ('a',), ('string',)]
[('i',), ('am', 'a'), ('string',)]
[('i',), ('am',), ('a', 'string')]
[('i', 'am'), ('a', 'string')]
```
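For illustration, the desired segmentations can be enumerated directly with a small recursive sketch (`segmentations` is a hypothetical helper name, not part of the question):

```python
def segmentations(words, k):
    """Yield every split of `words` into consecutive chunks of length <= k."""
    if not words:
        yield []
        return
    # take the next 1..k words as the first chunk, recurse on the remainder
    for i in range(1, min(k, len(words)) + 1):
        for rest in segmentations(words[i:], k):
            yield [tuple(words[:i])] + rest

for seg in segmentations('i am a string'.split(), 2):
    print(seg)
```

For 'i am a string' with k = 2 this yields exactly the five segmentations listed above.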
The question is similar to this one, though with an additional layer of complexity.
Working solution – adapted from here.
I have a working solution, but it's really slow for longer strings.
```python
import itertools
from nltk import ngrams

def get_ngrams(s, min_=1, max_=4):
    token_lst = []
    for n in range(min_, max_):
        for grams in ngrams(s.split(), n):
            token_lst.append(' '.join(grams))
    return token_lst

def to_sum_k(s):
    # brute force: try every permutation of every subset size and keep
    # those that join back to the original string
    for len_ in range(1, len(s.split()) + 1):
        for i in itertools.permutations(get_ngrams(s), r=len_):
            if ' '.join(i) == s:
                print(i)

to_sum_k('a b c')
```
Answer
While my previous answer assumed that the problem was to find an unknown string based on its n-grams, this answer deals with the problem of finding all ways to construct a given string using its n-grams.
Assuming repetitions are allowed, the solution is fairly simple: generate all possible number sequences summing to the length of the original string, with no number larger than n, and use these sequences to build the n-gram combinations:
```python
import numpy

def generate_sums(l, n, intermediate):
    # yield all sequences of integers in [1, n] that sum to l,
    # prefixed by the accumulator `intermediate`
    if l == 0:
        yield intermediate
    elif l < 0:
        return
    else:
        for i in range(1, n + 1):
            yield from generate_sums(l - i, n, intermediate + [i])

def combinations(s, n):
    words = s.split(' ')
    for c in generate_sums(len(words), n, [0]):
        cs = numpy.cumsum(c)  # chunk boundaries into the word list
        yield [words[l:u] for (l, u) in zip(cs, cs[1:])]
```
EDIT:
As pointed out by @norok2 in the comments (thanks for the work), it seems to be faster to use alternative cumsum implementations instead of the one provided by numpy for this use case.
END EDIT
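Along those lines, one pure-Python variant uses `itertools.accumulate` as the cumulative sum. This is an assumed stand-in for the alternatives discussed in the comments, not @norok2's exact code; the definitions are repeated so the snippet runs standalone:

```python
from itertools import accumulate

def generate_sums(l, n, intermediate):
    # same generator as above, repeated for a self-contained snippet
    if l == 0:
        yield intermediate
    elif l < 0:
        return
    else:
        for i in range(1, n + 1):
            yield from generate_sums(l - i, n, intermediate + [i])

def combinations(s, n):
    words = s.split(' ')
    for c in generate_sums(len(words), n, [0]):
        cs = list(accumulate(c))  # pure-Python cumulative sum, no array conversion
        yield [words[l:u] for (l, u) in zip(cs, cs[1:])]

for c in combinations('a b c', 2):
    print(c)
```

For 'a b c' with n = 2 this prints `[['a'], ['b'], ['c']]`, `[['a'], ['b', 'c']]`, and `[['a', 'b'], ['c']]`.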
If repetitions are not allowed, things get a little trickier. In this case we can use a non-deterministic finite automaton, as defined in my previous answer, and build our sequences from traversals of the automaton:
```python
from nltk import ngrams

def build_state_machine(s, n):
    # one chain of states per distinct ngram prefix; state 0 is the start state
    next_state = 1
    transitions = {}
    for ng in ngrams(s.split(' '), n):
        state = 0
        for word in ng:
            if (state, word) not in transitions:
                transitions[(state, word)] = next_state
                next_state += 1
            state = transitions[(state, word)]
    return transitions

def combinations(s, n):
    transitions = build_state_machine(s, n)
    # each search state: (automaton state, terminals already used,
    #                     finished chunks, current chunk)
    states = [(0, set(), [], [])]
    for word in s.split(' '):
        new_states = []
        for state, term_visited, path, cur_elem in states:
            if state not in term_visited:
                # close the current chunk and restart from state 0;
                # note union({state}) -- union(state) with a bare int
                # raises a TypeError, since an int is not iterable
                new_states.append((0, term_visited.union({state}),
                                   path + [tuple(cur_elem)], []))
            if (state, word) in transitions:
                new_states.append((transitions[(state, word)], term_visited,
                                   path, cur_elem + [word]))
        states = new_states
    return [path + [tuple(cur_elem)] if state != 0 else path
            for (state, term_visited, path, cur_elem) in states
            if state not in term_visited]
```
As an example, the following state machine would be generated for the string "a b a":
Red connections indicate a switch to the next n-gram and need to be handled separately (the second if in the loop), since they can only be traversed once.
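Since the figure itself is not reproduced here, the transition table for "a b a" (n = 2) can be inspected directly. The sketch below restates the same construction without the nltk dependency, assuming that zip-based bigrams match what `nltk.ngrams` produces:

```python
def build_state_machine(words, n):
    # same construction as above, but taking a word list and using zip()
    # in place of nltk.ngrams so the sketch needs no external packages
    next_state = 1
    transitions = {}
    for ng in zip(*(words[i:] for i in range(n))):
        state = 0
        for word in ng:
            if (state, word) not in transitions:
                transitions[(state, word)] = next_state
                next_state += 1
            state = transitions[(state, word)]
    return transitions

print(build_state_machine('a b a'.split(), 2))
# {(0, 'a'): 1, (1, 'b'): 2, (0, 'b'): 3, (3, 'a'): 4}
```

The bigram ('a', 'b') contributes the chain 0 → 1 → 2, and ('b', 'a') the chain 0 → 3 → 4, which is the structure the figure depicts.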