Skip to content
Advertisement

Reconstruct input string given ngrams of that string

Given a string, e.g. i am a string.

I can generate the n-grams of this string like so, using the nltk package, where n is variable as per a specified range.

from nltk import ngrams 

s = 'i am a string'
for n in range(1, 3):
    for grams in ngrams(s.split(), n):
        print(grams)

Gives the output:

('i',)
('am',)
('a',)
('string',)
('i', 'am')
('am', 'a')
('a', 'string')

Is there a way to ‘reconstruct’ the original string using combinations of the generated ngrams? Or, in the words of the below commenter, is there a way to divide the sentence into consecutive word sequences where each sequence has a maximum length of k (in this case k is 2).

[('i'), ('am'), ('a'), ('string')]
[('i', 'am'), ('a'), ('string')]
[('i'), ('am', 'a'), ('string')]
[('i'), ('am'), ('a', 'string')]
[('i', 'am'), ('a', 'string')]

The question is similar to this one, though with an additional layer of complexity.

Working solution – adapted from here.

I have a working solution, but it’s really slow for longer strings.

def get_ngrams(s, min_=1, max_=4):
    token_lst = []
    for n in range(min_, max_):
        for idx, grams in enumerate(ngrams(s.split(), n)):
            token_lst.append(' '.join(grams))
    return token_lst 

def to_sum_k(s):
    for len_ in range(1, len(s.split())+1):
        for i in itertools.permutations(get_ngrams(s), r=len_):
            if ' '.join(i) == s:
                print(i)

to_sum_k('a b c')

Advertisement

Answer

While my previous answer assumed that the problem was to find an unknown string based on it’s ngrams, this answer will deal with the problem of finding all ways to construct a given string using it’s ngrams.

Assuming repetitions are allowed the solution is fairly simple: Generate all possible number sequences summing up to the length of the original string with no number larger than n and use these to create the ngram-combinations:

import numpy

def generate_sums(l, n, intermediate):
    if l == 0:
        yield intermediate
    elif l < 0:
        return
    else:
        for i in range(1, n + 1):
            yield from generate_sums(l - i, n, intermediate + [i])

def combinations(s, n):
    words = s.split(' ')
    for c in generate_sums(len(words), n, [0]):
        cs = numpy.cumsum(c)
        yield [words[l:u] for (l, u) in zip(cs, cs[1:])]

EDIT:
As pointed out by @norok2 (thanks for the work) in the comments, it seems to be faster to use alternative cumsum-implementations instead of the one provided by numpy for this usecase.
END EDIT

If repetitions are not allowed things become a little bit more tricky. In this case we can use a non-deterministic finite automaton as defined in my previous answer and build our sequences based on traversals of the automaton:

def build_state_machine(s, n):
    next_state = 1
    transitions = {}
    for ng in ngrams(s.split(' '), n):
        state = 0
        for word in ng:
            if (state, word) not in transitions:
                transitions[(state, word)] = next_state
                next_state += 1

            state = transitions[(state, word)]

     return transitions

def combinations(s, n):
    transitions = build_state_machine(s, n)
    states = [(0, set(), [], [])]

    for word in s.split(' '):
        new_states = []
        for state, term_visited, path, cur_elem in states:
            if state not in term_visited:
                new_states.append((0, term_visited.union(state), path + [tuple(cur_elem)], []))
            if (state, word) in transitions:
                new_states.append((transitions[(state, word)], term_visited, path, cur_elem + [word]))

        states = new_states

   return [path + [tuple(cur_elem)] if state != 0 else path for (state, term_visited, path, cur_elem) in states if state not in term_visited]

As an example the following state machine would be generated for the string "a b a":

state machine

Red connections indicate a switch to the next ngram and need to be handled separately (second if in the loop), since they can only be traversed once.

User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement