Given a string, e.g. 'i am a string'.
I can generate the n-grams of this string like so, using the nltk package, where n is varied over a specified range:

```python
from nltk import ngrams

s = 'i am a string'
for n in range(1, 3):
    for grams in ngrams(s.split(), n):
        print(grams)
```
This gives the output:

```
('i',)
('am',)
('a',)
('string',)
('i', 'am')
('am', 'a')
('a', 'string')
```
Is there a way to 'reconstruct' the original string using combinations of the generated n-grams? Or, as a commenter below put it: is there a way to divide the sentence into consecutive word sequences, where each sequence has a maximum length of k (in this case k is 2)?
```
[('i',), ('am',), ('a',), ('string',)]
[('i', 'am'), ('a',), ('string',)]
[('i',), ('am', 'a'), ('string',)]
[('i',), ('am',), ('a', 'string')]
[('i', 'am'), ('a', 'string')]
```
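For illustration, the desired segmentations can be enumerated directly with a small recursive sketch (`segmentations` is a hypothetical helper name, not part of the question):

```python
def segmentations(words, k):
    """Yield every split of `words` into consecutive chunks of length <= k."""
    if not words:
        yield []
        return
    # take the next 1..k words as the first chunk, recurse on the remainder
    for i in range(1, min(k, len(words)) + 1):
        for rest in segmentations(words[i:], k):
            yield [tuple(words[:i])] + rest

for seg in segmentations('i am a string'.split(), 2):
    print(seg)
```

For 'i am a string' with k = 2 this yields exactly the five segmentations listed above.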
The question is similar to this one, though with an additional layer of complexity.
Working solution – adapted from here.
I have a working solution, but it's really slow for longer strings.
```python
import itertools
from nltk import ngrams

def get_ngrams(s, min_=1, max_=4):
    token_lst = []
    for n in range(min_, max_):
        for grams in ngrams(s.split(), n):
            token_lst.append(' '.join(grams))
    return token_lst

def to_sum_k(s):
    # brute force: try every permutation of every subset size and keep
    # those that join back to the original string
    for len_ in range(1, len(s.split()) + 1):
        for i in itertools.permutations(get_ngrams(s), r=len_):
            if ' '.join(i) == s:
                print(i)

to_sum_k('a b c')
```
Answer
While my previous answer assumed that the problem was to find an unknown string based on its n-grams, this answer deals with the problem of finding all ways to construct a given string using its n-grams.
Assuming repetitions are allowed, the solution is fairly simple: generate all possible number sequences summing to the length of the original string, with no number larger than n, and use these sequences to build the n-gram combinations:
```python
import numpy

def generate_sums(l, n, intermediate):
    # yield all sequences of integers in [1, n] that sum to l,
    # prefixed by the accumulator `intermediate`
    if l == 0:
        yield intermediate
    elif l < 0:
        return
    else:
        for i in range(1, n + 1):
            yield from generate_sums(l - i, n, intermediate + [i])

def combinations(s, n):
    words = s.split(' ')
    for c in generate_sums(len(words), n, [0]):
        cs = numpy.cumsum(c)  # chunk boundaries into the word list
        yield [words[l:u] for (l, u) in zip(cs, cs[1:])]
```
EDIT:
As pointed out by @norok2 in the comments (thanks for the work), it seems to be faster to use alternative cumsum implementations instead of the one provided by numpy for this use case.
END EDIT
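Along those lines, one pure-Python variant uses `itertools.accumulate` as the cumulative sum. This is an assumed stand-in for the alternatives discussed in the comments, not @norok2's exact code; the definitions are repeated so the snippet runs standalone:

```python
from itertools import accumulate

def generate_sums(l, n, intermediate):
    # same generator as above, repeated for a self-contained snippet
    if l == 0:
        yield intermediate
    elif l < 0:
        return
    else:
        for i in range(1, n + 1):
            yield from generate_sums(l - i, n, intermediate + [i])

def combinations(s, n):
    words = s.split(' ')
    for c in generate_sums(len(words), n, [0]):
        cs = list(accumulate(c))  # pure-Python cumulative sum, no array conversion
        yield [words[l:u] for (l, u) in zip(cs, cs[1:])]

for c in combinations('a b c', 2):
    print(c)
```

For 'a b c' with n = 2 this prints `[['a'], ['b'], ['c']]`, `[['a'], ['b', 'c']]`, and `[['a', 'b'], ['c']]`.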
If repetitions are not allowed, things get a little trickier. In this case we can use a non-deterministic finite automaton, as defined in my previous answer, and build our sequences from traversals of the automaton:
```python
from nltk import ngrams

def build_state_machine(s, n):
    # one chain of states per distinct ngram prefix; state 0 is the start state
    next_state = 1
    transitions = {}
    for ng in ngrams(s.split(' '), n):
        state = 0
        for word in ng:
            if (state, word) not in transitions:
                transitions[(state, word)] = next_state
                next_state += 1
            state = transitions[(state, word)]
    return transitions

def combinations(s, n):
    transitions = build_state_machine(s, n)
    # each search state: (automaton state, terminals already used,
    #                     finished chunks, current chunk)
    states = [(0, set(), [], [])]
    for word in s.split(' '):
        new_states = []
        for state, term_visited, path, cur_elem in states:
            if state not in term_visited:
                # close the current chunk and restart from state 0;
                # note union({state}) -- union(state) with a bare int
                # raises a TypeError, since an int is not iterable
                new_states.append((0, term_visited.union({state}),
                                   path + [tuple(cur_elem)], []))
            if (state, word) in transitions:
                new_states.append((transitions[(state, word)], term_visited,
                                   path, cur_elem + [word]))
        states = new_states
    return [path + [tuple(cur_elem)] if state != 0 else path
            for (state, term_visited, path, cur_elem) in states
            if state not in term_visited]
```
As an example, the following state machine would be generated for the string "a b a":
Red connections indicate a switch to the next n-gram and need to be handled separately (the second if in the loop), since they can only be traversed once.
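Since the figure itself is not reproduced here, the transition table for "a b a" (n = 2) can be inspected directly. The sketch below restates the same construction without the nltk dependency, assuming that zip-based bigrams match what `nltk.ngrams` produces:

```python
def build_state_machine(words, n):
    # same construction as above, but taking a word list and using zip()
    # in place of nltk.ngrams so the sketch needs no external packages
    next_state = 1
    transitions = {}
    for ng in zip(*(words[i:] for i in range(n))):
        state = 0
        for word in ng:
            if (state, word) not in transitions:
                transitions[(state, word)] = next_state
                next_state += 1
            state = transitions[(state, word)]
    return transitions

print(build_state_machine('a b a'.split(), 2))
# {(0, 'a'): 1, (1, 'b'): 2, (0, 'b'): 3, (3, 'a'): 4}
```

The bigram ('a', 'b') contributes the chain 0 → 1 → 2, and ('b', 'a') the chain 0 → 3 → 4, which is the structure the figure depicts.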