Sentence segmentation within a dictionary using spaCy dependency parse

Question

I have a TMX file containing source and target segments. Some of these segments are made up of several sentences. My goal is to split these multi-sentence segments so that the entire TMX file consists of single-sentence segments. I intend to use spaCy's dependency parser to do the splitting. To achieve this, I have extracted the source and target

Accepted Answer

The solution here is that you shouldn't put your stuff in a dictionary like that; use a list instead. Maybe something like this:

    import spacy
    from translate.storage.tmx import tmxfile

    # Load the TMX file with German as source and English as target.
    with open("./files/NTA_test.tmx", 'rb') as fin:
        tmx_file = tmxfile(fin, 'de-DE', 'en-GB')

    de = spacy.load("de_core_news_lg")
    en = spacy.load("en_core_web_lg")

    out = []
    for node in tmx_file.unit_iter():
        # Split each side of the translation unit into sentences.
        de_sents = list(de(node.source).sents)
        en_sents = list(en(node.target).sents)
        assert len(de_sents) == len(en_sents), "Different number of sentences!"

        # Pair the sentences up by position.
        for desent, ensent in zip(de_sents, en_sents):
            out.append((desent, ensent))

The hard part of this will be deciding what to do when the numbers of sentences don't line up. Also note that I would be cautious about doing this conversion in the first place: a translator may have worked holistically, so even if the sentence counts match there's no guarantee that the first DE sentence corresponds to the first EN sentence, for example.
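One possible way to handle the mismatch case, instead of stopping on the assert, is to keep the original segment unsplit whenever the sentence counts differ. The following is only a sketch of that idea: `pair_sentences` is a hypothetical helper, not part of spaCy or the Translate Toolkit, and it works on plain strings so it can be tried without loading any model.

```python
def pair_sentences(src_sents, tgt_sents):
    """Pair sentences by position, or keep the segment whole on a mismatch.

    Hypothetical helper (an assumption, not a library function).
    Returns (pairs, was_split): `pairs` is a list of (source, target)
    string tuples, and `was_split` says whether the segment was split.
    """
    # Drop empty strings and surrounding whitespace left by the splitter.
    src = [s.strip() for s in src_sents if s.strip()]
    tgt = [t.strip() for t in tgt_sents if t.strip()]

    if len(src) == len(tgt):
        # Same count: pair by position. This is still only a guess at the
        # true alignment, since the translator may have reordered content.
        return list(zip(src, tgt)), True

    # Different counts: rejoin each side and keep one unsplit pair.
    return [(" ".join(src), " ".join(tgt))], False
```

In the loop above, the inputs could be built with `[s.text for s in de(node.source).sents]` and the equivalent for the target side, and segments where the second return value is false could be collected for manual review.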
