
Sentence segmentation within a dictionary using spaCy dependency parse

I have a TMX file containing source and target segments. Some of these segments are made up of several sentences. My goal is to segment these multi-sentence segments so that the entire TMX file consists of single-sentence segments.

I intend to use spaCy’s dependency parser to segment these multi-sentence segments.

To achieve this, I have extracted the source and target segments using the Translate Toolkit package.

I then added the source and target segments to a dictionary (seg_dic). Next, I converted these segments into spaCy Doc objects and stored them in a second dictionary (doc_dic). I now want to segment any multi-sentence segments using spaCy’s dependency parser …


… but I don’t know how I can do this with the segments being stored in a dictionary.

This is what I have so far:


Can anyone explain how I can proceed from here? How can I iterate over my dictionary keys and values using the “for sent in doc.sents” logic?


Answer

The solution here is that you shouldn’t put your segments in a dictionary like that; use a list instead. Maybe something like this.

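As a minimal sketch of the list-based idea (an illustration, not the answerer's original code): keep the segments as a list of (source, target) pairs and run each side through its own sentence splitter. The helper names `spacy_sentence_splitter` and `split_pairs` are hypothetical, and the model names in the usage comment assume an English/German pair with those spaCy models installed.

```python
def spacy_sentence_splitter(model_name):
    """Return a function that splits a text into sentence strings.

    Assumes the named spaCy model is installed and includes a parser,
    which is what sets the sentence boundaries exposed by doc.sents.
    """
    import spacy  # imported lazily so the rest of the sketch loads without spaCy

    nlp = spacy.load(model_name)
    return lambda text: [sent.text.strip() for sent in nlp(text).sents]


def split_pairs(pairs, split_source, split_target):
    """Split every (source, target) pair into two lists of sentences."""
    return [(split_source(src), split_target(tgt)) for src, tgt in pairs]


# Usage (assumes both models are installed):
# pairs = [("Hello there. How are you?", "Hallo. Wie geht es dir?")]
# sentences = split_pairs(
#     pairs,
#     spacy_sentence_splitter("en_core_web_sm"),
#     spacy_sentence_splitter("de_core_news_sm"),
# )
```

Keeping source and target together in one list entry means the two sides of a translation unit never drift apart, which is harder to guarantee with two separate dictionaries.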

The hard part of this will be what to do when the number of sentences doesn’t line up. Also note that I would be cautious about your conversion in the first place: a translator may have worked holistically, so even if the sentence counts match there’s no guarantee that the first DE sentence corresponds to the first EN sentence, for example.
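One cautious way to handle mismatched counts, sketched under the assumption that you would rather keep a unit whole than risk a wrong pairing: split only when both sides have the same number of sentences, and flag everything else for manual review. The function name `split_unit` and the review flag are hypothetical.

```python
def split_unit(source, target, source_sents, target_sents):
    """Return (pairs, needs_review) for one translation unit.

    source/target are the original segment strings; source_sents and
    target_sents are the sentence lists produced by the splitter.
    """
    if source_sents and len(source_sents) == len(target_sents):
        # Same count on both sides: pair sentences by position.
        return list(zip(source_sents, target_sents)), False
    # Counts differ (or nothing was found): keep the unit whole and flag it.
    return [(source, target)], True
```

Equal counts are only a heuristic. As noted above, positional pairing can still be wrong, so spot-checking the split units is worthwhile.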

User contributions licensed under: CC BY-SA