Conditionally merge lines in text file

I’ve a text file full of common misspellings and their corrections.

All misspellings, of the same intended word, should be on the same line.

I do have this somewhat done, but not for all misspellings of the same word.

misspellings_corpus.txt (snippet):

I'de->I'd
aple->apple
appl->apple
I'ed, I'ld, Id->I'd

Desired:

I'de, I'ed, I'ld, Id->I'd
aple, appl->apple

template: wrong1, wrong2, wrongN->correct


Attempt:

lines = []
with open('/content/drive/MyDrive/Colab Notebooks/misspellings_corpus.txt', 'r') as fin:
  lines = fin.readlines()

for this_idx, this_line in enumerate(lines):
  for comparison_idx, comparison_line in enumerate(lines):
    if this_idx != comparison_idx:
      if this_line.split('->')[1].strip() == comparison_line.split('->')[1].strip():
        pass  # ...
correct_words = [l.split('->')[1].strip() for l in lines]
correct_words

Answer

Store the correct spelling of each word as a key in a dictionary, mapped to a set of that word's misspellings. The dict lets you easily look up the word you're trying to correct, and the set avoids duplicate misspellings.

possible_misspellings = {}

with open('my-file.txt') as file:
  for line in file:
    misspellings, word = line.split('->')
    word = word.strip()
    misspellings = set(m.strip() for m in misspellings.split(','))

    if word in possible_misspellings:
      possible_misspellings[word].update(misspellings)
    else:
      possible_misspellings[word] = misspellings

Then you can iterate over your dictionary and write each entry back out in the wrong1, wrong2, wrongN->correct format:

with open('my-new-file.txt', 'w') as file:
  for word, misspellings in possible_misspellings.items():
    line = ', '.join(sorted(misspellings)) + '->' + word + '\n'
    file.write(line)
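
The same idea can be sketched as one self-contained function that works on a list of lines, which makes it easy to try on the snippet from the question before touching any files. This is only a sketch under a few assumptions: the name merge_misspellings and the in-memory corpus list are illustrative, lines without '->' are skipped, and the misspellings are sorted purely so the output order is predictable.

```python
from collections import defaultdict


def merge_misspellings(lines):
    """Group misspellings by correction; return 'wrong1, wrong2->correct' lines."""
    grouped = defaultdict(set)  # correct word -> set of misspellings
    for line in lines:
        line = line.strip()
        if '->' not in line:
            continue  # assumption: skip blank or malformed lines
        wrong, correct = line.rsplit('->', 1)
        grouped[correct.strip()].update(
            w.strip() for w in wrong.split(',') if w.strip()
        )
    # dicts keep insertion order, so corrections appear in first-seen order
    return [', '.join(sorted(wrongs)) + '->' + correct
            for correct, wrongs in grouped.items()]


# The snippet from the question, used here as hypothetical in-memory input
corpus = ["I'de->I'd", "aple->apple", "appl->apple", "I'ed, I'ld, Id->I'd"]
for merged_line in merge_misspellings(corpus):
    print(merged_line)
```

With a real file, the list can be replaced by the open file object, since iterating over a file also yields one line at a time.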
User contributions licensed under: CC BY-SA