I’m performing two big for-loop tasks on a dataframe column. The context is what I’m calling “text corruption”: turning perfectly structured text into text full of both missing punctuation and misspellings, to mimic human errors.
I found that running tens of thousands of rows was extremely slow, even after optimizing the for loops.
I discovered a process called batching in this post.
The top answer provides a concise template that I imagine is much faster than regular for-loop iteration.
How might I use that answer to reimplement the following code? (I added a comment to that answer asking for more detail.)
Alternatively, is there any technique that would make my for loops considerably quicker?
import pandas as pd
import random
import re

# example df
df = pd.DataFrame(columns=['Forname', 'Surname', 'Sentence'])
df.loc['0'] = ['Bob', 'Smith', 'Hi, this is a perfectly constructred sentence!']
df.loc['1'] = ['Alice', 'Smith', 'Can you tell this is fake data?']
df.loc['2'] = ['John', 'Smith', 'This poster needs help!']
df.loc['3'] = ['Michael', 'Smith', 'Apparently, this poster is sturggling a bit LOL']
df.loc['4'] = ['Daniel', 'Smith', 'More fake data here; ok.']
df.loc['5'] = ['Sarah', 'Smith', 'Will need to think up of better ideas.']
df.loc['6'] = ['Matthew', 'Smith', 'Love a good bit of Python, me.']
df.loc['7'] = ['Jane', 'Smith', 'Is this a sentence?! (I think so).']
df.loc['8'] = ['Peter', 'Smith', "Remarkable - isn't it?"]
df.loc['9'] = ['Chloe', 'Smith', "Foo Bar... that's all that is left to say."]
print(df)

punctuation_marks = ['?', '…', '!', '.', ',', '—', '–', ':', ';', '"', "'", '[', ']', '(', ')', '{', '}']

p = 0.5  # changeable

# randomly remove punctuation marks
for idx, string in enumerate(df['Sentence']):
    for punc in punctuation_marks:
        if punc in string:
            CHANCE = (random.randint(1, 100)) / 100
            if CHANCE <= p:
                df['Sentence'][idx] = string.replace(punc, '')

misspellings_corpus = open('misspellings_corpus.txt', 'r')
misspellings = misspellings_corpus.readlines()

# randomly swap correctly spelled words for misspellings
for idx, string in enumerate(df['Sentence']):
    word_list = re.sub(r"[^\w]", " ", string).split()  # removes punctuation
    for word in word_list:
        CHANCE = (random.randint(1, 100)) / 100
        try:  # break middle for-loop
            for ms in misspellings:
                if (word in ms) and (CHANCE <= p):
                    wrong = ms.split('->')[0]
                    correct = ms.split('->')[1][:-2]  # removes '\n'
                    if ',' in correct:
                        correct = random.choice(correct.split(',')).strip()  # only 1 correct spelling
                    if correct in string:
                        df['Sentence'][idx] = string.replace(correct, wrong)
                        raise StopIteration
        except StopIteration:
            pass
misspellings_corpus.txt (snippet):
affadvit,affa_dava,afadant,afadavate,afadavid,affidate,affidavent,afftadave,athadavid,affiadait,aphadivode,appidavid,afidaded,affi-davit,affidavat,aphadated,affivadat,afidaviat,affedavit,affiavate,affidaved,afefedavid,affidavate,affavidate,affdated,aphidavit,affevivat,affided,affadavid,attipdavid,affidavidit,affidavite,affadivate,affidavited,afdiodave,affidafet,affidivit,afadafit,affedit,afadavide,afidefed,Affi_David,affividate,affaidivit,afidiated,affidovt,affadavat,avadavate,effidavit,afidavit,aphadavid,afedaved,afardivient,apitated,affividative,affedaivite,afteradeated,Afi_David,acavated,affedated,affidevit,affidivat,afaedaviate,affedaved,afatait,afedative,avidated,afidavid,avidiate,afadavit,affedave,affedavid,afidaved,affavidit,afidated,afidavite,afodivid,affidated,afadiadid,affidaphet,affidatet,athadiet,afidabit,affidait,afadated,affadivit,affadavit,afadivite,affidavid,affadapfed,affdavit,aphedavid,athadavit,adivide,afdavit,afedavit,afadiatet,alpadavid,afadaviate,affadivid,aftedavid,affadavite,affadavate,apadenment,aphadavet->affidavit
anverrsy,aneversary,anneversies,anniversity,anavuature,annevarcery,annerfversy,anervery,annaversary,anverserice,annaversery,Anniversary,anivrsary,ananersery,anaversie,anniverserie,annaversity,anifurcaty,anenany,anavirsary,aniversy,anverseary,annervesary,annerverarcy,anaveres,anerviersy,aneversy,aniversary,anivesery,anneversers,anirversary,anniversy,aniversere,aneversere,annaversrey,anavorasy,annversary,aniversiry,anerversurey,Amanversery,anniversery,aniversery,anniversiory,anniversily,anneversary,aneversiary,anaversery,anaversity,anniverserys,anerversary,anniverseray,aniverseray,anniverary,anivessery,anaversarie,aniversity,Annyver,annervirsary,anniversty,annevyercy,aniverusy,anarversieiy,onniver,anaversy,anversity,anaveje,anversicy,anniversay,anerversee,aneversarry,anifersery,anversy,aneversery,annaversiry,annivirsary,annivercery,anvesy,anvertery,annversy,anevers,anniverisy,aneversory,anternesery,avernity,Eenarcrsity,anivarisy,aniverserary,annaverserie,anniversaries,aniversay,anyversary,ananversery,annivesrey,anniversiry,annivesry,anniverscy,annerversery,amryvercary,anneversery,anerversery,anversa,anmersersy,aneversitey,aniversry,aniverserry->anniversary
Ane->And
agenst->agents
eeg,agg->egg
Note: I can paste more example lines if wanted.
Answer
apply can be used to invoke a function on each row and is much faster than a for loop (vectorized functions are even faster). I’ve done a few things to make life easier and more performant:
- convert your text file into a dict. This will be more performant and easier to work with than raw text.
- put all the corruption logic in a function. This will be easier to maintain and allows us to use apply
- cleaned up/modified the logic a bit. What I show below is not exactly what you asked but should be easy to adapt.
OK, here is the code:
import io
import random

# this generates a dict {'word1': ['list', 'of', 'misspellings'], ...}
# where s is a string containing the misspellings file pasted above
df2 = pd.DataFrame(io.StringIO(s), columns=["subs"])
sub_dict = df2.subs.str.strip().str.split("->", expand=True).set_index(1)[0].str.split(",").to_dict()

sub_dict["fake"] = ["fak", "fkae", "fke"]
sub_dict["tell"] = ["tel"]
sub_dict["this"] = ["tis", "htsi"]
sub_dict["data"] = ["dat", "dta"]

def corrupt(sentence, sub_dict, p=0.5):
    # logic is similar but not identical to your code
    for k, v in sub_dict.items():
        if k in sentence and random.random() <= p:
            corrupted_word = random.choice(v)
            sentence = sentence.replace(k, corrupted_word)
    return sentence
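If you'd rather build the dict straight from misspellings_corpus.txt instead of pasting its contents into a string, a minimal sketch (assuming each line looks like wrong1,wrong2,...->correct, as in the snippet above, and keyed the same way the one-liner above keys it) could be:

# minimal sketch: build {correct: [misspellings]} directly from the file
# (assumes each line is "wrong1,wrong2,...->correct")
sub_dict = {}
with open('misspellings_corpus.txt') as f:
    for line in f:
        line = line.strip()
        if '->' not in line:
            continue  # skip blank or malformed lines
        wrongs, correct = line.split('->', 1)
        sub_dict[correct] = wrongs.split(',')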
Now the apply bit:
df["corrupted"] = df.Sentence.apply(lambda sentence: corrupt(sentence, sub_dict)) # works as expected, see second sentence Forname Surname Sentence corrupted 0 Bob Smith Hi, this is a perfectly constructred sentence! Hi, this is a perfectly constructred sentence! 1 Alice Smith Can you tell this is fake data? Can you tel htsi is fake dta? 2 John Smith This poster needs help! This poster needs help! 3 Michael Smith Apparently, this poster is sturggling a bit LOL Apparently, this poster is sturggling a bit LOL 4 Daniel Smith More fake data here; ok. More fke dat here; ok. 5 Sarah Smith Will need to think up of better ideas. Will need to think up of better ideas. 6 Matthew Smith Love a good bit of Python, me. Love a good bit of Python, me. 7 Jane Smith Is this a sentence?! (I think so). Is this a sentence?! (I think so). 8 Peter Smith Remarkable - isn't it? Remarkable - isn't it? 9 Chloe Smith Foo Bar... that's all that is left to say. Foo Bar... that's all that is left to say.
Now let's compare performance with a for loop:
df_test1 = df.sample(n=10000, replace=True)
df_test2 = df.sample(n=10000, replace=True)

def loop(df):
    for idx, string in enumerate(df['Sentence']):
        corrupted_sentence = corrupt(string, sub_dict)
        df['Sentence'][idx] = corrupted_sentence

%timeit df_test1.Sentence.apply(lambda sentence: corrupt(sentence, sub_dict))
# 36.5 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit loop(df_test2)
# 5.19 s ± 98.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Woohoo! It’s way faster.
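If you need to go even faster, the "vectorized functions are even faster" remark above could look something like the sketch below. The corrupt_vectorized name is just for illustration, and it simplifies things by picking one misspelling per correct word for the whole column rather than per row, so benchmark it on your own data before relying on it:

import numpy as np

def corrupt_vectorized(sentences, sub_dict, p=0.5):
    # work on the whole Series at once: for each correct spelling,
    # replace it only on a random subset of rows via a boolean mask
    out = sentences.copy()
    for correct, wrongs in sub_dict.items():
        mask = np.random.random(len(out)) <= p   # rows selected for corruption
        wrong = random.choice(wrongs)            # one misspelling shared by all masked rows
        out[mask] = out[mask].str.replace(correct, wrong, regex=False)
    return out

df["corrupted_v"] = corrupt_vectorized(df["Sentence"], sub_dict)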