Implementing for loops as batches

I’m running two large for-loop tasks on a dataframe column. The context is what I’m calling “text corruption”: turning well-structured text into text with missing punctuation and misspellings, to mimic human errors.

I found that processing tens of thousands of rows was extremely slow, even after optimizing the for loops.


I discovered a process called batching in this post.

The top answer provides a concise template that I imagine is much faster than regular for loop iterations.

How might I use that answer to reimplement the following code? (I added a comment to it asking more about it).

Alternatively, is there any technique that would make my for loops considerably quicker?

JavaScript

misspellings_corpus.txt (snippet):

JavaScript

Note: I can paste more example lines if wanted.

Answer

apply can be used to invoke a function on each row and is typically much faster than an explicit for loop over the rows (vectorized functions are faster still). I’ve done a few things to make life easier and improve performance:

  • Convert your text file into a dict. Lookups will be faster, and the data easier to work with, than raw text.
  • Put all the corruption logic in a function. This is easier to maintain and allows us to use apply.
  • Clean up and modify the logic a bit. What I show below is not exactly what you asked for, but it should be easy to adapt.

OK, here is the code:

JavaScript
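As a rough illustration of the first two points (a dict of misspellings plus a single corruption function), here is a minimal sketch. The corpus line format ("correct: wrong1, wrong2"), the sample words, and the names load_misspellings, corrupt, p_misspell and p_drop_punct are all assumptions made for the example, not the original code or the real layout of misspellings_corpus.txt.

```python
import random
import string

# Hypothetical corpus lines; the real file format may differ.
CORPUS_LINES = [
    "because: becuase, becasue",
    "definitely: definately",
    "separate: seperate",
]


def load_misspellings(lines):
    """Parse 'correct: wrong1, wrong2' lines into {correct: [wrong, ...]}."""
    table = {}
    for line in lines:
        word, _, variants = line.partition(":")
        options = [v.strip() for v in variants.split(",") if v.strip()]
        if options:
            table[word.strip().lower()] = options
    return table


MISSPELLINGS = load_misspellings(CORPUS_LINES)
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def corrupt(text, misspellings=MISSPELLINGS, p_misspell=0.5,
            p_drop_punct=0.5, rng=random):
    """Randomly misspell known words and drop punctuation, token by token."""
    out = []
    for token in text.split():
        # Look the word up without its surrounding punctuation.
        core = token.strip(string.punctuation)
        options = misspellings.get(core.lower())
        if options and rng.random() < p_misspell:
            # The replacement is lowercase, so original casing is lost.
            token = token.replace(core, rng.choice(options), 1)
        if rng.random() < p_drop_punct:
            token = token.translate(_PUNCT_TABLE)
        if token:  # skip tokens that were nothing but punctuation
            out.append(token)
    return " ".join(out)
```

The dict makes each word lookup a constant-time operation, so the cost per row depends on the number of words in the row rather than on the size of the corpus.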

Now the apply bit:

JavaScript
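For reference, applying a row-wise function to a single column might look like the sketch below. The column names text and corrupted, the tiny MISSPELLINGS dict, and the simplified corrupt function are stand-ins invented for the example.

```python
import random

import pandas as pd

# Hypothetical lookup table standing in for the parsed corpus.
MISSPELLINGS = {"because": ["becuase"], "separate": ["seperate"]}


def corrupt(text, rng):
    """Swap known words for a misspelling about half of the time."""
    words = []
    for word in text.split():
        if word in MISSPELLINGS and rng.random() < 0.5:
            word = rng.choice(MISSPELLINGS[word])
        words.append(word)
    return " ".join(words)


rng = random.Random(0)  # seeded so repeated runs give the same output
df = pd.DataFrame({
    "text": [
        "keep them separate because it matters",
        "nothing to change here",
    ]
})

# Series.apply calls corrupt once per value; args passes the extra argument.
df["corrupted"] = df["text"].apply(corrupt, args=(rng,))
print(df)
```

Calling apply on the column (a Series) rather than on the whole DataFrame avoids building a row object for every call, which is usually the cheaper option when only one column is needed.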

Now let’s compare performance with a for loop:

JavaScript
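One way to run such a comparison yourself is sketched below, assuming the loop being replaced is an iterrows-style loop. The data, the deterministic corrupt stand-in and the row count are made up for the example, and the timings will vary by machine.

```python
import time

import pandas as pd

# Hypothetical one-to-one replacements, kept deterministic so that
# both approaches can be checked against each other.
MISSPELLINGS = {"because": "becuase", "separate": "seperate"}


def corrupt(text):
    """Replace every known word with its misspelling."""
    return " ".join(MISSPELLINGS.get(word, word) for word in text.split())


N_ROWS = 5_000
df = pd.DataFrame({"text": ["keep them separate because it matters"] * N_ROWS})

# Explicit loop over rows.
start = time.perf_counter()
loop_result = []
for _, row in df.iterrows():
    loop_result.append(corrupt(row["text"]))
loop_seconds = time.perf_counter() - start

# Same work through Series.apply.
start = time.perf_counter()
apply_result = df["text"].apply(corrupt).tolist()
apply_seconds = time.perf_counter() - start

print(f"iterrows loop: {loop_seconds:.3f}s, apply: {apply_seconds:.3f}s")
```

Much of the gap in a test like this tends to come from iterrows constructing a Series for every row; the per-row Python work inside corrupt costs the same either way.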

woohoo! It’s way faster.

User contributions licensed under: CC BY-SA