
Testing string membership using the "in" keyword in Python is very slow

I have the following text dataset:

4 million paragraphs, each between 10 and 60 words long.


I also have a set of 30,000 unique sentences.

I want to check whether ANY of the sentences in the set occurs in each of those 4 million paragraphs. If a paragraph contains at least one of the 30,000 sentences, I want to keep it; otherwise I want to discard it.

My implementation works, but it is very slow for that amount of data.

How could I improve my code? I tried using swifter, but it estimates a run time of around 5 hours for that amount of data!

Is there a way to speed things up, for example with dask? I’m also open to using a different file format, such as CSV, if the data should be read from disk.


Answer
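
One way to avoid testing every sentence against every paragraph is to index the sentences by their word sequences and slide a word window over each paragraph, so each paragraph costs a handful of set lookups instead of 30,000 substring searches. The following is only a minimal sketch under stated assumptions: it matches on whitespace-separated words rather than raw substrings (so it can differ from "in" when a sentence starts or ends mid-word or punctuation is attached), and the names build_index, contains_any and filter_paragraphs, as well as the toy data, are made up for illustration.

```python
from collections import defaultdict


def build_index(sentences):
    """Group sentences by word count: {n_words: {tuple_of_words, ...}}."""
    index = defaultdict(set)
    for sentence in sentences:
        words = tuple(sentence.split())
        if words:
            index[len(words)].add(words)
    return index


def contains_any(paragraph, index):
    """Return True if any indexed sentence appears as a run of words."""
    words = paragraph.split()
    for size, candidates in index.items():
        # Slide a window of `size` words; the range is empty if the
        # paragraph is shorter than the sentence length.
        for start in range(len(words) - size + 1):
            if tuple(words[start:start + size]) in candidates:
                return True
    return False


def filter_paragraphs(paragraphs, sentences):
    """Keep only the paragraphs that contain at least one sentence."""
    index = build_index(sentences)
    return [p for p in paragraphs if contains_any(p, index)]


# Toy data (assumption: stands in for the 4 million paragraphs
# and the 30,000 sentences).
paragraphs = [
    "the quick brown fox jumps over the lazy dog",
    "nothing to see here at all",
    "a stitch in time saves nine they say",
]
sentences = {"brown fox jumps", "saves nine"}

kept = filter_paragraphs(paragraphs, sentences)
print(kept)
```

The work per paragraph then depends on the paragraph length and the number of distinct sentence lengths, not on the number of sentences. If exact substring semantics are required, this sketch would need to be adapted or replaced by a multi-pattern matcher.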


It’s time to profile:
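
A minimal sketch of how such a timing could be set up with the standard-library timeit module. The synthetic data, its sizes, and the name keep_naive are assumptions for illustration; this is not the original benchmark, and the numbers it prints depend on the machine.

```python
import random
import timeit

random.seed(0)

# Synthetic stand-in data (assumption): far smaller than the real
# 4 million paragraphs and 30,000 sentences.
vocabulary = [f"w{i}" for i in range(500)]


def random_text(n_words):
    return " ".join(random.choice(vocabulary) for _ in range(n_words))


paragraphs = [random_text(random.randint(10, 60)) for _ in range(200)]
sentences = {random_text(random.randint(3, 6)) for _ in range(300)}


def keep_naive(paragraphs, sentences):
    """Baseline: substring test of every sentence against every paragraph."""
    return [p for p in paragraphs if any(s in p for s in sentences)]


runs = 5
elapsed = timeit.timeit(lambda: keep_naive(paragraphs, sentences), number=runs)
per_paragraph = elapsed / (runs * len(paragraphs))

print(f"{per_paragraph * 1e6:.1f} us per paragraph with {len(sentences)} sentences")
# Rough extrapolation only: the baseline cost also grows with the
# number of sentences, which is 30,000 in the real data.
print(f"~{per_paragraph * 4_000_000 / 60:.1f} min for 4 million paragraphs")
```

Timing any alternative on the same data with the same harness gives a like-for-like comparison before committing to a multi-hour run.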

User contributions licensed under: CC BY-SA