
Regex within Pandas DataFrame – finding minimum length between characters

Edit: Updated for reproducibility

I am working with a pandas DataFrame in which each row of one column (Column A) holds a list of strings. I am trying to extract the minimum distance between any sublist combination of a keyword list (ListB):

JavaScript

whilst each row in the DataFrame column contains a list of strings.

JavaScript

At a high level, I want to search each string in the lists held in data['list_of_strings_to_search'] for any sublist combination of the ListB elements (both conditions must be satisfied) and return the ListB sublist that matches. From that I can calculate the distance, in words, between each pair of sublist elements.

JavaScript
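To make "distance in words" concrete, here is a minimal sketch of one way to measure it. It assumes that each keyword is a single word, that matching is case-insensitive, and that distance means the difference between word positions; the function name `min_word_distance` is hypothetical and not taken from the original code.

```python
import re


def min_word_distance(text, first, second):
    """Return the smallest gap in word positions between `first` and
    `second` in `text`, or None if either keyword is absent.

    Assumption: keywords are single words, compared case-insensitively.
    """
    words = re.findall(r"\w+", text.lower())
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    if not first_positions or not second_positions:
        return None
    # Compare every occurrence of one keyword with every occurrence of the other.
    return min(abs(i - j) for i in first_positions for j in second_positions)


sample = "the cat sat near the dog and the cat"
distance = min_word_distance(sample, "cat", "dog")
```

Under these assumptions adjacent words have a distance of 1, so subtract 1 if the number of words strictly between the two keywords is what is needed.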

expected output:

JavaScript

current output:

JavaScript

However, from manual inspection of the strings and outputs, most regex combinations are not returned from the statement below, and I require all combinations held within each string:

JavaScript

I intend to return all of the sublist combinations present within each string (hence the iteration through the list of sublist candidate values), in order to assess the minimum distance between the elements.
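As a rough illustration of iterating over candidate combinations, the sketch below enumerates every pair of keywords and keeps the pairs whose two elements both occur in a string. It assumes the "sublist combinations" are unordered pairs drawn from a flat keyword list, which may differ from the actual structure of ListB; `present_pairs` and the sample keywords are hypothetical.

```python
import re
from itertools import combinations


def present_pairs(text, keywords):
    """Return every keyword pair whose two elements both appear in `text`.

    Assumption: keywords are single words, compared case-insensitively.
    """
    words = set(re.findall(r"\w+", text.lower()))
    return [
        (a, b)
        for a, b in combinations(keywords, 2)
        if a.lower() in words and b.lower() in words
    ]


keywords = ["apple", "banana", "kiwi", "cherry"]  # stand-in for ListB
pairs = present_pairs("apple pie with banana and cherry", keywords)
```

Checking each pair independently, rather than relying on a single combined regex, avoids losing combinations when one match consumes text that another pair would also need.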

Any help is greatly appreciated!!


Answer

Let’s traverse each list in the column list_of_strings_to_search inside a list comprehension, then for each string in the list use re.findall with a regex pattern to find the sub-string with minimum length between the specified keywords:

JavaScript

Result:

JavaScript
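The original answer code is not reproduced here, so the following is only a sketch of the approach described above, not the answerer's exact pattern. It assumes two keywords that must appear in a fixed order, and it models the column as a plain list of lists so it runs without pandas. The names `shortest_span` and `column` are hypothetical. A lookahead with a lazy `.*?` lets `re.findall` return overlapping candidates, one per occurrence of the first keyword, and the shortest is kept.

```python
import re


def shortest_span(text, start_kw, end_kw):
    """Return the shortest substring that starts at `start_kw` and ends at
    the next `end_kw`, or None if no such substring exists."""
    # The lookahead makes each match zero-width, so candidates may overlap.
    pattern = rf"(?=(\b{re.escape(start_kw)}\b.*?\b{re.escape(end_kw)}\b))"
    spans = re.findall(pattern, text, flags=re.IGNORECASE | re.DOTALL)
    return min(spans, key=len) if spans else None


# Stand-in for data['list_of_strings_to_search']: one list of strings per row.
column = [
    ["alpha one two beta then alpha beta", "no keywords"],
    ["beta alpha"],
]

# Outer loop over rows, inner loop over the strings in each row.
result = [
    [shortest_span(s, "alpha", "beta") for s in strings]
    for strings in column
]
```

With a real DataFrame, the same nested comprehension could iterate over `data['list_of_strings_to_search']` directly and the result be assigned to a new column.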
User contributions licensed under: CC BY-SA