Regex within Pandas DataFrame – finding minimum length between characters

Edit: Updated for reproducibility

I am working with a Pandas DataFrame in which each row of one column holds a list of strings. I am trying to extract the minimum distance between the two keywords of any sublist in a keyword list (ListB):

ListB = [['abc','def'],['ghi','jkl'],['mno','pqr']]

Each row of the DataFrame column contains a list of strings:

import pandas as pd
import numpy as np

# Build the frame from a plain list of rows; wrapping these ragged rows in
# np.array is unnecessary and can raise an error on recent NumPy versions.
data = pd.DataFrame([['1', '2', ['random string to be searched abc def ghi jkl','random string to be searched abc','abc random string to be searched def']],
['4', '5', ['random string to be searched ghi jkl','random string to be searched',' mno random string to be searched pqr']],
['7', '8', ['abc random string to be searched def','random string to be searched mno pqr','random string to be searched']]],
columns=['a', 'b', 'list_of_strings_to_search'])

At a high level, I want to search each string in the lists held in data['list_of_strings_to_search'] for every ListB sublist whose two elements both appear in that string, and return the matching text for each such sublist. From that I can calculate the distance (in words) between the two elements of the pair.
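
As a concrete illustration of the goal for one string and one keyword pair, here is a minimal sketch. The helper name words_between, the use of re.escape, and the definition of distance as the number of whitespace-separated words strictly between the two keywords are assumptions made for illustration only:

```python
import re

def words_between(text, word1, word2):
    """Return (matched_span, word_distance), or None if the pair is absent."""
    # Lazy .*? stops at the first word2 that follows the matched word1.
    m = re.search(re.escape(word1) + '.*?' + re.escape(word2), text)
    if m is None:
        return None
    span = m.group()
    # Count the words strictly between the two keywords.
    return span, len(span.split()) - 2

print(words_between('abc random string to be searched def', 'abc', 'def'))
# ('abc random string to be searched def', 5)
```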

import pandas as pd
import numpy as np
import re

def find_distance_between_words(text, word_list):
  '''This function does not work as intended yet.'''

  keyword_list = [] 

  # iterates through all sublists in ListB:
  for i in word_list:
    # iterates through all strings within list in dataframe column:
    for strings in text:
      # determines the two words to search (iterates through word_list)
      word1, word2 = i[0], i[1]
      # use regex to find both words:
      p = re.compile('.*?'.join((word1, word2)))
      iterator = p.finditer(strings)
      # for each match, append the string:
      for match in iterator:
        keyword_list.append(match.group())

    return keyword_list


data['try'] = data['list_of_strings_to_search'].apply(find_distance_between_words, word_list = ListB)
  

Expected output:

0    [abc def, ghi jkl, abc random string to be searched def]
1     [ghi jkl, mno random string to be searched pqr]
2    [abc random string to be searched def, mno pqr]

Current output:

0    [abc def, abc random string to be searched def]
1                                                 []
2             [abc random string to be searched def]

However, manual inspection of the strings and outputs shows that most of the keyword combinations are not returned by the loop below, and I need every combination present in each string:

for match in iterator:
  keyword_list.append(match.group())

I intend to return all of the sublist combinations present within each string (hence the iteration through the list of sublist candidate values), in order to assess the minimum distance between the elements.

Any help is greatly appreciated!!

Answer

Let’s traverse each list in the column list_of_strings_to_search inside a list comprehension, then for each string in the list use re.findall with a regex pattern to find the sub-string with minimum length between the specified keywords:

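import re

# One lazy pattern per keyword pair, joined as alternatives: 'abc.*?def|ghi.*?jkl|mno.*?pqr'
pat = '|'.join(fr'{x}.*?{y}' for x, y in ListB)
# For each row, search every string and flatten the per-string matches into one array
data['result'] = [np.hstack([re.findall(pat, s) for s in strings]) for strings in data['list_of_strings_to_search']]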

Result:

0    [abc def, ghi jkl, abc random string to be searched def]
1             [ghi jkl, mno random string to be searched pqr]
2             [abc random string to be searched def, mno pqr]
Name: result, dtype: object
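
If plain Python lists are preferred over the NumPy arrays that np.hstack produces, the same search can be written as an explicit loop. In the question's function the return statement sits inside the outer for loop, so only the first keyword pair is ever searched, which is consistent with the current output shown there. The sketch below returns only after every string has been searched. The function name find_keyword_spans and the use of re.escape are additions for illustration, not part of the original answer:

```python
import re

# Keyword pairs from the question.
ListB = [['abc', 'def'], ['ghi', 'jkl'], ['mno', 'pqr']]

def find_keyword_spans(strings, word_list):
    """Collect every 'word1 ... word2' span found in a list of strings."""
    # One alternation for all pairs, so matches come back in
    # left-to-right order within each string.
    pat = re.compile('|'.join(
        re.escape(w1) + '.*?' + re.escape(w2) for w1, w2 in word_list
    ))
    spans = []
    for s in strings:
        spans.extend(pat.findall(s))
    # Return only after every string has been searched.
    return spans

# The three rows of list_of_strings_to_search from the question.
rows = [
    ['random string to be searched abc def ghi jkl',
     'random string to be searched abc',
     'abc random string to be searched def'],
    ['random string to be searched ghi jkl',
     'random string to be searched',
     ' mno random string to be searched pqr'],
    ['abc random string to be searched def',
     'random string to be searched mno pqr',
     'random string to be searched'],
]

results = [find_keyword_spans(r, ListB) for r in rows]
for r in results:
    print(r)
```

On the DataFrame itself, this could be applied with data['list_of_strings_to_search'].apply(find_keyword_spans, word_list=ListB), mirroring the call in the question.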


Source: stackoverflow