How to extract exact match from list of strings in Python into separate lists

This is an example list of strings

new_text = ['XIC(Switch_A)OTE(Light1) XIC(Light1)OTE(Light2) Motor On Delay Timer XIC(Light1)TON(Motor_timer',
 '?',
 '?) XIC(Motor_timer.DN)OTE(Motor)']

I would like to extract XIC(Switch_A) into one list, OTE(Light1) into another list, TON(Motor_timer) into another list and so on.

This is the code in Python 3 that I have tried

for words in new_text:
    match = re.search('XIC(.*)', words)
print(match.group(1))

How do I go about extracting OTE(Tag name), XIC(Tag name), XIO(Tag name) into their own lists or groups?

Answer

You can use the following regex to match any three uppercase letters, followed by anything in parentheses:

([A-Z]{3})(\([^)]+\))
(        )             : Capturing group 1
          (          ) : Capturing group 2
 [A-Z]{3}              : Exactly three uppercase letters
           \(     \)   : Literal open/close parentheses
             [^)]+     : One or more of any character that is not )
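
To see what the two capturing groups return, here is a minimal sketch that runs the pattern against a single string. The `sample` value is a shortened, illustrative piece of the list from the question, not additional data.

```python
import re

# Group 1: three uppercase letters.
# Group 2: the tag name, with its literal parentheses included.
pattern = re.compile(r"([A-Z]{3})(\([^)]+\))")

# Illustrative input, shortened from the question's list.
sample = "XIC(Switch_A)OTE(Light1) Motor On Delay Timer"

# With two capturing groups, findall returns one (group 1, group 2) tuple per match.
pairs = pattern.findall(sample)
print(pairs)  # [('XIC', '(Switch_A)'), ('OTE', '(Light1)')]
```

The free text "Motor On Delay Timer" is skipped because it contains no run of three uppercase letters followed by an opening parenthesis.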

Use a collections.defaultdict to keep track of all your results. The identifier will be the key for this defaultdict, and the values will be lists containing all the matches for that identifier.

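import re
from collections import defaultdict

# Maps each identifier (e.g. "XIC") to the list of its full matches
results = defaultdict(list)

regex = re.compile(r"([A-Z]{3})(\([^)]+\))")

for s in new_text:
    # findall returns one (identifier, "(tag)") tuple per match
    matches = regex.findall(s)
    for m in matches:
        identifier = m[0]
        results[identifier].append(m[0] + m[1])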

Which gives the following results (TON(Motor_timer is not matched, because its closing parenthesis is in a different list element):

{'XIC': ['XIC(Switch_A)', 'XIC(Light1)', 'XIC(Light1)', 'XIC(Motor_timer.DN)'],
 'OTE': ['OTE(Light1)', 'OTE(Light2)', 'OTE(Motor)']}
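
If the TON instruction also needs to be captured, one possible approach is to rejoin the list elements before matching. The sketch below assumes the list was produced by splitting a single line on commas, which the question does not state, so the `","` separator and the name `joined_results` are assumptions made for illustration.

```python
import re
from collections import defaultdict

new_text = [
    "XIC(Switch_A)OTE(Light1) XIC(Light1)OTE(Light2) Motor On Delay Timer XIC(Light1)TON(Motor_timer",
    "?",
    "?) XIC(Motor_timer.DN)OTE(Motor)",
]

regex = re.compile(r"([A-Z]{3})(\([^)]+\))")

# Assumption: the original text was split on commas, so joining with ","
# puts the opening and closing parentheses of TON(...) back in one string.
joined = ",".join(new_text)

joined_results = defaultdict(list)
for identifier, tag in regex.findall(joined):
    joined_results[identifier].append(identifier + tag)

print(joined_results["TON"])  # ['TON(Motor_timer,?,?)']
```

If the elements were split some other way, the separator would need to change accordingly.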

Since you have a fixed set of identifiers, you can replace the [A-Z]{3} portion of the regex with something that will only match your identifiers:

regex = re.compile(r"(XIC|XIO|OTE|TON|TOF)(\([^)]+\))")

It is also possible to build this regex if you have your identifiers in an iterable:

identifiers = ["XIC", "XIO", "OTE", "TON", "TOF"]
regex = re.compile(rf"({'|'.join(identifiers)})(\([^)]+\))")
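
Putting the pieces together, a self-contained sketch might look like the following. The use of `re.escape` is an addition not in the answer above; it only matters if an identifier ever contains a regex metacharacter. The names `alternation`, `xic_list` and `ote_list` are illustrative.

```python
import re
from collections import defaultdict

new_text = [
    "XIC(Switch_A)OTE(Light1) XIC(Light1)OTE(Light2) Motor On Delay Timer XIC(Light1)TON(Motor_timer",
    "?",
    "?) XIC(Motor_timer.DN)OTE(Motor)",
]

identifiers = ["XIC", "XIO", "OTE", "TON", "TOF"]

# Build "XIC|XIO|OTE|TON|TOF"; re.escape is a precaution (an addition here)
# in case an identifier contains characters that are special in a regex.
alternation = "|".join(re.escape(name) for name in identifiers)
regex = re.compile(rf"({alternation})(\([^)]+\))")

results = defaultdict(list)
for s in new_text:
    for identifier, tag in regex.findall(s):
        results[identifier].append(identifier + tag)

# Each identifier's matches are available as a separate list.
xic_list = results["XIC"]
ote_list = results["OTE"]
print(xic_list)  # ['XIC(Switch_A)', 'XIC(Light1)', 'XIC(Light1)', 'XIC(Motor_timer.DN)']
print(ote_list)  # ['OTE(Light1)', 'OTE(Light2)', 'OTE(Motor)']
```

Because `results` is a defaultdict, looking up an identifier with no matches, such as `results["XIO"]`, returns an empty list rather than raising a KeyError.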