Extracting codes with regex (irregular regex keys)


I'm extracting codes from a list of strings that come from email titles. The list looks something like this:

text_list = ['Industry / Gemany / PN M564839', 'Industry / France / PN: 575-439', 'Telecom / Gemany / P/N 26-59-29', 'Mobile / France / P/N: 88864839']

So far what I tried is:

def get_p_number(text):
    rx = re.compile(r'[p/n:]\s+((?:\w+(?:\s+|$)){1})',
                    re.I)
    res = []
    m = rx.findall(text)
    if len(m) > 0:
        m = [p_number.replace(' ', '').upper() for p_number in m]
        m = remove_duplicates(m)
        res.append(m)
    else:
        res.append('no P Number found')
    return res

My issue is that I'm not able to extract the code that follows one of the prefixes ['PN', 'P/N', 'PN:', 'P/N:'], especially if the code starts with a letter (e.g. 'M') or contains a hyphen (e.g. 26-59-29).

My desired output would be:

res = ['M564839', '575-439', '26-59-29', '88864839']

Answer

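In your pattern, the character class [p/n:] matches a single one of the listed characters (p, /, n or :), so [p/n:]\s+ matches one such character followed by 1+ whitespace chars. In the example data that is a forward slash, a colon, or (because of re.I) the N of PN followed by a space, never the whole PN or P/N prefix.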

The next part (?:\w+(?:\s+|$)) will match 1+ word characters followed by either 1+ whitespace chars or the end of the string. It does not account for a hyphen or a whitespace char in the middle of the code, because \w matches neither.
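To see this behaviour on the sample data, here is a small sketch that runs the question's pattern (written with its \s and \w escapes) over two of the strings. The names `samples` and `found` are only illustrative.

```python
import re

# The question's pattern, with the \s and \w escapes written out.
rx = re.compile(r'[p/n:]\s+((?:\w+(?:\s+|$)){1})', re.I)

samples = [
    'Industry / Gemany / PN M564839',
    'Industry / France / PN: 575-439',
]

# findall returns the contents of capture group 1 for every match.
found = {text: rx.findall(text) for text in samples}
for text, groups in found.items():
    print(repr(text), '->', groups)
```

On these inputs the captured groups are the words that follow a slash (such as 'Gemany ' or 'PN '), not the codes themselves.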

One option is to match PN with an optional : and / instead of using a character class [p/n:] and have your value in a capturing group:

/ P/?N:? ([\w-]+)
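The same pattern can be spelled out piece by piece with re.VERBOSE, which may make it easier to read. This is only an annotated sketch of the pattern above; in verbose mode a literal space has to be written as [ ].

```python
import re

pattern = re.compile(r"""
    /[ ]        # the slash and space that separate the fields
    P/?N        # "PN" or "P/N" (the inner slash is optional)
    :?          # an optional colon
    [ ]         # a single space before the code
    ([\w-]+)    # group 1: 1+ word characters or hyphens
""", re.VERBOSE)

match = pattern.search('Industry / France / PN: 575-439')
code = match.group(1) if match else None
print(code)
```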

For example:

import re
text_list = ['Industry / Gemany / PN M564839', 'Industry / France / PN: 575-439', 'Telecom / Gemany / P/N 26-59-29', 'Mobile / France / P/N: 88864839']
regex = r"/ P/?N:? ([\w-]+)"
res = []
for text in text_list: 
    matches = re.search(regex, text)
    if matches:
        res.append(matches.group(1))

print(res)

Result

['M564839', '575-439', '26-59-29', '88864839']
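If you want to keep the shape of the function from the question, a minimal sketch could look like the following. It assumes each title contains at most one code and that codes consist only of word characters and hyphens. The `default` parameter is a hypothetical addition that carries the fallback text from the question.

```python
import re

# Compiled once; assumes codes contain only word characters and hyphens.
P_NUMBER_RX = re.compile(r"/ P/?N:? ([\w-]+)")

def get_p_number(text, default="no P Number found"):
    """Return the code after PN, P/N, PN: or P/N:, or the default text."""
    match = P_NUMBER_RX.search(text)
    return match.group(1).upper() if match else default

text_list = ['Industry / Gemany / PN M564839', 'Mobile / France / no code']
print([get_p_number(text) for text in text_list])
```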


Source: stackoverflow