I'm extracting codes from a list of strings that come from email titles. The list looks something like:
text_list = ['Industry / Gemany / PN M564839', 'Industry / France / PN: 575-439', 'Telecom / Gemany / P/N 26-59-29', 'Mobile / France / P/N: 88864839']
So far what I tried is:
def get_p_number(text):
    rx = re.compile(r'[p/n:]\s+((?:\w+(?:\s+|$)){1})', re.I)
    res = []
    m = rx.findall(text)
    if len(m) > 0:
        m = [p_number.replace(' ', '').upper() for p_number in m]
        m = remove_duplicates(m)
        res.append(m)
    else:
        res.append('no P Number found')
    return res
My issue is that I'm not able to extract the code that follows one of the prefixes ['PN', 'P/N', 'PN:', 'P/N:'], especially if the code starts with a letter (e.g. 'M') or contains hyphens (e.g. 26-59-29).
My desired output would be:
res = ['M564839', '575-439', '26-59-29', '88864839']
Answer
In your pattern, the character class followed by a quantifier, [p/n:]\s+, will match one of the listed characters followed by 1+ whitespace chars. In the example data that will match a forward slash or a colon followed by a space.
The next part, (?:\w+(?:\s+|$)), will match 1+ word characters followed by either the end of the string or 1+ whitespace chars, without taking a whitespace char in the middle or a hyphen into account.
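To see this behaviour concretely, here is a small sketch that runs the pattern from the question against two of the sample titles. The names `samples` and `found` are only illustrative, and the captured values are stripped because the capturing group also includes the trailing whitespace.

```python
import re

# The pattern from the question, compiled case-insensitively as in get_p_number.
rx = re.compile(r'[p/n:]\s+((?:\w+(?:\s+|$)){1})', re.I)

samples = [
    'Industry / Gemany / PN M564839',
    'Telecom / Gemany / P/N 26-59-29',
]

# Group 1 includes the trailing whitespace, so strip it for readability.
found = {s: [m.strip() for m in rx.findall(s)] for s in samples}

for title, matches in found.items():
    print(title, '->', matches)
# Industry / Gemany / PN M564839 -> ['Gemany', 'PN']
# Telecom / Gemany / P/N 26-59-29 -> ['Gemany']
```

The pattern picks up whatever word follows a slash and a space, which is the country and then the PN prefix itself, and it never reaches the actual code.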
One option is to match PN with an optional : and / instead of using a character class [p/n:], and have your value in a capturing group:
/ P/?N:? ([\w-]+)
For example:
import re

text_list = ['Industry / Gemany / PN M564839', 'Industry / France / PN: 575-439', 'Telecom / Gemany / P/N 26-59-29', 'Mobile / France / P/N: 88864839']
regex = r"/ P/?N:? ([\w-]+)"
res = []

for text in text_list:
    matches = re.search(regex, text)
    if matches:
        res.append(matches.group(1))

print(res)
Result
['M564839', '575-439', '26-59-29', '88864839']
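If you also want to keep the placeholder from the question for titles without a code, the same pattern could be wrapped in a small helper. This is only a sketch: the function name `extract_part_numbers`, the `placeholder` parameter and the extra title without a code are assumptions made for illustration.

```python
import re

# Same pattern as above: "/ ", then PN or P/N, an optional colon,
# a space, and the code (word characters and hyphens) in group 1.
PN_REGEX = re.compile(r"/ P/?N:? ([\w-]+)")

def extract_part_numbers(titles, placeholder="no P Number found"):
    """Return one entry per title: the captured code, or the placeholder."""
    results = []
    for title in titles:
        match = PN_REGEX.search(title)
        results.append(match.group(1) if match else placeholder)
    return results

text_list = ['Industry / Gemany / PN M564839', 'Industry / France / PN: 575-439',
             'Telecom / Gemany / P/N 26-59-29', 'Mobile / France / P/N: 88864839']

# The last title is a made-up example that carries no code.
codes = extract_part_numbers(text_list + ['Mobile / France / no code here'])
print(codes)
# ['M564839', '575-439', '26-59-29', '88864839', 'no P Number found']
```

Note that the pattern is case-sensitive as written; pass re.I to re.compile if the titles may contain a lowercase "pn".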