I'm extracting codes from a list of strings that come from email titles. The list looks something like this:
text_list = ['Industry / Gemany / PN M564839', 'Industry / France / PN: 575-439', 'Telecom / Gemany / P/N 26-59-29', 'Mobile / France / P/N: 88864839']
So far what I tried is:
import re

def get_p_number(text):
    rx = re.compile(r'[p/n:]\s+((?:\w+(?:\s+|$)){1})',
                    re.I)
    res = []
    m = rx.findall(text)
    if len(m) > 0:
        m = [p_number.replace(' ', '').upper() for p_number in m]
        m = remove_duplicates(m)  # helper defined elsewhere (not shown)
        res.append(m)
    else:
        res.append('no P Number found')
    return res
My issue is that I'm not able to extract the code that follows any of the prefixes ['PN', 'P/N', 'PN:', 'P/N:'], especially if the code starts with a letter (e.g. 'M') or contains a hyphen (e.g. 26-59-29).
My desired output would be:
res = ['M564839', '575-439', '26-59-29', '88864839']
Answer
In your pattern, the character class [p/n:] followed by \s+ matches one of the listed characters (p, /, n or :) and then 1+ whitespace chars. In the example data, that will match a forward slash or a colon followed by a space.
The next part, (?:\w+(?:\s+|$)), matches 1+ word characters followed by either the end of the string or 1+ whitespace chars. It does not take a hyphen or a whitespace char in the middle of the value into account.
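To see this behaviour concretely, here is a small sketch that runs the question's pattern (written with its backslashes, i.e. \s and \w) against two of the sample titles. The names `original` and `samples` are only illustrative and do not come from the question.

```python
import re

# The pattern from the question, written with \s and \w.
original = re.compile(r'[p/n:]\s+((?:\w+(?:\s+|$)){1})', re.I)

samples = [
    'Industry / Gemany / PN M564839',
    'Telecom / Gemany / P/N 26-59-29',
]

for text in samples:
    # findall returns the content of the single capturing group per match
    print(text, '->', original.findall(text))

# Industry / Gemany / PN M564839 -> ['Gemany ', 'PN ']
# Telecom / Gemany / P/N 26-59-29 -> ['Gemany ']
```

The pattern latches onto the "/ " separators and captures the word that follows them, so the part number itself is never captured.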
One option is to match PN with an optional / and an optional : instead of using the character class [p/n:], and to put the value in a capturing group:
/ P/?N:? ([\w-]+)
For example:
import re

text_list = ['Industry / Gemany / PN M564839', 'Industry / France / PN: 575-439', 'Telecom / Gemany / P/N 26-59-29', 'Mobile / France / P/N: 88864839']
regex = r"/ P/?N:? ([\w-]+)"
res = []
for text in text_list:
    # search finds the first occurrence anywhere in the string
    matches = re.search(regex, text)
    if matches:
        res.append(matches.group(1))
print(res)
Result
['M564839', '575-439', '26-59-29', '88864839']
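If you want to keep the shape of the original function, including the fallback message and the upper-casing, a minimal sketch could look like the following. The use of re.I, the looser \s* and \s+ spacing and the `default` parameter are assumptions added here for illustration; they are not required by the data shown.

```python
import re

# Assumed variant of the pattern above: \s* and \s+ tolerate irregular
# spacing, and re.I also accepts lowercase "pn" or "p/n".
P_NUMBER_RX = re.compile(r"/\s*P/?N:?\s+([\w-]+)", re.I)

def get_p_number(text, default='no P Number found'):
    """Return the upper-cased part number found in text, or default."""
    match = P_NUMBER_RX.search(text)
    return match.group(1).upper() if match else default

text_list = ['Industry / Gemany / PN M564839', 'Industry / France / PN: 575-439',
             'Telecom / Gemany / P/N 26-59-29', 'Mobile / France / P/N: 88864839']
res = [get_p_number(text) for text in text_list]
print(res)
# ['M564839', '575-439', '26-59-29', '88864839']
```

A title without a part number, such as 'Mobile / France', returns the default string instead of being skipped.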