I’d like to get only the numbers (integers) at the end of the phrases below:
VISTA AES TIETE E UNT N2 600
VISTA IT AUUNIBANCO PN N1 1.400
OPCAO DE VENDA 04/21 COGNP450ON 4,50COGNE 100.000
I mean: 600, 1400, 100000. I’ll add each one of them to a database later.
I tried to use regex: (?<=\s)(\d*\s*)|(\d*\.\d*)$
But it didn’t work properly. Any ideas?
PS: We use dots, not commas, as the thousands separator: 1.000 instead of 1,000.
Answer
In the pattern that you tried, this part (?<=\s)(\d*\s*)
matches optional digits followed by optional whitespace chars, and requires a whitespace char directly to the left.
Because both the digits and the whitespace chars are optional, it also matches (as an empty string) at every position in the string that has a whitespace char to the left.
In this part (\d*\.\d*)$
the digits are optional, so it could also match just a dot at the end of the string.
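To see that behaviour directly, here is a small sketch using Python's re module. It assumes the attempted pattern was written with the usual backslashes (\s, \d, \.), and the sample string is one of the phrases from the question.

```python
import re

# Assumed form of the pattern from the question, with backslashes written out.
attempt = re.compile(r"(?<=\s)(\d*\s*)|(\d*\.\d*)$")

text = "VISTA IT AUUNIBANCO PN N1 1.400"

# Collect every match; positions preceded by whitespace can match as empty strings.
matches = [m.group(0) for m in attempt.finditer(text)]
print(matches)

# The first alternative wins at the start of the number and stops at the dot,
# so the full "1.400" is never returned as a single match.
print("1.400" in matches)
```

The list contains empty strings plus fragments of the final number, which is why the result looked wrong.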
If there has to be a whitespace char before the number at the end, you can use:
(?<=\s)\d{1,3}(?:\.\d{3})*$
The pattern matches:
(?<=\s)
Positive lookbehind, assert a whitespace char to the left of the current position
\d{1,3}
Match 1-3 digits
(?:\.\d{3})*
Optionally repeat a dot and 3 digits
$
End of string
See a regex demo.
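As a quick check outside the demo, the pattern can be run against the three phrases from the question with re.search. This is a minimal sketch; the variable names are only illustrative.

```python
import re

# Number at the end of the string, preceded by a whitespace char.
pattern = re.compile(r"(?<=\s)\d{1,3}(?:\.\d{3})*$")

phrases = [
    "VISTA AES TIETE E UNT N2 600",
    "VISTA IT AUUNIBANCO PN N1 1.400",
    "OPCAO DE VENDA 04/21 COGNP450ON 4,50COGNE 100.000",
]

# search() scans for the first position where the whole pattern fits.
found = [pattern.search(p).group(0) for p in phrases]
print(found)  # ['600', '1.400', '100.000']
```

Note that "4,50COGNE" is skipped: the 4 is followed by a comma, so the end-of-string anchor cannot match there.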
If the number can also be by itself, you could assert a whitespace boundary to the left with (?<!\S) instead:
(?<!\S)\d{1,3}(?:\.\d{3})*$
See another regex demo.
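The difference between the two lookbehinds only shows up when the string consists of the number alone. A short sketch comparing them (the test strings "1.400" and "ABC1.400" are made up for illustration):

```python
import re

needs_space = re.compile(r"(?<=\s)\d{1,3}(?:\.\d{3})*$")  # requires whitespace before
boundary = re.compile(r"(?<!\S)\d{1,3}(?:\.\d{3})*$")     # whitespace or start of string

alone = "1.400"
print(needs_space.search(alone))        # None: nothing precedes the number
print(boundary.search(alone).group(0))  # 1.400

# Neither pattern accepts a number glued to preceding text.
glued = "ABC1.400"
print(needs_space.search(glued), boundary.search(glued))  # None None
```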
For example, using str.extract
and wrapping the pattern in a capture group:
import pandas as pd

strings = [
    "VISTA AES TIETE E UNT N2 600",
    "VISTA IT AUUNIBANCO PN N1 1.400",
    "OPCAO DE VENDA 04/21 COGNP450ON 4,50COGNE 100.000"
]

df = pd.DataFrame(strings, columns=["colName"])
# Capture the dot-grouped number at the end of each string
df['lastNumbers'] = df['colName'].str.extract(r"(?<=\s)(\d{1,3}(?:\.\d{3})*)$")
print(df)
Output
                                             colName lastNumbers
0                       VISTA AES TIETE E UNT N2 600         600
1                    VISTA IT AUUNIBANCO PN N1 1.400       1.400
2  OPCAO DE VENDA 04/21 COGNP450ON 4,50COGNE 100.000     100.000
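The extracted values are still strings that contain the thousands dots. Since the question asks for integers (600, 1400, 100000), one possible follow-up step is to strip the dots and convert. The helper below is a hypothetical sketch using only the standard library; the name last_integer is not from the question, and it uses the (?<!\S) variant of the pattern.

```python
import re

LAST_NUMBER = re.compile(r"(?<!\S)\d{1,3}(?:\.\d{3})*$")

def last_integer(phrase):
    """Return the trailing dot-grouped number as an int, or None if absent."""
    match = LAST_NUMBER.search(phrase)
    if match is None:
        return None
    # Drop the thousands separators before converting.
    return int(match.group(0).replace(".", ""))

phrases = [
    "VISTA AES TIETE E UNT N2 600",
    "VISTA IT AUUNIBANCO PN N1 1.400",
    "OPCAO DE VENDA 04/21 COGNP450ON 4,50COGNE 100.000",
]

numbers = [last_integer(p) for p in phrases]
print(numbers)  # [600, 1400, 100000]
```

In pandas, the same idea could probably be expressed by removing the dots with str.replace (with regex=False) and then converting the column with astype(int), provided every row has a match.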