I have two questions:
1) I have a regular expression ([A-Z][a-z]{0,2})(\d*)
and I am using Python's re.finditer()
to match appropriate strings. My problem is that I want to match only strings that contain no extra characters; otherwise I want to raise an exception.
I want to match the following pattern: a capital letter, followed by 0, 1 or 2 lowercase letters, followed by 0 or more digits.
The pattern represents a chemical formula, i.e. an atom followed by the number of its occurrences. I want to put each atom into a dictionary with its number of occurrences, so I need to separate atoms (a capital letter followed by 0, 1 or 2 lowercase letters) from numbers, but remember that they belong together.
Example:
C6H5Fe2I   # this string should be matched successfully. Result: C6 H5 Fe2 I
H2TeO4     # this string should be matched successfully. Result: H2 Te O4
H3PoooO5   # exception should be raised
C2tH6      # exception should be raised
2) My second question is what kind of exception I should raise if the input string is invalid.
Thank you, Tomas
Answer
Here are a few different approaches you could use:
Compare lengths
- Find the length of the original string.
- Sum the length of the matched strings.
- If the two numbers differ there were unused characters.
Note that you can also combine this method with your existing code rather than doing it as an extra step if you want to avoid parsing the string twice.
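The length comparison can be folded into the same loop that builds the dictionary. The following is a minimal sketch, not a definitive implementation: the function name parse_by_length is hypothetical, a missing number is assumed to mean one occurrence, and ValueError is an assumed choice of exception (it is the built-in usually raised for an argument of the right type but an inappropriate value).

```python
import re

# The asker's pattern, with the digit class written as \d.
ATOM = re.compile(r"([A-Z][a-z]{0,2})(\d*)")


def parse_by_length(formula):
    """Return {atom: count}; raise ValueError if any character is left unmatched.

    Hypothetical helper for illustration only.
    """
    counts = {}
    matched = 0  # total length of everything finditer() consumed
    for m in ATOM.finditer(formula):
        atom, digits = m.groups()
        # Assumption: an atom without a number occurs once.
        counts[atom] = counts.get(atom, 0) + int(digits or 1)
        matched += len(m.group(0))
    # finditer() silently skips characters it cannot match, so a
    # shorter total means the input contained extra characters.
    if matched != len(formula):
        raise ValueError(f"unexpected characters in formula: {formula!r}")
    return counts
```

With the examples from the question, parse_by_length("C6H5Fe2I") yields C=6, H=5, Fe=2, I=1, while "H3PoooO5" and "C2tH6" raise ValueError because one character in each is skipped by finditer().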
Regular expression for entire string
You can check if this regular expression matches the entire string:
^([A-Z][a-z]{0,2}\d*)*$
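A sketch of the whole-string check, assuming Python 3.4 or later: re.fullmatch() is used in place of the ^ and $ anchors, which has the same effect except that $ would also accept a trailing newline. The names WHOLE and parse_validated are hypothetical, a missing number is assumed to mean one occurrence, and ValueError is an assumed choice of exception.

```python
import re

# One atom with its optional count, written with \d for digits.
ATOM = re.compile(r"([A-Z][a-z]{0,2})(\d*)")
# Zero or more atoms; fullmatch() supplies the anchoring.
WHOLE = re.compile(r"(?:[A-Z][a-z]{0,2}\d*)*")


def parse_validated(formula):
    """Validate the whole string first, then extract the atoms.

    Hypothetical helper for illustration only.
    """
    if WHOLE.fullmatch(formula) is None:
        raise ValueError(f"not a valid formula: {formula!r}")
    counts = {}
    # With two groups in the pattern, findall() returns (atom, digits) tuples.
    for atom, digits in ATOM.findall(formula):
        # Assumption: an atom without a number occurs once.
        counts[atom] = counts.get(atom, 0) + int(digits or 1)
    return counts
```

This parses the string twice, once to validate and once to extract, which keeps the two concerns separate at a small cost in speed.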
Tokenize
You can use the following regular expression to tokenize the original string:
[A-Z][^A-Z]*
Then check each token to see if it matches your original regular expression.
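The tokenizing approach might look like the sketch below. The names TOKEN and parse_tokens are hypothetical, a missing number is assumed to mean one occurrence, and ValueError is an assumed choice of exception. One detail the sketch adds: the tokenizer starts every token at a capital letter, so characters before the first capital are not part of any token and have to be checked separately.

```python
import re

# Split at capital letters: each token runs up to the next capital.
TOKEN = re.compile(r"[A-Z][^A-Z]*")
# One atom with its optional count, written with \d for digits.
ATOM = re.compile(r"([A-Z][a-z]{0,2})(\d*)")


def parse_tokens(formula):
    """Tokenize, then validate each token on its own.

    Hypothetical helper for illustration only.
    """
    tokens = TOKEN.findall(formula)
    # Anything before the first capital letter is dropped by the
    # tokenizer, so make sure the tokens cover the whole input.
    if "".join(tokens) != formula:
        raise ValueError(f"formula must start with a capital letter: {formula!r}")
    counts = {}
    for token in tokens:
        m = ATOM.fullmatch(token)
        if m is None:
            raise ValueError(f"invalid token {token!r} in formula {formula!r}")
        atom, digits = m.groups()
        # Assumption: an atom without a number occurs once.
        counts[atom] = counts.get(atom, 0) + int(digits or 1)
    return counts
```

A side benefit of this variant is that the error message can name the offending token, for example 'Pooo' in H3PoooO5 or 'C2t' in C2tH6.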