In Python, how to check if a string only contains certain characters?
I need to check that a string contains only a..z, 0..9, and . (period) and no other character.
I could iterate over each character and check that it is a..z, 0..9, or . but that would be slow.
It is not clear to me how to do this with a regular expression.
Is this correct? Can you suggest a simpler regular expression or a more efficient approach?
    import re

    # Valid chars: . a-z 0-9
    def check(test_str):
        # http://docs.python.org/library/re.html
        # re.search returns None if no position in the string matches the pattern
        # Pattern to search for any character other than . a-z 0-9
        pattern = r'[^\.a-z0-9]'
        if re.search(pattern, test_str):
            # A character other than . a-z 0-9 was found
            print('Invalid : %r' % (test_str,))
        else:
            # No character other than . a-z 0-9 was found
            print('Valid : %r' % (test_str,))

    check(test_str='abcde.1')
    check(test_str='abcde.1#')
    check(test_str='ABCDE.12')
    check(test_str='_-/>"!@#12345abcde<')

    '''
    Output:
    Valid : 'abcde.1'
    Invalid : 'abcde.1#'
    Invalid : 'ABCDE.12'
    Invalid : '_-/>"!@#12345abcde<'
    '''
Answer
Final(?) edit
Answer, wrapped up in a function, with annotated interactive session:
    >>> import re
    >>> def special_match(strg, search=re.compile(r'[^a-z0-9.]').search):
    ...     return not bool(search(strg))
    ...
    >>> special_match("")
    True
    >>> special_match("az09.")
    True
    >>> special_match("az09.\n")
    False
    # The above test case is to catch out any attempt to use re.match()
    # with a `$` instead of `\Z` -- see point (6) below.
    >>> special_match("az09.#")
    False
    >>> special_match("az09.X")
    False
    >>>
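For comparison, the same check can be sketched without a regular expression, using a set of allowed characters. This is not part of the answer itself: the names `ALLOWED` and `only_allowed` are made up for this illustration, and no claim is made here about how its speed compares with the regex.

```python
import string

# Hypothetical names, chosen for this illustration only.
ALLOWED = frozenset(string.ascii_lowercase + string.digits + ".")


def only_allowed(strg):
    """Return True if every character of strg is one of a-z, 0-9 or '.'."""
    # issuperset() accepts any iterable, so the string is checked char by char.
    return ALLOWED.issuperset(strg)


print(only_allowed("az09."))    # only allowed characters
print(only_allowed("az09.\n"))  # the trailing newline is not allowed
```

Like `special_match`, this sketch accepts the empty string.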
Note: There is a comparison with using re.match() further down in this answer. Further timings show that match() would win with much longer strings; match() seems to have a much larger overhead than search() when the final answer is True; this is puzzling (perhaps it’s the cost of returning a MatchObject instead of None) and may warrant further rummaging.
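The comparison mentioned in the note can be repeated with the `timeit` module. This is a rough sketch only: the helper names `via_search` and `via_match` are invented here, the test string is the one used in the timings further down, and the numbers depend on the machine and Python version, so no expected output is shown.

```python
import re
import timeit

# Hypothetical helper names; the patterns follow the ones discussed in this answer.
# search(): look for a single disallowed character anywhere in the string.
_search = re.compile(r'[^a-z0-9.]').search
# match(): require the whole string to consist of allowed characters.
# '*' is used instead of '+' so that the empty string is accepted,
# which is what special_match("") returns.
_match = re.compile(r'[a-z0-9.]*\Z').match


def via_search(strg):
    return not bool(_search(strg))


def via_match(strg):
    return bool(_match(strg))


if __name__ == "__main__":
    t = 'jsdlfjdsf12324..3432jsdflsdf'
    number = 100000
    for func in (via_search, via_match):
        seconds = timeit.timeit(lambda: func(t), number=number)
        print("%s: %.3f usec per call" % (func.__name__, seconds / number * 1e6))
```

Trying a much longer `t`, and a `t` containing an invalid character, should show whether the ordering changes as the note suggests.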
==== Earlier text ====
The [previously] accepted answer could use a few improvements:
(1) Presentation gives the appearance of being the result of an interactive Python session:
    reg=re.compile('^[a-z0-9\.]+$')
    >>>reg.match('jsdlfjdsf12324..3432jsdflsdf')
    True

but match() doesn’t return True; it returns a match object, or None when there is no match.
(2) For use with match(), the ^ at the start of the pattern is redundant, and appears to be slightly slower than the same pattern without the ^
(3) Should foster the habit of using a raw string, automatically and unthinkingly, for any re pattern
(4) The backslash in front of the dot/period is redundant
(5) Slower than the OP’s code!
    prompt>rem OP's version -- NOTE: OP used raw string!

    prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import re;reg=re.compile(r'[^a-z0-9\.]')" "not bool(reg.search(t))"
    1000000 loops, best of 3: 1.43 usec per loop

    prompt>rem OP's version w/o backslash

    prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import re;reg=re.compile(r'[^a-z0-9.]')" "not bool(reg.search(t))"
    1000000 loops, best of 3: 1.44 usec per loop

    prompt>rem cleaned-up version of accepted answer

    prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import re;reg=re.compile(r'[a-z0-9.]+\Z')" "bool(reg.match(t))"
    100000 loops, best of 3: 2.07 usec per loop

    prompt>rem accepted answer

    prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import re;reg=re.compile('^[a-z0-9\.]+$')" "bool(reg.match(t))"
    100000 loops, best of 3: 2.08 usec per loop
(6) Can produce the wrong answer!!
    >>> import re
    >>> bool(re.compile('^[a-z0-9\.]+$').match('1234\n'))
    True # uh-oh
    >>> bool(re.compile('^[a-z0-9\.]+\Z').match('1234\n'))
    False
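Points (1) to (4) and (6) can be checked with a handful of assertions. This is a minimal sketch, not something from the original answer. In particular, `re.fullmatch` only exists from Python 3.4 onward, so it is newer than the Python 2.6 used for the timings, and it is shown here merely as one more way to sidestep the `$` pitfall.

```python
import re

s = 'jsdlfjdsf12324..3432jsdflsdf'

# (1) match() returns a match object (or None), never True itself.
m = re.compile(r'^[a-z0-9\.]+$').match(s)
assert m is not True and bool(m) is True

# (2) match() is already anchored at the start, so '^' changes nothing here.
with_caret = re.compile(r'^[a-z0-9.]+\Z')
without_caret = re.compile(r'[a-z0-9.]+\Z')
for text in (s, 'x' + s, '#' + s, ''):
    assert bool(with_caret.match(text)) == bool(without_caret.match(text))

# (3) Without a raw string, '\b' is a backspace character, not a word boundary.
assert re.search('\bcat\b', 'a cat') is None
assert re.search(r'\bcat\b', 'a cat') is not None

# (4) Inside a character class a dot is literal, with or without a backslash.
for text in ('a.b', 'a#b', '...'):
    escaped = re.match(r'[a-z0-9\.]+\Z', text)
    unescaped = re.match(r'[a-z0-9.]+\Z', text)
    assert bool(escaped) == bool(unescaped)

# (6) '$' also matches just before a trailing newline; '\Z' does not.
assert bool(re.match(r'^[a-z0-9.]+$', '1234\n')) is True    # the wrong answer
assert bool(re.match(r'[a-z0-9.]+\Z', '1234\n')) is False

# re.fullmatch (Python 3.4+) must consume the entire string, newline included.
assert re.fullmatch(r'[a-z0-9.]+', '1234\n') is None
assert re.fullmatch(r'[a-z0-9.]+', '1234') is not None

print("all checks passed")
```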