Intention
To create a regex that produces a match when a string contains any character (ASCII, Unicode, or otherwise) that does not fall into any of the Unicode ranges for Chinese characters. The match itself does not matter; what matters is whether non-Chinese characters are present. Rare and unused Chinese characters within those ranges are intentionally accepted as well. The function returns None when there is no match, indicating to the caller that the string passed in was pure Chinese.
Code
Python 3.8
chineseRegexSet = "[\u4E00-\u9FFF]|[\u3400-\u4BDF]|[\u20000-\u2A6DF]|[\u2A700-\u2B73F]|[\u2B740-\u2B81F]|[\u2B820-\u2CEAF]|[\uF900-\uFAFF]|[\u2F800-\u2FA1F]"

def ContainsNonChineseCharacters(hanziWord):
    # negative lookahead
    match = search("(?!" + chineseRegexSet + ")+", hanziWord)
    if match:
        if _DEBUG:
            PrintDebugError(hanziWord)
            PrintDebugError(hanziWord, utfEncode=True)
    else:
        _LOGGER.debug(hanziWord)
        if _DEBUG:
            PrintDebug(hanziWord)
            PrintDebug(hanziWord, utfEncode=True)
    return match
Attempted Regex Solutions
Interpretation: any non-chinese character neg. lookahead set, at least once
(?![\u4E00-\u9FFF]|[\u3400-\u4BDF]|[\u20000-\u2A6DF]|[\u2A700-\u2B73F]|[\u2B740-\u2B81F]|[\u2B820-\u2CEAF]|[\uF900-\uFAFF]|[\u2F800-\u2FA1F])+
Interpretation: Any non-singular Chinese character from any UTF set
(?![\u4E00-\u9FFF]+|[\u3400-\u4BDF]+|[\u20000-\u2A6DF]+|[\u2A700-\u2B73F]+|[\u2B740-\u2B81F]+|[\u2B820-\u2CEAF]+|[\uF900-\uFAFF]+|[\u2F800-\u2FA1F]+)*
Test Cases
Case | Expected Result
大家好 | 0 matches
00你是谁 | >=1 match
s%%2 | >=1 match
你Привет | >=1 match
Thank you for your time!
Answer
First of all, \u20000 doesn't mean what you think it does. Because \u sequences must be exactly 4 digits long, it refers to U+2000 followed by the digit 0. For characters above 0xFFFF, Python provides \U, which must be followed by exactly 8 digits (e.g. \U00020000).
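The difference between the two escapes can be checked directly in a Python string literal. The following is a small sketch; the variable names are only illustrative.

```python
# "\u" consumes exactly four hex digits, so the trailing 0 is a separate character.
short_escape = "\u20000"
# "\U" consumes exactly eight hex digits and yields a single code point.
long_escape = "\U00020000"

print(len(short_escape))                     # 2
print([hex(ord(c)) for c in short_escape])   # ['0x2000', '0x30']
print(len(long_escape))                      # 1
print(hex(ord(long_escape)))                 # 0x20000
```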
Secondly, [A-B]|[C-D]|... is best written as [A-BC-D...].
With the above fix and the above simplification, we have this:
[\u3400-\u4BDF\u4E00-\u9FFF\uF900-\uFAFF\U00020000-\U0002A6DF\U0002A700-\U0002B73F\U0002B740-\U0002B81F\U0002B820-\U0002CEAF\U0002F800-\U0002FA1F]
There are two ways of approaching the problem:
Does the string contain only characters from that class?
is_just_han = re.search("^[...]*$", str) # or regex.search
Does the string contain a character from outside of that class?
is_just_han = not re.search("[^...]", str) # or regex.search
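Both approaches can be sketched with the standard re module as follows. The names HAN_RANGES, is_just_han and contains_non_han are made up for this illustration, and the ranges are copied verbatim from the class above. Note that the Unicode charts list CJK Extension A as U+3400 to U+4DBF, so the 4BDF bound may be a transposition worth checking. The sketch also swaps "^[...]*$" for re.fullmatch, because "$" also matches just before a trailing newline.

```python
import re

# Ranges copied as-is from the answer's character class (not independently verified).
HAN_RANGES = (
    "\u3400-\u4BDF\u4E00-\u9FFF\uF900-\uFAFF"
    "\U00020000-\U0002A6DF\U0002A700-\U0002B73F"
    "\U0002B740-\U0002B81F\U0002B820-\U0002CEAF"
    "\U0002F800-\U0002FA1F"
)

ONLY_HAN = re.compile("[" + HAN_RANGES + "]*")   # approach 1: only class members
NON_HAN = re.compile("[^" + HAN_RANGES + "]")    # approach 2: any outsider

def is_just_han(text):
    """True if every character of text is inside the class."""
    return ONLY_HAN.fullmatch(text) is not None

def contains_non_han(text):
    """True if at least one character of text is outside the class."""
    return NON_HAN.search(text) is not None

for word in ["大家好", "00你是谁", "s%%2", "你Привет"]:
    print(word, is_just_han(word), contains_non_han(word))
```

The two functions are logical negations of each other, so either one is enough in practice; the negated class tends to read more directly when the question is "is there anything that does not belong?".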
If you use the regex module instead of the re module, you gain access to \p{Han} (short for \p{Script=Han}) and its negation \P{Han} (short for \P{Script=Han}). This Unicode property is a close match for the characters you are trying to match. I'll let you determine if it's right for you or not.
is_just_han = regex.search(r"^\p{Han}*$", str)
is_just_han = not regex.search(r"\P{Han}", str)
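A minimal sketch of the property-based check, assuming the third-party regex package is installed (pip install regex). The function name is_just_han_script is hypothetical, and the import is guarded so the snippet still loads where the package is missing.

```python
try:
    import regex  # third-party package, not part of the standard library
except ImportError:
    regex = None

def is_just_han_script(text):
    """True if no character of text falls outside Script=Han."""
    if regex is None:
        raise RuntimeError("the 'regex' package is required for \\p{Han}")
    # \P{Han} matches any single character whose script is not Han.
    return regex.search(r"\P{Han}", text) is None

if regex is not None:
    for word in ["大家好", "00你是谁", "s%%2", "你Привет"]:
        print(word, is_just_han_script(word))
```

Because Script=Han is defined by the Unicode character database rather than by a hand-written list of ranges, the set of accepted characters may differ slightly from the explicit class, so it is worth testing against your own data.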