
Regex – Negative Lookahead to match a string with any non-Chinese UTF characters


I am trying to create a regex that produces a match when a string contains any character, ASCII, Unicode or otherwise, that does not fall into one of the Unicode ranges for Chinese characters. The match itself does not matter, only whether non-Chinese characters are present. Note that rare and unused Chinese characters are intentionally included in the valid ranges. The function returns None when there is no match, which tells the caller that the string passed in was purely Chinese.


Python 3.8

chineseRegexSet = "[\u4E00-\u9FFF]|[\u3400-\u4BDF]|[\u20000-\u2A6DF]|[\u2A700-\u2B73F]|[\u2B740-\u2B81F]|[\u2B820-\u2CEAF]|[\uF900-\uFAFF]|[\u2F800-\u2FA1F]"
def ContainsNonChineseCharacters(hanziWord):
    #negative lookahead
    match = search("(?!" + chineseRegexSet + ")+", hanziWord)
    if match:
        if _DEBUG:
            PrintDebugError(hanziWord, utfEncode=True)
        if _DEBUG:
            PrintDebug(hanziWord, utfEncode=True)

    return match

Attempted Regex Solutions

Interpretation: any non-chinese character neg. lookahead set, at least once


Interpretation: Any non-singular Chinese character from any UTF set


Test Cases

Case          Expected Result
大家好        0 matches
00你是谁      >=1 match
s%%2          >=1 match
你Привет      >=1 match

Thank you for your time!



First of all, \u20000 doesn't mean what you think it does. Because \u sequences must be exactly 4 digits long, it refers to U+2000 followed by the digit 0. For characters above 0xFFFF, Python provides \U, which must be followed by exactly 8 digits (e.g. \U00020000).
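To see the difference concretely, here is a small check of how Python parses the two escapes. The variable names are only illustrative.

```python
# "\u" consumes exactly four hex digits; "\U" consumes exactly eight.
four_digit = "\u20000"       # U+2000 followed by the literal digit "0"
eight_digit = "\U00020000"   # the single code point U+20000

# Show the length of each string and the code points it contains.
print(len(four_digit), [hex(ord(c)) for c in four_digit])
print(len(eight_digit), [hex(ord(c)) for c in eight_digit])
```

The first string has two characters and the second has one, which is why a range written as [\u20000-\u2A6DF] does not cover the intended block.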



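Second, an alternation of single-character classes such as [a-b]|[c-d]|[e-f] is best written as one class, [a-bc-de-f].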


With the above fix and the above simplification, we have this:
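A sketch of the corrected, merged class, keeping the ranges from the question. The names HAN_CLASS and HAN_RE are illustrative and not part of the original code.

```python
import re

# One character class instead of eight alternatives; code points above
# 0xFFFF use the eight-digit \U escape.
HAN_CLASS = (
    "["
    "\u4E00-\u9FFF"
    "\u3400-\u4BDF"            # as written in the question; the CJK Extension A block ends at U+4DBF
    "\U00020000-\U0002A6DF"
    "\U0002A700-\U0002B73F"
    "\U0002B740-\U0002B81F"
    "\U0002B820-\U0002CEAF"
    "\uF900-\uFAFF"
    "\U0002F800-\U0002FA1F"
    "]"
)

# Compiling once confirms the class is syntactically valid.
HAN_RE = re.compile(HAN_CLASS)
```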


There are two ways of approaching the problem:

  1. Does the string contain only characters from that class?

    is_just_han = re.search("^[...]*$", str)

  2. Does the string contain a character from outside of that class?

    is_just_han = not re.search("[^...]", str)
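Both approaches can be sketched against the test cases from the question. This is a minimal example under some assumptions: the ranges are copied from the question, the names HAN, is_just_han_v1 and is_just_han_v2 are made up for illustration, and fullmatch is swapped in for the ^...$ form because $ also matches just before a trailing newline.

```python
import re

# Ranges copied from the question, written without the surrounding brackets
# so they can be used in both a positive and a negated class.
HAN = (
    "\u4E00-\u9FFF\u3400-\u4BDF\uF900-\uFAFF"
    "\U00020000-\U0002A6DF\U0002A700-\U0002B73F\U0002B740-\U0002B81F"
    "\U0002B820-\U0002CEAF\U0002F800-\U0002FA1F"
)
ONLY_HAN = re.compile("[" + HAN + "]*")   # approach 1: every character is in the class
NON_HAN = re.compile("[^" + HAN + "]")    # approach 2: some character is outside the class


def is_just_han_v1(text):
    # True when the whole string consists of characters from the class.
    return ONLY_HAN.fullmatch(text) is not None


def is_just_han_v2(text):
    # True when no character outside the class can be found.
    return NON_HAN.search(text) is None


for s in ["大家好", "00你是谁", "s%%2", "你Привет"]:
    print(s, is_just_han_v1(s), is_just_han_v2(s))
```

Note that both helpers treat the empty string as "just Han"; add an explicit length check if that is not wanted.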

If you use the regex module instead of the re module, you gain access to \p{Han} (short for \p{Script=Han}) and its negation \P{Han} (short for \P{Script=Han}). This Unicode property is a close match for the characters you are trying to match. I'll let you determine if it's right for you or not.

is_just_han = regex.search(r"^\p{Han}*$", str)

is_just_han = not regex.search(r"\P{Han}", str)