
Regex – Negative Lookahead to match a string with any non-Chinese UTF characters

Intention

The goal is a regex that produces a match whenever a string contains any character, ASCII or otherwise, that falls outside the Unicode ranges for Chinese characters. The content of the match does not matter, only that a non-Chinese character is present. Rare and unused Chinese characters are intentionally included in the accepted ranges. The function returns None when there is no match, which tells the caller that the string passed in was pure Chinese.

Code

Python 3.8

chineseRegexSet = "[\u4E00-\u9FFF]|[\u3400-\u4BDF]|[\u20000-\u2A6DF]|[\u2A700-\u2B73F]|[\u2B740-\u2B81F]|[\u2B820-\u2CEAF]|[\uF900-\uFAFF]|[\u2F800-\u2FA1F]"
def ContainsNonChineseCharacters(hanziWord):
    #negative lookahead
    match = search("(?!" + chineseRegexSet + ")+", hanziWord)
    if match:
        if _DEBUG:
            PrintDebugError(hanziWord)
            PrintDebugError(hanziWord, utfEncode=True)
    else:
        _LOGGER.debug(hanziWord)
        if _DEBUG:
            PrintDebug(hanziWord)
            PrintDebug(hanziWord, utfEncode=True)

    return match

Attempted Regex Solutions

Interpretation: any non-chinese character neg. lookahead set, at least once

(?![\u4E00-\u9FFF]|[\u3400-\u4BDF]|[\u20000-\u2A6DF]|[\u2A700-\u2B73F]|[\u2B740-\u2B81F]|[\u2B820-\u2CEAF]|[\uF900-\uFAFF]|[\u2F800-\u2FA1F])+

Interpretation: Any non-singular Chinese character from any UTF set

(?![\u4E00-\u9FFF]+|[\u3400-\u4BDF]+|[\u20000-\u2A6DF]+|[\u2A700-\u2B73F]+|[\u2B740-\u2B81F]+|[\u2B820-\u2CEAF]+|[\uF900-\uFAFF]+|[\u2F800-\u2FA1F]+)*

Test Cases

Case Expected Result

大家好 0 matches

00你是谁 >=1 match

s%%2 >=1 match

你Привет >=1 match

Thank you for your time!


Answer

First of all, \u20000 doesn’t mean what you think it does. Because \u sequences must be exactly 4 digits long, it refers to U+2000 followed by the digit 0. For characters above 0xFFFF, Python provides \U, which must be followed by exactly 8 digits (e.g. \U00020000).
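
To make the difference concrete, here is a small sketch that compares the two escape forms in ordinary (non-raw) Python string literals. The variable names are only for illustration.

```python
# \u takes exactly 4 hex digits, so the trailing "0" is a separate character.
four_digit = "\u20000"

# \U takes exactly 8 hex digits and yields the single code point U+20000.
eight_digit = "\U00020000"

# Show how many characters each literal contains and which code points they are.
print(len(four_digit), [hex(ord(c)) for c in four_digit])
print(len(eight_digit), hex(ord(eight_digit)))
```

The first literal is two characters long, which is why a range such as [\u20000-\u2A6DF] does not cover the supplementary CJK planes.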


Secondly,

[A-B]|[C-D]|...

is best written as

[A-BC-D...]

With the above fix and the above simplification, we have this:

[\u3400-\u4BDF\u4E00-\u9FFF\uF900-\uFAFF\U00020000-\U0002A6DF\U0002A700-\U0002B73F\U0002B740-\U0002B81F\U0002B820-\U0002CEAF\U0002F800-\U0002FA1F]

There are two ways of approaching the problem:

  1. Does the string contain only characters from that class?

    is_just_han = re.search("^[...]*$", str)     # or regex.search
    
  2. Does the string contain a character from outside of that class?

    is_just_han = not re.search("[^...]", str)   # or regex.search
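
As a rough sketch, the two approaches can be tried against the test cases from the question using the standard re module. The names HAN_CLASS, ONLY_HAN, NON_HAN and contains_non_han are hypothetical, and the ranges are copied from the class above as given rather than re-checked against the Unicode charts.

```python
import re

# Ranges copied from the combined character class above (an assumption that
# they are the ones wanted; they are not verified here).
HAN_CLASS = (
    "\u3400-\u4BDF\u4E00-\u9FFF\uF900-\uFAFF"
    "\U00020000-\U0002A6DF\U0002A700-\U0002B73F\U0002B740-\U0002B81F"
    "\U0002B820-\U0002CEAF\U0002F800-\U0002FA1F"
)

ONLY_HAN = re.compile("^[" + HAN_CLASS + "]*$")  # approach 1: whole string is in the class
NON_HAN = re.compile("[^" + HAN_CLASS + "]")     # approach 2: any character outside the class


def contains_non_han(text):
    """Return True if text has at least one character outside HAN_CLASS."""
    return NON_HAN.search(text) is not None


# Run both approaches on the question's test cases; the two columns should agree.
for sample in ["大家好", "00你是谁", "s%%2", "你Привет"]:
    print(sample, contains_non_han(sample), ONLY_HAN.search(sample) is None)
```

One design note: in Python, $ also matches just before a trailing newline, so approach 1 would accept "大家好\n". If that matters, re.fullmatch with the same class, or approach 2, avoids it.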
    

If you use the regex module instead of the re module, you gain access to \p{Han} (short for \p{Script=Han}) and its negation \P{Han} (short for \P{Script=Han}). This Unicode property is a close match for the characters you are trying to match. I’ll let you determine whether it’s right for you.

is_just_han = regex.search(r"^\p{Han}*$", str)

is_just_han = not regex.search(r"\P{Han}", str)
User contributions licensed under: CC BY-SA