Regex – Negative Lookahead to match a string with any non-Chinese UTF characters

Question

Intention to create a regex which creates a match when there is any character, ASCII, Unicode or otherwise, which does not fall into any of the valid UTF-8 ranges for Chinese characters. The match itself does not matter, but rather that the non-Chinese characters are present. Note that the presence of rare, and unused Chinese characters within the UTF-8 charset

Accepted Answer

First of all, u20000 doesn&#8217;t mean what you think it does. Because u sequences must be exactly 4 digits long, that&#8217;s refers to U+2000 and the digit 0. For characters above 0xFFFF, Python provides U, which must be followed by exactly 8 digits (e.g. U00020000).Secondly,[A-B]|[C-D]|...is best written as[A-BC-D...]With the above fix and the above simplification, we have this:[u3400-u4BDFu4E00-u9FFFuF900-uFAFFU00020000-U0002A6DFU0002A700-U0002B73FU0002B740-U0002B81FU0002B820-U0002CEAFU0002F800-U0002FA1F]There are two ways of approaching the problem:Does the string contain only characters from that class?is_just_han = re.search("^[...]*$", str)     # or regex.searchDoes the string contain a character from outside of that class?is_just_han = not re.search("[^...]", str)   # or regex.searchIf you use the regex module instead of the re module, you gain access to p{Han} (short for p{Script=Han}) and its negation P{Han} (short for P{Script=Han}). This Unicode property is a close match for the characters you are trying to match. I&#8217;ll let you determine if it&#8217;s right for you or not.is_just_han = regex.search("^p{Han}*$", str)is_just_han = regex.search("P{Han}", str)

Regex – Negative Lookahead to match a string with any non-Chinese UTF characters

Intention

Code

Attempted Regex Solutions

Test Cases

Advertisement

Answer