Find non-ASCII line or character in file using Python [closed]

Question

Closed 2 years ago. This question needs debugging details and is not currently accepting answers.

I am …

Accepted Answer

To be clear, cp1252 is not a "form of ASCII"; it's an ASCII superset, so you're really looking for non-cp1252 bytes here.

The simplest solution is to open the file with errors='replace', then search each line for the replacement character:

    def get_failed_character(filepath):
        # FLOW_FILE_ENCODING is the encoding constant from the question
        with open(filepath, encoding=FLOW_FILE_ENCODING, errors='replace') as f:
            for num, line in enumerate(f, 1):
                if '\ufffd' in line:  # U+FFFD is the Unicode replacement character
                    print(num)

I will note that this is not a particularly safe way of checking: cp1252 has mappings for all but five possible byte values, so it's fairly likely that text in some other ASCII superset encoding will pass this test (it'll just produce gibberish for the bytes outside the ASCII range). This is why ASCII supersets (aside from UTF-8) are such a bad idea. Without knowing the encoding ahead of time, you're likely to successfully decode the text to garbage, because most of these encodings will decode data written in another one without error; the result is just gibberish to human beings. You need to know the real encoding, or you're just making bad guesses.

If your goal is to find the non-ASCII cp1252 characters (your question is worded a little unclearly), this will still work; just change the argument to encoding='ascii' so that every non-ASCII byte becomes the replacement character.
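If you also need the position of each offending byte rather than just the line number, one option is to read the file in binary mode and decode each line strictly, using the start attribute of UnicodeDecodeError to locate the failure. This is only a minimal sketch and not the method from the answer above: find_undecodable is a hypothetical helper name, and the code assumes a single-byte ASCII superset such as cp1252 or ascii, where lines end with the byte 0x0A.

```python
def find_undecodable(path, encoding="cp1252"):
    """Return (line_number, byte_offset, byte_value) for every byte that
    cannot be decoded with `encoding`.

    Assumption: `encoding` is a single-byte ASCII superset (for example
    'cp1252' or 'ascii'), so splitting the raw file on b'\\n' is safe.
    Line numbers are 1-based; byte offsets within a line are 0-based.
    """
    hits = []
    with open(path, "rb") as f:
        for num, raw in enumerate(f, 1):
            pos = 0
            while pos < len(raw):
                try:
                    # Strict decoding raises on the first undecodable byte.
                    raw[pos:].decode(encoding)
                    break  # the rest of the line decodes cleanly
                except UnicodeDecodeError as exc:
                    bad = pos + exc.start  # offset of the failing byte in the line
                    hits.append((num, bad, raw[bad]))
                    pos = bad + 1  # resume scanning after the bad byte
    return hits
```

For example, if the second line of a file is the bytes b'bad \x81 byte', find_undecodable(path) would report (2, 4, 129), because 0x81 has no mapping in Python's cp1252 codec. Passing encoding='ascii' instead reports every non-ASCII byte, which matches the other reading of the question.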


Answer