Find non-ASCII line or character in file using Python [closed]

I am trying to write a script that finds which line in a file contains non-ASCII characters (specifically “windows-1252”). I wrote this script hoping it would raise an error when it reaches the line containing the wrong character:

import argparse

FLOW_FILE_ENCODING = "windows-1252"


def get_failed_character(filepath):
    with open(filepath, encoding=FLOW_FILE_ENCODING) as f:
        for num, line in enumerate(f, 1):
            try:
                line.strip()
            except:
                print(num)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description="Parse file."
    )
    parser.add_argument("--file", help="File name")
    args = parser.parse_args()

    get_failed_character(args.file)

Answer

To be clear, cp1252 is not a “form of ASCII”, it’s an ASCII superset, so you’re really looking for non-cp1252 here.

The simplest solution here is to just use the errors='replace' mode, then search each line for the replacement character:

def get_failed_character(filepath):
    with open(filepath, encoding=FLOW_FILE_ENCODING, errors='replace') as f:
        for num, line in enumerate(f, 1):
            if '\ufffd' in line:  # U+FFFD is the Unicode replacement character
                print(num)
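
As a side note on why the script in the question never reaches its except block: in text mode the decoding happens while the file iterator produces each line (at the for statement), not inside line.strip(), so any UnicodeDecodeError is raised outside the try. A minimal sketch of an alternative, assuming you also want the position of the offending byte, is to read in binary mode and decode each line yourself. The function name find_undecodable_lines is hypothetical, and only the first bad byte on each line is reported:

```python
FLOW_FILE_ENCODING = "windows-1252"


def find_undecodable_lines(filepath, encoding=FLOW_FILE_ENCODING):
    """Return (line_number, byte_offset_in_line, byte_value) for each bad line.

    Hypothetical helper, not part of the original answer.
    """
    failures = []
    # Binary mode: iterating yields raw bytes, so nothing is decoded implicitly.
    with open(filepath, "rb") as f:
        for num, raw in enumerate(f, 1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as exc:
                # exc.start is the index of the first undecodable byte in this line.
                failures.append((num, exc.start, raw[exc.start]))
    return failures
```

Splitting binary data on newline bytes is safe here because cp1252 is a single-byte encoding; for a multi-byte encoding such as UTF-16 this approach would need adjusting.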

Note that this is not a particularly safe check. cp1252 has mappings for all but five of the 256 possible byte values, so text in some other ASCII superset encoding is fairly likely to pass this test; the bytes outside the ASCII range will simply decode to gibberish. This is why ASCII supersets (aside from UTF-8) are such a bad idea: without knowing the encoding ahead of time, you’re likely to decode the text successfully into garbage, because most supersets can decode data written in another encoding without raising an error. The result is valid as far as the codec is concerned and gibberish to human beings. You need to know the real encoding, or you’re just making bad guesses.
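
To make that caveat concrete, here is a small sketch using only built-in str and bytes methods. It decodes UTF-8 bytes as cp1252 without any error, then enumerates the byte values that Python's cp1252 codec refuses to decode. The variable names are illustrative only:

```python
# UTF-8 bytes decode "successfully" as cp1252, but the result is mojibake.
utf8_bytes = "café".encode("utf-8")      # b'caf\xc3\xa9'
decoded = utf8_bytes.decode("cp1252")    # no exception is raised
print(decoded)                           # cafÃ©

# Collect every single byte value that Python's cp1252 codec cannot decode.
undefined = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(value)
print([hex(v) for v in undefined])       # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']
```

Only those five bytes can ever trigger the replacement character when decoding as cp1252, so a file in another single-byte encoding will usually pass unnoticed.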

If your goal is to find the non-ASCII cp1252 characters (your question is worded a little unclearly), this will still work; just change the argument to encoding='ascii' so that every non-ASCII byte becomes the replacement character.
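
If you also want to see which characters are non-ASCII rather than only the line number, one possible variant (a sketch, not the approach above) is to keep decoding as cp1252 and test each character's code point. The function name find_non_ascii_chars is hypothetical:

```python
def find_non_ascii_chars(filepath, encoding="windows-1252"):
    """Return (line_number, column, character) for every non-ASCII character.

    Hypothetical helper. Bytes that the encoding cannot decode show up as
    U+FFFD, which is also outside ASCII and is therefore reported too.
    """
    hits = []
    with open(filepath, encoding=encoding, errors="replace") as f:
        for num, line in enumerate(f, 1):
            for col, ch in enumerate(line, 1):
                if ord(ch) > 127:  # ASCII covers code points 0 through 127
                    hits.append((num, col, ch))
    return hits
```

Columns are 1-based character positions, which for a single-byte encoding such as cp1252 match byte positions within the line.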



Source: stackoverflow