How to remove escape sequences like '\xe2' or '\x0c' in Python

Question

I am working on a content-based search project. For that I am using the 'pdftotext' command-line utility on Ubuntu, which writes all the text from a PDF to a text file. But it also writes bullets, so when I read the file to index each word, some escape sequences get indexed too (like '\x01'). I know it's because of the bullets (•).

Accepted Answer

The problem is that \xXX is just a representation of a control character, not the character itself. Therefore, you can't literally match \x unless you're working with the repr of the string.

You can remove nonprintable characters using a character class:

    re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]', '', text)

Example:

    >>> re.sub(r'[\x00-\x1f\x7f-\xff]', '', ''.join(map(chr, range(256))))
    ' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~'
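For the pdftotext case in the question, the character-class approach might be wrapped up as in the sketch below. It assumes Python 3 and that the file is UTF-8 encoded; the names `clean_text`, `CONTROL_CHARS` and the `extra` parameter are made up for illustration. Note one deliberate difference from the pattern in the answer: the upper range stops at \x9f rather than \xff so that accented letters such as é survive, and the bullet (U+2022, which shows up as the bytes \xe2\x80\xa2 in UTF-8) is removed separately after decoding.

```python
import re

# Control characters except tab (\x09), newline (\x0a) and carriage return (\x0d),
# plus DEL (\x7f) and the C1 control range (\x80-\x9f).
# Assumption: unlike the answer's pattern, \xa0-\xff are kept so accented letters survive.
CONTROL_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]')


def clean_text(raw: bytes, extra: str = '\u2022') -> str:
    """Decode UTF-8 bytes, then drop control characters and any characters in `extra`.

    `extra` defaults to the bullet character (U+2022); this default is an
    assumption based on the question, not something pdftotext requires.
    """
    # Decode first, so multi-byte sequences like b'\xe2\x80\xa2' become one character.
    text = raw.decode('utf-8', errors='replace')
    text = CONTROL_CHARS.sub('', text)
    for ch in extra:
        text = text.replace(ch, '')
    return text


# Hypothetical sample: a form feed, a bullet and a stray \x01.
sample = b'\x0cPage one\n\xe2\x80\xa2 first item\x01\n'
print(repr(clean_text(sample)))  # 'Page one\n first item\n'
```

If the indexer reads the file as text instead of bytes, opening it with `encoding='utf-8'` and applying only the `CONTROL_CHARS.sub` step should have the same effect.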

Answer