How to remove escape sequences like '\xe2' or '\x0c' in Python

I am working on a content-based search project. I use the 'pdftotext' command-line utility on Ubuntu, which writes all the text from a PDF to a text file. It also writes the bullets, so when I read the file to index each word, escape sequences such as '\x01' get indexed too. I know this is because of the bullets (•).

I want only the text, so is there a way to remove these escape sequences? I have tried something like this:

escape_char = re.compile('\+x[0123456789abcdef]*')
re.sub(escape_char, " ", string)

But this does not remove the escape sequences.

Thanks in advance.

Answer

The problem is that \xXX is just a representation of a control character, not the character itself. The string contains a single character, not a backslash followed by "x" and two hex digits, so you can't literally match \x unless you're working with the repr of the string.
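To make the distinction concrete, here is a small sketch (assuming Python 3; the sample string and the name `literal_pattern` are made up for illustration). A pattern that looks for a literal backslash-x sequence finds nothing in the string itself, but does match its repr:

```python
import re

# "\x0c" is one form-feed character, not four separate characters.
text = "page one\x0cpage two"

# Pattern for a literal backslash, an "x", and two hex digits.
literal_pattern = re.compile(r"\\x[0-9a-f]{2}")

found_in_text = literal_pattern.search(text)        # no literal backslash in the string
found_in_repr = literal_pattern.search(repr(text))  # the repr spells the escape out

print(len("\x0c"))            # 1
print(found_in_text)          # None
print(found_in_repr.group())  # \x0c
```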

You can remove nonprintable characters using a character class:

re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]', '', text)

Example:

>>> re.sub(r'[\x00-\x1f\x7f-\xff]', '', ''.join(map(chr, range(256))))
' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~'
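If you are on Python 3 and read the pdftotext output as decoded text (assumed here to be UTF-8), the bullet arrives as the single character U+2022, which falls outside the \x7f-\xff range. A possible adaptation, not part of the answer above, is to replace the bullet explicitly and strip only the control characters. The helper name `clean_pdf_text` and the narrower \x7f-\x9f range (chosen so that accented Latin-1 letters survive) are assumptions for this sketch:

```python
import re

# Control characters to drop; tab (\x09), newline (\x0a) and carriage
# return (\x0d) are kept, as in the first character class above.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")

# Map the bullet character to a space so adjacent words stay separated.
BULLETS = str.maketrans({"\u2022": " "})


def clean_pdf_text(text):
    """Replace bullets with spaces, then remove control characters."""
    return CONTROL_CHARS.sub("", text.translate(BULLETS))


sample = "\u2022 first item\x0c\u2022 second item\n"
print(repr(clean_pdf_text(sample)))
```

Extend the `BULLETS` table if your PDFs use other bullet glyphs.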
User contributions licensed under: CC BY-SA