I am working on a content-based search project. For that I am using the ‘pdftotext’ command-line utility on Ubuntu, which writes all the text from a PDF to a text file. It also writes the bullets, so when I read the file to index each word, some escape sequences (like ‘\x01’) get indexed too. I know this is because of the bullets (•).
I want only the text, so is there any way to remove these escape sequences? I have tried something like this:
escape_char = re.compile('\+x[0123456789abcdef]*')
re.sub(escape_char, " ", string)
But this does not remove the escape sequences.
Thanks in advance.
Answer
The problem is that \xXX is just a representation of a control character, not the character itself. Therefore, you can’t literally match \x unless you’re working with the repr of the string.
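A small sketch of that distinction, using a made-up sample string (the variable names here are illustrative, not from the question). The pattern looks for a real backslash followed by x and two hex digits, so it finds nothing in the string itself and only matches in its repr:

```python
import re

# "\x01" in source code is ONE control character, not four characters.
s = "bullet\x01text"
print(len(s))     # 11
print(repr(s))    # 'bullet\x01text'

# This pattern matches a literal backslash, then "x", then two hex digits.
escape_like = re.compile(r'\\x[0-9a-f]{2}')

# No literal backslash exists in s, so nothing is replaced.
unchanged = escape_like.sub(' ', s)

# The repr does contain a literal backslash, so here the pattern matches.
on_repr = escape_like.sub(' ', repr(s))
print(on_repr)    # 'bullet text'
```

Working on the repr is rarely what you want for indexing, since it also adds quotes and escapes other characters.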
You can remove nonprintable characters using a character class:
re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]', '', text)
Example:
>>> re.sub(r'[\x00-\x1f\x7f-\xff]', '', ''.join(map(chr, range(256))))
' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~'
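For the indexing use case in the question, the same idea might be wrapped up roughly as follows. This is only a sketch under a few assumptions: the helper names clean_text and tokenize are hypothetical, the input is a Python 3 str, and the \x80-\xff range is deliberately left out so that accented letters such as é are kept. Control characters are replaced with a space rather than removed, so that words on either side of one are not fused together.

```python
import re

# Control characters except tab (\x09), newline (\x0a) and carriage
# return (\x0d), plus DEL (\x7f). Assumption: \x80-\xff is NOT stripped,
# because in a Python 3 str that range holds letters like 'é'.
_CONTROL_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]')

def clean_text(text):
    """Replace each control character with a single space."""
    return _CONTROL_CHARS.sub(' ', text)

def tokenize(text):
    """Split cleaned text on whitespace into words for indexing."""
    return clean_text(text).split()

sample = "\x01 first item\n\x01 second caf\xe9\x0c"
print(tokenize(sample))   # ['first', 'item', 'second', 'café']
```

If the text is read as bytes instead (as in Python 2), the original pattern with \x7f-\xff may be the better fit, at the cost of dropping any non-ASCII bytes.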