I am trying to read a PDF using Python, and the extracted content has many newline (CRLF) characters. I tried removing them using the code below:
from tika import parser

filename = 'myfile.pdf'
raw = parser.from_file(filename)
content = raw['content']
content = content.replace("\r\n", "")
print(content)
But the output remains unchanged. I also tried using double backslashes, which didn't fix the issue. Can someone please advise?
Answer
I don't have access to your PDF file, so I processed one on my system. I also don't know whether you need to remove all newlines or just double newlines. The code below removes double newlines, which makes the output more readable.
Please let me know if this works for your current needs.
from tika import parser

filename = 'myfile.pdf'

# Parse the PDF
parsedPDF = parser.from_file(filename)

# Extract the text content from the parsed PDF
pdf = parsedPDF["content"]

# Convert double newlines into single newlines
pdf = pdf.replace('\n\n', '\n')

#####################################
# Do something with the PDF
#####################################
print(pdf)
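
One thing to keep in mind is that a single replace pass only shortens runs of newlines: three newlines in a row become two, not one. It is also possible that the extracted text contains plain "\n" line endings rather than "\r\n", which would explain why replacing "\r\n" changed nothing. The sketch below is not part of the answer's code; the helper names collapse_blank_lines and remove_all_newlines are made up for illustration, and the sample string stands in for the text returned by the parser. It shows one way to cover both cases (squeeze blank lines, or remove every newline) using only the standard library.

```python
import re


def collapse_blank_lines(text: str) -> str:
    """Normalize line endings, then squeeze runs of newlines into one."""
    # Treat Windows (\r\n) and bare carriage-return (\r) endings as plain \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Replace any run of two or more newlines with a single newline,
    # then trim leading and trailing whitespace.
    return re.sub(r"\n{2,}", "\n", text).strip()


def remove_all_newlines(text: str) -> str:
    """Join everything into one line separated by single spaces."""
    # str.split() with no arguments splits on any run of whitespace,
    # so this also collapses repeated spaces and tabs.
    return " ".join(text.split())


# Hypothetical sample standing in for parsedPDF["content"].
sample = "Title\r\n\r\n\r\nFirst line\n\nSecond line\n"

print(collapse_blank_lines(sample))
print(remove_all_newlines(sample))
```

To apply this to the PDF text, pass the extracted content string to whichever helper matches your needs. Note that remove_all_newlines is deliberately aggressive and also collapses other whitespace; use collapse_blank_lines if the line structure should be preserved.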