How to convert Web PDF to Text

Question

I want to convert web PDFs such as https://archives.nseindia.com/corporate/ICRA_26012022091856_BSER3026012022.pdf, and many more, into text without saving them to my PC, because thousands of such announcements come up daily. Hence I wanted to convert them to text without saving them on my …

Accepted Answer

There are several ways to do this, but the simplest is to download the PDF locally and then use one of the following Python modules to extract the text: pdfplumber, tesseract, pdftotext, … (pdfplumber and pdftotext read the text layer embedded in the PDF; tesseract performs OCR, which is only needed for scanned pages.)

Here is a simple code example for that, using pdfplumber:

    from urllib.request import urlopen
    import pdfplumber

    url = 'https://archives.nseindia.com/corporate/ICRA_26012022091856_BSER3026012022.pdf'

    # Download the PDF and write it to a local file
    response = urlopen(url)
    with open("img.pdf", 'wb') as file:
        file.write(response.read())

    try:
        pdf = pdfplumber.open('img.pdf')
    except Exception:
        # Some files are not PDFs (these are annexes and we don't want them),
        # or the PDF could not be read (damaged?).
        print('Error. Are you sure this is a PDF?')
        # Inside a loop over many URLs, use `continue` here instead.
        raise SystemExit(1)

    # pdfplumber text extraction (first page only)
    page = pdf.pages[0]
    text = page.extract_text()

EDIT: My bad, I just realised you asked for a way to do this "without saving it to my PC". That being said, I also scrape a lot of PDFs (thousands as well), but I save all of them as "img.pdf", so each download overwrites the previous one and I end up with only one PDF file on disk. I can't offer a solution for extracting PDF text without saving the file. Sorry about that :'(
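For what it's worth, the download-then-parse approach above can probably be adapted to skip the disk entirely. The following is only a minimal sketch, relying on the fact that pdfplumber.open accepts a file-like object as well as a path, so the downloaded bytes can be wrapped in io.BytesIO. The helper names (fetch_pdf_bytes, looks_like_pdf, pdf_bytes_to_text), the User-Agent header and the timeout value are illustrative assumptions, not part of the original answer, and the site may need different request headers.

```python
import io
from urllib.request import Request, urlopen


def fetch_pdf_bytes(url, timeout=30):
    """Download a file and return its raw bytes without writing to disk."""
    # Assumption: a browser-like User-Agent is sent because some servers
    # reject urllib's default one. Adjust or remove as needed.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return response.read()


def looks_like_pdf(data):
    """Cheap sanity check: PDF files normally start with the '%PDF-' marker."""
    return data.lstrip().startswith(b"%PDF-")


def pdf_bytes_to_text(data):
    """Extract the text of every page from a PDF held in memory."""
    if not looks_like_pdf(data):
        raise ValueError("Downloaded data does not look like a PDF")

    # Imported here so the helpers above work without the third-party
    # dependency installed (pip install pdfplumber).
    import pdfplumber

    # BytesIO gives pdfplumber a file-like object backed by memory.
    with pdfplumber.open(io.BytesIO(data)) as pdf:
        # extract_text() may yield nothing for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)
```

With the URL from the question, usage would be text = pdf_bytes_to_text(fetch_pdf_bytes(url)). For thousands of announcements a day, the same two calls can go inside a loop over the URLs, catching ValueError to skip files that are not PDFs. Scanned PDFs without a text layer would still come back empty and would need an OCR step instead.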

Answer