I want to convert web PDFs such as https://archives.nseindia.com/corporate/ICRA_26012022091856_BSER3026012022.pdf, and many more, into text without saving them to my PC, because thousands of such announcements come up daily. Is there a Python solution for this? Thanks
Answer
There are several ways to do this, but the simplest is to download the PDF locally and then use a Python module to extract the text.
Here is a simple code example for that (using pdfplumber):
    from urllib.request import urlopen

    import pdfplumber

    url = 'https://archives.nseindia.com/corporate/ICRA_26012022091856_BSER3026012022.pdf'

    # Download the PDF and save it locally (the file is overwritten on every run).
    response = urlopen(url)
    with open("img.pdf", "wb") as file:
        file.write(response.read())

    try:
        pdf = pdfplumber.open("img.pdf")
    except Exception:
        # Some files are not PDFs (these are annexes and we don't want them),
        # or the PDF could not be read (damaged?).
        print("Error. Are you sure this is a PDF?")
        raise SystemExit(1)  # inside a download loop, use `continue` here instead

    # pdfplumber text extraction (first page only)
    page = pdf.pages[0]
    text = page.extract_text()
EDIT: My bad, I just realised you asked "without saving it to my PC". That being said, I also scrape a lot of PDFs (thousands as well), but I save them all as "img.pdf", so they keep replacing each other and I end up with only one PDF file. I don't have a solution for extracting text from a PDF without saving the file. Sorry for that :'(
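For what it's worth, pdfplumber.open also accepts a file-like object, so one possible way to skip the file entirely is to wrap the downloaded bytes in io.BytesIO. The following is only a rough sketch, not something tested against the NSE archive: the function names (looks_like_pdf, pdf_bytes_to_text, pdf_text_from_url) are made up for illustration, the "%PDF-" header check is an assumed shortcut for spotting non-PDF downloads, and it assumes the server answers a plain urlopen request as in the code above.

```python
import io
from urllib.request import urlopen


def looks_like_pdf(data):
    # Assumption: a regular PDF starts with the "%PDF-" header marker.
    return data.startswith(b"%PDF-")


def pdf_bytes_to_text(data):
    """Return the text of every page of an in-memory PDF, or None if it is not a PDF."""
    if not looks_like_pdf(data):
        return None
    # Imported here so the header check above works even without pdfplumber installed.
    import pdfplumber

    # BytesIO makes the downloaded bytes look like an open file; nothing is written to disk.
    with pdfplumber.open(io.BytesIO(data)) as pdf:
        # extract_text() can come back empty for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def pdf_text_from_url(url):
    # Hypothetical helper: download into memory, then extract the text.
    with urlopen(url) as response:
        return pdf_bytes_to_text(response.read())
```

Calling pdf_text_from_url(url) with the URL from the question should then give the text of the announcement, or None if the download was not a PDF. Scanned PDFs without a text layer would still need real OCR, which this sketch does not cover.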