As a hobby, I started a project with Amazon Textract, which extracts text from a photo or a PDF, and I have run into a problem. According to what I read in its docs, every word in the photo is a small "block". Printing the blocks works fine, but if I want to use that text somewhere else, such as in an email, I need the whole text as a single string. So I need the text of all the blocks stored in a single variable for further use. I have been stuck on this for a few days. Help appreciated. Thank you.
import boto3

def processor(name):
    textract = boto3.client('textract')
    # bucketName is defined elsewhere in my script
    response = textract.detect_document_text(
        Document={
            'S3Object': {
                'Bucket': bucketName,
                'Name': name
            }
        }
    )
    # Print the text of every detected line
    for item in response["Blocks"]:
        if item["BlockType"] == "LINE":
            print(item["Text"])
Answer
The one-liner below should do the job. It joins the text of every LINE block into a single string, separated by spaces:
single_response = ' '.join(item["Text"] for item in response["Blocks"] if item["BlockType"] == "LINE")
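If you would rather keep the line breaks (for an email body, say), the same idea can be wrapped in a small helper. This is only a sketch: `blocks_to_text` and `sample_response` are names made up for illustration, and the sample dictionary is hand-written to mimic the shape of a `detect_document_text` response, not real Textract output.

```python
def blocks_to_text(response, separator="\n"):
    """Join the text of every LINE block in a Textract-style response dict."""
    lines = [
        block["Text"]
        for block in response.get("Blocks", [])
        # WORD blocks repeat the same text as their parent LINE, so skip them
        if block.get("BlockType") == "LINE" and "Text" in block
    ]
    return separator.join(lines)


# Hypothetical sample, shaped like a detect_document_text response
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Hello world"},
        {"BlockType": "WORD", "Text": "Hello"},
        {"BlockType": "WORD", "Text": "world"},
        {"BlockType": "LINE", "Text": "Second line"},
    ]
}

full_text = blocks_to_text(sample_response)
print(full_text)
```

In your `processor` function you could then return `blocks_to_text(response)` instead of printing each line, and pass `separator=" "` to get the same result as the one-liner.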