As a hobby, I started a project with Amazon Textract, which extracts text from a photo or a PDF, and I have run into a problem. According to its docs, the detected text comes back as a list of small “blocks” (one per word, line, and so on). Printing the blocks works fine, but if I want to use that text somewhere else, such as in an email, I need the whole text as a single string. So I need the text from all the blocks combined into a single value for further use. I have been stuck on this for a few days. Help appreciated. Thank you.
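import boto3

def processor(name):
    textract = boto3.client('textract')
    # bucketName is assumed to be defined elsewhere (e.g. at module level)
    response = textract.detect_document_text(
        Document={
            'S3Object': {
                'Bucket': bucketName,
                'Name': name
            }
        }
    )
    # Print the text of every LINE block in the response
    for item in response["Blocks"]:
        if item["BlockType"] == "LINE":
            print(item["Text"])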
Answer
The one-liner below should do the job. It joins the text of every LINE block into a single string, separated by spaces:
single_response = ' '.join(item["Text"] for item in response["Blocks"] if item["BlockType"] == "LINE")
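If you would rather keep the line breaks and have something reusable, the same idea can be wrapped in a small helper. This is only a sketch: `blocks_to_text` and `sample_response` are names made up for illustration, and the sample dictionary is a hand-written stand-in for a Textract response (it only mimics the `Blocks`, `BlockType`, and `Text` keys used in the question), so it runs without AWS access.

```python
def blocks_to_text(response, block_type="LINE", sep="\n"):
    """Join the Text of every block of the given type into one string.

    Hypothetical helper: it only assumes the response is a dict with a
    "Blocks" list whose items carry "BlockType" and (usually) "Text".
    """
    return sep.join(
        block["Text"]
        for block in response.get("Blocks", [])
        # PAGE blocks have no "Text" key, so check for it before reading
        if block.get("BlockType") == block_type and "Text" in block
    )


# Hand-made stand-in for a Textract response (not real API output)
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Hello world"},
        {"BlockType": "WORD", "Text": "Hello"},
        {"BlockType": "WORD", "Text": "world"},
        {"BlockType": "LINE", "Text": "Second line"},
    ]
}

full_text = blocks_to_text(sample_response)
print(full_text)
```

Filtering on LINE rather than WORD avoids duplicating the text, since each word also appears inside its line. Inside `processor` you could then `return blocks_to_text(response)` instead of printing, and pass the result to whatever sends the email.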