Convert HTML source code to a JSON object

Question

I am fetching the HTML source code of many pages from one website. I need to convert it into a JSON object and combine it with other elements in a JSON document. I have seen many questions on the same topic, but none of them were helpful. My code: The above code gives me a unicode type; however, when I put that output in jsonLint it

Accepted Answer

jsonD = json.dumps(htmlContent.text) converts the raw HTML content into a JSON string representation. jsonL = json.loads(jsonD) parses the JSON string back into a regular string/unicode object. This results in a no-op, as any escaping done by dumps() is reverted by loads(). jsonL contains the same data as htmlContent.text.

Try to use json.dumps to generate your final JSON instead of building the JSON by hand:

    ContentUrl = json.dumps({
        'url': str(urls),
        'uid': str(uniqueID),
        'page_content': htmlContent.text,
        'date': finalDate
    })
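To make the point above concrete, here is a minimal sketch that checks both claims: the dumps()/loads() round trip returns the original string, and a document built with json.dumps stays valid JSON even when the HTML contains quotes. The sample HTML, the build_document helper, and the URL, uid, and date values are all placeholders invented for illustration; they stand in for htmlContent.text and the asker's own variables, which are not shown in the question.

```python
import json

# Hypothetical stand-in for htmlContent.text (the real page source is not shown).
html_text = '<html><body><p class="intro">Hello, "world"</p></body></html>'

# dumps() escapes the string into JSON; loads() undoes exactly that escaping.
encoded = json.dumps(html_text)
decoded = json.loads(encoded)
assert decoded == html_text  # the round trip is a no-op


def build_document(url, uid, page_content, date):
    """Serialize the whole record in one step so json handles all escaping.

    Hypothetical helper; the field names follow the answer's example.
    """
    return json.dumps({
        'url': url,
        'uid': uid,
        'page_content': page_content,
        'date': date,
    })


# Placeholder values, assumed for the example only.
doc = build_document('http://example.com/page', 'abc123', html_text, '2015-01-01')

# The result parses back cleanly, embedded quotes included.
parsed = json.loads(doc)
assert parsed['page_content'] == html_text
```

Because the dictionary is serialized as a whole, the quotes and backslashes inside the HTML are escaped by the json module rather than by hand, which is the kind of mistake a validator such as jsonLint tends to flag in hand-built JSON.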


Answer