I am fetching the HTML source of many pages from one website. I need to convert it into a JSON object and combine it with other elements in a JSON document. I have seen many questions on the same topic, but none of them were helpful.
My code:
url = "https://totalhash.cymru.com/analysis/?1ce201cf28c6dd738fd4e65da55242822111bd9f"
htmlContent = requests.get(url, verify=False)
data = htmlContent.text
print("data",data)
jsonD = json.dumps(htmlContent.text)
jsonL = json.loads(jsonD)
ContentUrl='{ "url" : "'+str(urls)+'" ,'+"\n"+' "uid" : "'+str(uniqueID)+'" ,\n"page_content" : "'+jsonL+'" , \n"date" : "'+finalDate+'"}'
The code above gives me a unicode object; however, when I paste that output into JSONLint, it reports invalid JSON. Can somebody help me understand how I can convert the complete HTML into a JSON object?
Answer
jsonD = json.dumps(htmlContent.text) converts the raw HTML content into a JSON string representation.
jsonL = json.loads(jsonD) parses the JSON string back into a regular string/unicode object. This results in a no-op, as any escaping done by dumps() is reverted by loads(). jsonL contains the same data as htmlContent.text.
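The round trip can be checked with a small sketch. The `html` string here is a hypothetical stand-in for `htmlContent.text`, chosen only because it contains quotes and a newline:

```python
import json

# Hypothetical stand-in for htmlContent.text (not the real page).
html = '<a href="https://example.com">link</a>\n<p>It\'s "quoted"</p>'

encoded = json.dumps(html)     # a JSON string literal: quoted, with " and the newline escaped
decoded = json.loads(encoded)  # parsing it undoes that escaping again

round_trip_is_noop = decoded == html
print(round_trip_is_noop)      # the decoded value equals the original HTML
print(encoded != html)         # only the intermediate, encoded form differs
```

Only `encoded` is safe to embed in JSON text; `decoded` is the raw HTML again, unescaped quotes included.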
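Use json.dumps to generate your final JSON instead of building it by hand. Concatenating the raw HTML into a string leaves its double quotes, backslashes and newlines unescaped, which is why JSONLint rejects the result: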
ContentUrl = json.dumps({
    'url': str(urls),
    'uid': str(uniqueID),
    'page_content': htmlContent.text,
    'date': finalDate
})
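To see the difference without making a network request, here is a minimal sketch that compares the two approaches on a sample string. The helper `build_record`, the sample HTML, and the URL, uid and date values are all illustrative assumptions, not part of the original code:

```python
import json


def build_record(url, uid, page_content, date):
    """Hypothetical helper: serialize one page record to a JSON string."""
    return json.dumps({
        "url": str(url),
        "uid": str(uid),
        "page_content": page_content,
        "date": date,
    })


# Made-up sample standing in for a fetched page; it contains quotes and a newline.
sample_html = '<div class="x">line one\nline "two"</div>'

# Hand-built JSON: the quotes inside the HTML end the JSON string too early.
hand_built = '{"page_content": "' + sample_html + '"}'
try:
    json.loads(hand_built)
    hand_built_is_valid = True
except json.JSONDecodeError:
    hand_built_is_valid = False

# json.dumps escapes the HTML, so the record parses back to the same content.
record = build_record("https://example.com/page", 42, sample_html, "2020-01-01")
parsed = json.loads(record)

print(hand_built_is_valid)
print(parsed["page_content"] == sample_html)
```

Building a dict first and serializing it once at the end also makes it easy to combine the page content with the other fields, since json.dumps handles the escaping for every value.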