Convert HTML source code to a JSON object

I am fetching the HTML source code of many pages from one website. I need to convert it into a JSON object and combine it with other elements in a JSON document. I have seen many questions on the same topic, but none of them were helpful.

My code:

url = "https://totalhash.cymru.com/analysis/?1ce201cf28c6dd738fd4e65da55242822111bd9f"
htmlContent = requests.get(url, verify=False)
data = htmlContent.text
print("data",data)
jsonD = json.dumps(htmlContent.text)
jsonL = json.loads(jsonD)

ContentUrl='{ "url" : "'+str(urls)+'" ,'+"\n"+' "uid" : "'+str(uniqueID)+'" ,\n"page_content" : "'+jsonL+'" , \n"date" : "'+finalDate+'"}'

The code above gives me a unicode object. However, when I paste the output into JSONLint, it reports invalid JSON. Can somebody help me understand how I can convert the complete HTML into a JSON object?

Answer

jsonD = json.dumps(htmlContent.text) converts the raw HTML content into a JSON string representation. jsonL = json.loads(jsonD) parses the JSON string back into a regular string/unicode object. This results in a no-op, as any escaping done by dumps() is reverted by loads(). jsonL contains the same data as htmlContent.text.
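
The round trip can be checked with a small sketch. The HTML string below is made up for illustration and only stands in for htmlContent.text; it is not the real page.

```python
import json

# Hypothetical stand-in for htmlContent.text (not the real page content).
html = '<p class="x">line one\nline two</p>'

json_d = json.dumps(html)    # JSON string literal: quotes and newline are escaped
json_l = json.loads(json_d)  # parsed back into a plain Python string

print(json_d)          # the escaped JSON form, wrapped in double quotes
print(json_l == html)  # True: the escaping was undone again
```

So jsonL carries no escaping at all, and concatenating it into a hand-written JSON template inserts the raw quotes and newlines of the HTML.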

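Use json.dumps to generate your final JSON instead of building it by hand. The HTML contains double quotes and newlines, which break a hand-concatenated JSON string; json.dumps escapes them for you: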

ContentUrl = json.dumps({
    'url': str(urls),
    'uid': str(uniqueID),
    'page_content': htmlContent.text,
    'date': finalDate
})
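
As a rough check of the difference, the sketch below builds the document both ways and tries to parse each result. The values for urls, uniqueID, finalDate and the HTML are invented sample data, and the hand-built string is shortened to two fields.

```python
import json

# Hypothetical sample values standing in for the question's variables.
urls = "https://example.com/page"
uniqueID = 42
finalDate = "2018-01-01"
html = '<a href="/x">link</a>\n<p>text</p>'

# Hand-built JSON: the quotes and the newline inside the HTML are not escaped.
hand_built = '{ "url" : "' + str(urls) + '", "page_content" : "' + html + '" }'
try:
    json.loads(hand_built)
    hand_built_valid = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    hand_built_valid = False

# json.dumps escapes everything that needs escaping.
content_url = json.dumps({
    'url': str(urls),
    'uid': str(uniqueID),
    'page_content': html,
    'date': finalDate,
})
round_tripped = json.loads(content_url)

print(hand_built_valid)                       # False
print(round_tripped['page_content'] == html)  # True
```

Parsing the output with json.loads is a quick local substitute for pasting it into an online validator.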
User contributions licensed under: CC BY-SA