I am scraping a webpage using Beautiful Soup:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://cooking.nytimes.com/recipes/1018849-classic-caprese-salad?action=click&module=Collection%20Page%20Recipe%20Card&region=46%20Ways%20to%20Do%20Justice%20to%20Summer%20Tomatoes&pgType=collection&rank=1")
c = r.content
soup = BeautifulSoup(c, "html.parser")
result = soup.find("script", {"type": "application/ld+json"})
print(type(result))
<class 'bs4.element.Tag'> , 1
print(len(result))
0
Here is what ‘result’ looks like:
I am unable to access recipeIngredient (highlighted in the image) as a dictionary key. It gives me a KeyError.
print(result['recipeIngredient'])
KeyError: 'recipeIngredient'
How can I do this? I want to extract this from ‘result’:
"recipeIngredient":["1 pound fresh, best-quality mozzarella (preferably buffalo milk)","4 medium heirloom tomatoes","1 bunch fresh basil, leaves only, some reserved for garnish","Flaky sea salt, such as Maldon","Coarsely ground black pepper","High-quality extra-virgin olive oil"]
Answer
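The JSON lives in the text of the script tag, not in its attributes. Indexing a Tag with result['recipeIngredient'] looks up an HTML attribute named recipeIngredient, which is why it raises a KeyError. Get the text inside the script tag with the .get_text() method, then parse it into a Python dictionary with json.loads: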
import json
import requests
from bs4 import BeautifulSoup

r = requests.get("https://cooking.nytimes.com/recipes/1018849-classic-caprese-salad?action=click&module=Collection%20Page%20Recipe%20Card&region=46%20Ways%20to%20Do%20Justice%20to%20Summer%20Tomatoes&pgType=collection&rank=1")
c = r.content
soup = BeautifulSoup(c, "html.parser")
result = soup.find("script", {"type": "application/ld+json"})
# Parse the script tag's text as JSON to get a regular dict
data = json.loads(result.get_text())
print(data["recipeIngredient"])
Output:
['1 pound fresh, best-quality mozzarella (preferably buffalo milk)', '4 medium heirloom tomatoes', '1 bunch fresh basil, leaves only, some reserved for garnish', 'Flaky sea salt, such as Maldon', 'Coarsely ground black pepper', 'High-quality extra-virgin olive oil']
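If the page layout changes, a slightly more defensive version may help. The following is only a sketch under some assumptions: the HTML string is a made-up stand-in for the downloaded page so it runs offline, extract_ingredients is a hypothetical helper name, and it assumes a page might contain several ld+json script tags or a top-level JSON list rather than a single object. It also shows the difference between looking up a tag attribute and parsing the tag's text.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Recipe",
 "name": "Classic Caprese Salad",
 "recipeIngredient": ["4 medium heirloom tomatoes", "Flaky sea salt, such as Maldon"]}
</script>
</head><body></body></html>
"""


def extract_ingredients(html):
    """Return recipeIngredient from the first JSON-LD block that has it, else []."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("script", {"type": "application/ld+json"}):
        try:
            # The JSON is the tag's text content, not one of its attributes.
            data = json.loads(tag.get_text())
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        # Assumption: the block may be a single object or a list of objects.
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and "recipeIngredient" in item:
                return item["recipeIngredient"]
    return []


tag = BeautifulSoup(HTML, "html.parser").find("script", {"type": "application/ld+json"})
print(tag["type"])                  # attribute lookup: application/ld+json
print(tag.get("recipeIngredient"))  # None, because it is not an HTML attribute
print(extract_ingredients(HTML))
```

Using tag.get(...) instead of tag[...] avoids the KeyError when an attribute is missing, and returning an empty list keeps the caller from crashing when no recipe data is found.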