I am scraping a webpage using Beautiful Soup:
import requests
from bs4 import BeautifulSoup
r= requests.get("https://cooking.nytimes.com/recipes/1018849-classic-caprese-salad?action=click&module=Collection%20Page%20Recipe%20Card&region=46%20Ways%20to%20Do%20Justice%20to%20Summer%20Tomatoes&pgType=collection&rank=1")
c= r.content
soup= BeautifulSoup(c, "html.parser")
result= soup.find("script", {"type": "application/ld+json"})
print(type(result))
<class 'bs4.element.Tag'> , 1
print(len(result))
0
Here is what 'result' looks like:
I am unable to access recipeIngredient (highlighted in the image) as a dictionary key. It raises a KeyError.
print(result['recipeIngredient'])
KeyError: 'recipeIngredient'
How can I do this? I want to extract this from 'result':
"recipeIngredient":["1 pound fresh, best-quality mozzarella (preferably buffalo milk)","4 medium heirloom tomatoes","1 bunch fresh basil, leaves only, some reserved for garnish","Flaky sea salt, such as Maldon","Coarsely ground black pepper","High-quality extra-virgin olive oil"]
Answer
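Indexing a Tag with square brackets looks up the tag's HTML attributes (here, only type), not the JSON stored inside it, which is why result['recipeIngredient'] raises a KeyError. You need to get the text inside the script tag with the .get_text() method and then parse it into a Python dictionary with json.loads: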
import requests, json
from bs4 import BeautifulSoup
r= requests.get("https://cooking.nytimes.com/recipes/1018849-classic-caprese-salad?action=click&module=Collection%20Page%20Recipe%20Card&region=46%20Ways%20to%20Do%20Justice%20to%20Summer%20Tomatoes&pgType=collection&rank=1")
c= r.content
soup= BeautifulSoup(c, "html.parser")
result= soup.find("script", {"type": "application/ld+json"})
data = json.loads(result.get_text())
print(data["recipeIngredient"])
Output:
['1 pound fresh, best-quality mozzarella (preferably buffalo milk)', '4 medium heirloom tomatoes', '1 bunch fresh basil, leaves only, some reserved for garnish', 'Flaky sea salt, such as Maldon', 'Coarsely ground black pepper', 'High-quality extra-virgin olive oil']
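The code above works when the script tag holds a single JSON object with a recipeIngredient key. On other pages the JSON-LD may be a list of objects or wrapped in an "@graph" key, so a slightly more defensive version may help. This is only a minimal sketch: the helper name extract_ingredients and the sample string are made up for illustration, and it takes the script text as a plain string so it can be tried without a network request.

```python
import json

def extract_ingredients(script_text):
    """Return the recipeIngredient list from JSON-LD text, or None if absent.

    Hypothetical helper: script_text is the string inside the
    <script type="application/ld+json"> tag.
    """
    data = json.loads(script_text)

    # JSON-LD can be one object, a list of objects, or an object whose
    # "@graph" key holds the list of objects.
    if isinstance(data, dict):
        candidates = data.get("@graph", [data])
    elif isinstance(data, list):
        candidates = data
    else:
        candidates = []

    # Return the first object that actually carries the key.
    for item in candidates:
        if isinstance(item, dict) and "recipeIngredient" in item:
            return item["recipeIngredient"]
    return None

# Made-up sample standing in for the text of the script tag.
sample = '{"@type": "Recipe", "recipeIngredient": ["4 medium heirloom tomatoes", "Flaky sea salt, such as Maldon"]}'
print(extract_ingredients(sample))
# ['4 medium heirloom tomatoes', 'Flaky sea salt, such as Maldon']
```

With the scraping code above, it would be called as extract_ingredients(result.get_text()), after first checking that result is not None in case the page has no such script tag.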
