I have a website from which I'd like to extract all the images.
The website is dynamic in nature, so I tried the Agenty Chrome extension and followed these steps:
- Chose one image that I want to extract with a CSS selector; this makes the extension select the other, similar images automatically.
- Clicked the Show button and selected ATTR (attribute).
- Set src as the ATTR field.
- Gave the field a name using the name option.
- Saved it and ran it using the Agenty platform/API.
This should yield results, but it doesn't; it returns empty output.
Is there a better option? Would BS4 be a better fit for this? Any help is appreciated.
Answer
I am assuming you want to download all the images on the website. It is actually very easy to do this effectively using Beautiful Soup 4 (BS4).
# code to find all images in a given webpage
from bs4 import BeautifulSoup
import urllib.request
import requests
import shutil

url = 'https://www.mcmaster.com/'
html_page = urllib.request.urlopen(url)
soup = BeautifulSoup(html_page, features="lxml")
for img in soup.findAll('img'):
    assa = img.get('src')       # the image's src attribute
    new_image = url + assa      # build an absolute URL from it
You can also download the image by tacking this onto the end:
response = requests.get(new_image, stream=True)
with open('Mypic.bmp', 'wb') as file:
    shutil.copyfileobj(response.raw, file)
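Note that this overwrites Mypic.bmp on every iteration. If you want to keep every image, here is a minimal sketch that derives a filename from each src attribute (the filename logic and the skip-empty-src check are my own additions, not part of the original answer):

import os
import shutil
import urllib.request
import requests
from bs4 import BeautifulSoup

url = 'https://www.mcmaster.com/'
soup = BeautifulSoup(urllib.request.urlopen(url), features="lxml")
for img in soup.findAll('img'):
    src = img.get('src')
    if not src:
        continue  # skip <img> tags with no src attribute
    # assumption: the last path segment of the URL makes a usable filename
    filename = os.path.basename(src) or 'image.bmp'
    response = requests.get(url + src, stream=True)
    with open(filename, 'wb') as file:
        shutil.copyfileobj(response.raw, file)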
Everything in two lines:
from bs4 import BeautifulSoup; import urllib.request; from urllib.request import urlretrieve
for img in BeautifulSoup(urllib.request.urlopen("https://apod.nasa.gov/apod/astropix.html"), features="lxml").findAll('img'): assa = img.get('src'); urlretrieve("https://apod.nasa.gov/apod/" + assa, "Mypic.bmp")
The new image will be saved in the same directory as the Python script, but it can be moved with:
os.rename()
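For example, a quick sketch with made-up paths (the destination directory must already exist):

import os

os.rename('Mypic.bmp', 'images/Mypic.bmp')

shutil.move() does the same job and also works across filesystems.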
In the case of the McMaster website, the images are linked differently, so the above methods won’t work. The following code should get most of the images on the website:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

req = Request("https://www.mcmaster.com/")
html_page = urlopen(req)
soup = BeautifulSoup(html_page, "lxml")

# collect the href of every <link> tag on the page
links = []
for link in soup.findAll('link'):
    links.append(link.get('href'))
print(links)
UPDATE: I found the code below in a GitHub post, and it is MUCH more accurate:
import requests
import re

# regex for image URLs under McMaster's home-page graphics path
image_link_home = "https://images1.mcmaster.com/init/gfx/home/.*[0-9]"

html_page = requests.get(
    'https://www.mcmaster.com/',
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'}
).text

for item in re.findall(image_link_home, html_page):
    if str(item).startswith('http') and len(item) < 150:
        # plain absolute image URL
        print(item.strip())
    else:
        # URL buried inside inline CSS; split it out of background-image:url(...)
        for elements in item.split('background-image:url('):
            for item in re.findall(image_link_home, elements):
                print(str(item).split('")')[0].strip())
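If you want to download the matched images rather than just print their URLs, a small sketch reusing the streaming pattern from earlier (the helper name and filename scheme are my own assumptions):

import shutil
import requests

def save_image(image_url, filename):
    # stream the image straight to disk
    response = requests.get(image_url, stream=True,
                            headers={'User-Agent': 'Mozilla/5.0'})
    with open(filename, 'wb') as file:
        shutil.copyfileobj(response.raw, file)

# e.g. inside the loop above, instead of print(item.strip()):
#   save_image(item.strip(), 'mcmaster_%03d.jpg' % count)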
Hope this helps!