
How can I scrape all the images from a website?

I have a website from which I'd like to download all of the images.

The website is somewhat dynamic in nature. I tried using the Agenty Chrome extension and followed these steps:

  • I chose one image that I want to extract using a CSS selector; the extension then selects the other, similar images automatically.
  • Used the Show button and selected ATTR (attribute).
  • Set src as the ATTR field.
  • Gave the field a name in the name option.
  • Saved it and ran it using the Agenty platform/API.

This should yield the images, but it doesn't; it returns empty output.

Is there a better option? Would BS4 be a better fit for this? Any help is appreciated.


Answer

I am assuming you want to download all of the images on the website. This is actually fairly easy to do with Beautiful Soup 4 (BS4).

# Code to find all images on a given web page

from bs4 import BeautifulSoup
import urllib.request
import requests   # used in the download snippet below
import shutil     # used in the download snippet below

url = 'https://www.mcmaster.com/'
html_page = urllib.request.urlopen(url)
soup = BeautifulSoup(html_page, features="lxml")

for img in soup.findAll('img'):
    assa = img.get('src')       # the src attribute of each <img> tag
    new_image = url + assa      # build an absolute URL (works for relative src paths)

You can also download the image by tacking this onto the end:

response = requests.get(new_image, stream=True)   # new_image comes from the loop above
with open('Mypic.bmp', 'wb') as file:
    shutil.copyfileobj(response.raw, file)        # stream the response body straight to disk

Everything in two lines:

from bs4 import BeautifulSoup; import urllib.request; from urllib.request import urlretrieve
for img in (BeautifulSoup((urllib.request.urlopen("https://apod.nasa.gov/apod/astropix.html")), features="lxml")).findAll('img'): assa=(img.get('src')); urlretrieve(("https://apod.nasa.gov/apod/"+assa), "Mypic.bmp")
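Note that the one-liner above saves every image to the same Mypic.bmp, overwriting it on each iteration. Here is a more readable sketch of the same idea that saves each image under its own filename; this expanded version is mine, not part of the original one-liner, and it assumes the same APOD page:

from bs4 import BeautifulSoup
from urllib.request import urlopen, urlretrieve
from urllib.parse import urljoin
import os

page_url = "https://apod.nasa.gov/apod/astropix.html"
soup = BeautifulSoup(urlopen(page_url), features="lxml")

for img in soup.find_all('img'):
    src = img.get('src')
    if not src:
        continue
    full_url = urljoin(page_url, src)       # resolve relative src paths against the page URL
    filename = os.path.basename(full_url)   # use the last path component as the local filename
    urlretrieve(full_url, filename)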

The downloaded image ends up in the same directory as the Python script, but it can be moved with:

os.rename()
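For example, a minimal sketch (the images folder and the Mypic.bmp filename here are just placeholders, assuming the file downloaded above):

import os

os.makedirs("images", exist_ok=True)          # create the target folder if it does not exist
os.rename("Mypic.bmp", "images/Mypic.bmp")    # move the downloaded file into it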

In the case of the McMaster website, the images are linked differently, so the above methods won’t work. The following code should get most of the images on the website:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

req = Request("https://www.mcmaster.com/")
html_page = urlopen(req)

soup = BeautifulSoup(html_page, "lxml")

links = []

# Collect the href of every <link> tag; on this site the image assets are
# referenced this way rather than through <img src=...>
for link in soup.findAll('link'):
    links.append(link.get('href'))

print(links)

UPDATE: I found the code below in a GitHub post; it is much more accurate:

import requests
import re

# Regex for image URLs under the McMaster home-page graphics path
image_link_home = r"https://images1\.mcmaster\.com/init/gfx/home/.*[0-9]"

html_page = requests.get('https://www.mcmaster.com/',
                         headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'}).text

for item in re.findall(image_link_home, html_page):
    if str(item).startswith('http') and len(item) < 150:
        print(item.strip())
    else:
        # Longer matches come from inline CSS; split them apart and re-match
        for elements in item.split('background-image:url('):
            for item in re.findall(image_link_home, elements):
                print((str(item).split('")')[0]).strip())
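
The loop above only prints the image URLs. To actually save them, each printed URL can be fed back into requests, roughly like this (the download_image helper and the mcmaster_images folder are illustrative assumptions, not part of the GitHub snippet):

import os
import requests

def download_image(image_url, folder="mcmaster_images"):
    # Strip any trailing '")' left over from an inline-CSS match
    clean_url = image_url.split('")')[0].strip()
    os.makedirs(folder, exist_ok=True)
    filename = os.path.join(folder, os.path.basename(clean_url))
    response = requests.get(clean_url, stream=True,
                            headers={'User-Agent': 'Mozilla/5.0'})
    with open(filename, 'wb') as out_file:
        for chunk in response.iter_content(chunk_size=8192):
            out_file.write(chunk)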

Hope this helps!
