If I were, for example, looking to track the price changes of MIDI keyboards on https://www.gear4music.com/Studio-MIDI-Controllers, I would need to extract the URLs of all the products shown in the search results, then loop through those URLs and extract the price info for each product. I can obtain the price data of an individual product by hard-coding its URL, but I cannot find a way to automate getting the URLs of multiple products.
So far I have tried this:

    from bs4 import BeautifulSoup
    import requests

    url = "https://www.gear4music.com/Studio-MIDI-Controllers"
    response = requests.get(url)
    data = response.text
    soup = BeautifulSoup(data, 'lxml')

    # Print the href of every <a> tag on the page
    tags = soup.find_all('a')
    for tag in tags:
        print(tag.get('href'))
This does produce a list of URLs, but I cannot make out which ones relate specifically to the MIDI keyboards in that search query, whose price info I want to obtain. Is there a better, more specific way to obtain the URLs of the products only, and not everything within the HTML file?
Answer
There are many ways to obtain the product links. One way is to select all <a> tags that have a data-g4m-inv attribute:
    import requests
    from bs4 import BeautifulSoup

    url = "https://www.gear4music.com/Studio-MIDI-Controllers"
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    # Select only the <a> tags that carry a data-g4m-inv attribute
    for a in soup.select("a[data-g4m-inv]"):
        print("https://www.gear4music.com" + a["href"])
Prints:
    https://www.gear4music.com/Recording-and-Computers/SubZero-MiniPad-MIDI-Controller/P6E
    https://www.gear4music.com/Recording-and-Computers/SubZero-MiniControl-MIDI-Controller/P6D
    https://www.gear4music.com/Keyboards-and-Pianos/SubZero-MiniKey-25-Key-MIDI-Controller/JMR
    https://www.gear4music.com/Keyboards-and-Pianos/Nektar-SE25/2XWA
    https://www.gear4music.com/Keyboards-and-Pianos/Korg-nanoKONTROL2-USB-MIDI-Controller-Black/G8L
    https://www.gear4music.com/Recording-and-Computers/SubZero-ControlKey25-MIDI-Keyboard/221Y
    https://www.gear4music.com/Keyboards-and-Pianos/SubZero-CommandKey25-Universal-MIDI-Controller/221X
    ...
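If the collected links are going to feed a price-tracking loop, it may help to normalize them first. The following is only a minimal sketch under some assumptions: the helper name `absolute_unique` and the sample hrefs are hypothetical, and whether the page actually links a product more than once or mixes relative and absolute hrefs has not been checked. It uses `urllib.parse.urljoin` from the standard library instead of string concatenation, so an href that is already absolute is left unchanged, and it drops duplicates while keeping the first-seen order.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.gear4music.com"


def absolute_unique(hrefs, base=BASE_URL):
    """Resolve each href against base and drop duplicates, keeping first-seen order."""
    seen = set()
    links = []
    for href in hrefs:
        if not href:
            continue  # skip anchors with a missing or empty href
        url = urljoin(base, href)  # absolute hrefs pass through unchanged
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links


# Hypothetical hrefs, shaped like the ones printed above.
sample = [
    "/Keyboards-and-Pianos/Nektar-SE25/2XWA",
    "/Keyboards-and-Pianos/Nektar-SE25/2XWA",  # same product linked twice
    "https://www.gear4music.com/Recording-and-Computers/SubZero-MiniPad-MIDI-Controller/P6E",
    None,  # e.g. the result of tag.get("href") on an anchor without href
]

for link in absolute_unique(sample):
    print(link)
```

With the selector from the answer, the input could be something like `(a.get("href") for a in soup.select("a[data-g4m-inv]"))`, and the returned list is then what a per-product price loop would iterate over.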