Currently, I have successfully used Python to scrape data from a competitor's website to find out store information. The website has a map where you can enter a zip code and it will tell you all the stores in the area of that location. The website sends a GET request to pull store data by using this link:
https://www.homedepot.com/StoreSearchServices/v2/storesearch?address=37028&radius=50&pagesize=30
My goal is to scrape all store information, not just the results for one imaginary zip code (12345) with pagesize=30. How should I go about getting all the store information? Would it be better to iterate through a dataset of zip codes to pull all the stores, or is there a better way to do this? I've tried increasing the page size past 30, but it looks like that is the limit on the request.
Answer
This URL returns JSON with "currentPage":1, which suggests it supports some kind of pagination.
I added &page=2 and it seems to work.
Page 1:
Page 2:
Page 3:
For testing I used a bigger radius=250 to get JSON with "recordCount":123.
I found that it also works with pagesize=40. For bigger values it sends JSON with an error message.
EDIT:
Minimal working code:
The site blocks requests that have no User-Agent header.
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
}

url = 'https://www.homedepot.com/StoreSearchServices/v2/storesearch'

payload = {
    'address': 37028,
    'radius': 250,
    'pagesize': 40,
    'page': 1,
}

page = 0

while True:
    page += 1
    print('--- page:', page, '---')

    # request the next page of results
    payload['page'] = page
    response = requests.get(url, params=payload, headers=headers)
    data = response.json()

    print(data['searchReport'])

    # stop when a page comes back without any stores
    if "stores" not in data:
        break

    for number, item in enumerate(data['stores'], 1):
        print(f'{number:2} | phone: {item["phone"]} | zip: {item["address"]["postalCode"]}')
Result:
--- page: 1 ---
{'recordCount': 123, 'currentPage': 1, 'storesPerPage': 40}
 1 | phone: (931)906-2655 | zip: 37040
 2 | phone: (270)442-0817 | zip: 42001
 3 | phone: (615)662-7600 | zip: 37221
 4 | phone: (615)865-9600 | zip: 37115
 5 | phone: (615)228-3317 | zip: 37216
 6 | phone: (615)269-7800 | zip: 37204
 7 | phone: (615)824-2391 | zip: 37075
 8 | phone: (615)370-0730 | zip: 37027
 9 | phone: (615)889-7211 | zip: 37076
10 | phone: (615)599-4578 | zip: 37064
etc.
--- page: 2 ---
{'recordCount': 123, 'currentPage': 2, 'storesPerPage': 40}
 1 | phone: (662)890-9470 | zip: 38654
 2 | phone: (502)964-1845 | zip: 40219
 3 | phone: (812)941-9641 | zip: 47150
 4 | phone: (812)282-0470 | zip: 47129
 5 | phone: (662)349-6080 | zip: 38637
 6 | phone: (502)899-3706 | zip: 40207
 7 | phone: (662)840-8390 | zip: 38866
 8 | phone: (502)491-3682 | zip: 40220
 9 | phone: (870)268-0619 | zip: 72404
10 | phone: (256)575-2100 | zip: 35768
etc.
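The searchReport printed in the result carries recordCount and storesPerPage, so the number of pages could be worked out up front rather than discovered by looping until "stores" disappears. A minimal sketch, assuming those two fields mean what their names suggest; page_count is a hypothetical helper, not something the site provides:

```python
import math


def page_count(record_count, stores_per_page):
    """Return how many pages are needed to cover record_count items."""
    if stores_per_page <= 0:
        raise ValueError("stores_per_page must be positive")
    return math.ceil(record_count / stores_per_page)


# Values taken from the searchReport shown in the result.
pages = page_count(123, 40)
print(pages)
```

With that number the loop could become `for page in range(1, pages + 1)`, although keeping the `"stores" not in data` check is still a sensible guard in case the report is wrong.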
If you want to keep the data as a DataFrame, first put all the items in a list and then convert that list to a DataFrame.
# --- before loop ----

all_items = []
page = 0

# --- loop ----

while True:

    # ... code ...

    for number, item in enumerate(data['stores'], 1):
        print(f'{number:2} | phone: {item["phone"]} | zip: {item["address"]["postalCode"]}')
        all_items.append(item)

# --- after loop ----

import pandas as pd

df = pd.DataFrame(all_items)
print(df)
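If the search is repeated for several zip codes, as the question suggests, overlapping radii will probably return some stores more than once. One way to handle that is to key the collected items by storeId (a field that appears in each store record) before building the DataFrame. This is only a sketch: merge_unique is a hypothetical helper, and the sample records are made up to stand in for data['stores'] from two searches.

```python
def merge_unique(stores_by_id, new_items):
    """Add items to stores_by_id, skipping any storeId already seen.

    Returns the number of items that were actually added.
    """
    added = 0
    for item in new_items:
        store_id = item["storeId"]
        if store_id not in stores_by_id:
            stores_by_id[store_id] = item
            added += 1
    return added


# Made-up sample records standing in for two overlapping searches.
first_search = [{"storeId": "0726", "name": "Clarksville, TN"}]
second_search = [
    {"storeId": "0726", "name": "Clarksville, TN"},  # repeat of the first search
    {"storeId": "9999", "name": "Hypothetical store"},
]

stores_by_id = {}
merge_unique(stores_by_id, first_search)
merge_unique(stores_by_id, second_search)

# One entry per store, in the order each store was first seen.
all_items = list(stores_by_id.values())
print(len(all_items))
```

The resulting all_items list can be passed to pd.DataFrame in the same way as before.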
Because the JSON keeps address as a dictionary {'postalCode': ... , ...}, some columns may hold dictionaries.
print(df.iloc[0])
storeId                0726
name                   Clarksville, TN
phone                  (931)906-2655
address                {'postalCode': '37040', 'county': 'Montgomery'...
coordinates            {'lat': 36.581677, 'lng': -87.300826}
services               {'loadNGo': True, 'propane': True, 'toolRental...
storeContacts          [{'name': 'Brenda G.', 'role': 'Manager'}]
storeHours             {'monday': {'open': '6:00', 'close': '21:00'},...
url                    /l/Clarksville-TN/TN/Clarksville/37040/726
distance               32.530296
proDeskPhone           (931)920-9400
flags                  {'bopisFlag': True, 'assemblyFlag': True, 'bos...
marketNbr              0019
axGeoCode              00
storeTimeZone          CST6CDT
curbsidePickupHours    {'monday': {'open': '09:00', 'close': '18:00'}...
storeOpenDt            1998-08-13
storeType              retail
toolRentalPhone        NaN
Note the { } in address, services, storeHours, etc.
You may also need to expand such a column into separate columns:
df['address'].apply(pd.Series)
and then concatenate the result with the original df:
df2 = pd.concat( [df, df['address'].apply(pd.Series)], axis=1 )
You can do the same with the other columns.
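An alternative to expanding the columns one by one is to flatten each item before it goes into the DataFrame. The sketch below is plain Python and makes a few assumptions: flatten is a hypothetical helper, the dotted key names are just one possible naming choice, and the sample item only copies a few values from the record shown earlier. Lists such as storeContacts are left as they are.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dictionaries into one level with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse so that deeper levels (e.g. storeHours) are flattened too.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Sample item with a few values copied from the record shown earlier.
item = {
    "storeId": "0726",
    "address": {"postalCode": "37040", "county": "Montgomery"},
    "coordinates": {"lat": 36.581677, "lng": -87.300826},
}

row = flatten(item)
print(row)
```

A list of such rows can be passed straight to pd.DataFrame. Prefixing the keys with the parent name also avoids clashes when two nested dictionaries share a key. pandas also has pd.json_normalize, which performs a similar flattening on a list of records.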