Currently, I have successfully used Python to scrape data from a competitor’s website to find out store information. The website has a map where you can enter a zip code and it will show all the stores in the area around that location. The website sends a GET request to pull store data by using this link:
https://www.homedepot.com/StoreSearchServices/v2/storesearch?address=37028&radius=50&pagesize=30
My goal is to scrape all store information, not just the results for a single zip code (say, an imaginary zip code = 12345) with pagesize=30. How should I go about getting all the store information? Would it be better to iterate through a dataset of zip codes to pull all the stores, or is there a better way to do this? I’ve tried increasing the page size past 30, but that looks like the limit on the request.
Answer
This URL returns JSON with "currentPage":1, which suggests it may support some kind of pagination.
I added &page=2 and it seems to work.
Page 1:
Page 2:
Page 3:
For the test I used a bigger radius=250 to get JSON with "recordCount":123.
I found that it also works with pagesize=40.
For bigger values it sends JSON with an error message.
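The recordCount and storesPerPage values in the JSON are enough to work out up front how many pages to request. A minimal sketch, assuming both fields are always present and accurate; the helper name page_count is made up for illustration:

```python
import math

def page_count(record_count: int, stores_per_page: int) -> int:
    """Number of pages needed to cover record_count items."""
    if stores_per_page <= 0:
        raise ValueError("stores_per_page must be positive")
    return math.ceil(record_count / stores_per_page)

# Values reported for address=37028, radius=250, pagesize=40.
search_report = {'recordCount': 123, 'currentPage': 1, 'storesPerPage': 40}
pages = page_count(search_report['recordCount'], search_report['storesPerPage'])
print(pages)  # 4
```

With 123 records and 40 per page, the last page (page 4) would hold only 3 stores.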
EDIT:
Minimal working code:
The server blocks requests that have no User-Agent header.
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
}

url = 'https://www.homedepot.com/StoreSearchServices/v2/storesearch'

payload = {
    'address': 37028,
    'radius': 250,
    'pagesize': 40,
    'page': 1,
}

page = 0

while True:
    # request the next page
    page += 1
    print('--- page:', page, '---')
    payload['page'] = page

    response = requests.get(url, params=payload, headers=headers)
    data = response.json()
    print(data['searchReport'])

    # stop when a page comes back without any stores
    if "stores" not in data:
        break

    for number, item in enumerate(data['stores'], 1):
        print(f'{number:2} | phone: {item["phone"]} | zip: {item["address"]["postalCode"]}')
Result:
--- page: 1 ---
{'recordCount': 123, 'currentPage': 1, 'storesPerPage': 40}
1 | phone: (931)906-2655 | zip: 37040
2 | phone: (270)442-0817 | zip: 42001
3 | phone: (615)662-7600 | zip: 37221
4 | phone: (615)865-9600 | zip: 37115
5 | phone: (615)228-3317 | zip: 37216
6 | phone: (615)269-7800 | zip: 37204
7 | phone: (615)824-2391 | zip: 37075
8 | phone: (615)370-0730 | zip: 37027
9 | phone: (615)889-7211 | zip: 37076
10 | phone: (615)599-4578 | zip: 37064
etc.
--- page: 2 ---
{'recordCount': 123, 'currentPage': 2, 'storesPerPage': 40}
1 | phone: (662)890-9470 | zip: 38654
2 | phone: (502)964-1845 | zip: 40219
3 | phone: (812)941-9641 | zip: 47150
4 | phone: (812)282-0470 | zip: 47129
5 | phone: (662)349-6080 | zip: 38637
6 | phone: (502)899-3706 | zip: 40207
7 | phone: (662)840-8390 | zip: 38866
8 | phone: (502)491-3682 | zip: 40220
9 | phone: (870)268-0619 | zip: 72404
10 | phone: (256)575-2100 | zip: 35768
etc.
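If you also loop over several zip codes, as the question suggests, overlapping search areas will return the same store more than once. A minimal sketch of removing duplicates, assuming storeId uniquely identifies a store; the helper name merge_unique and the store records '0001' and '0002' are made up for illustration:

```python
def merge_unique(batches):
    """Merge lists of store dicts, keeping the first record seen per storeId."""
    unique = {}
    for stores in batches:
        for store in stores:
            # setdefault keeps the existing record if the id was already seen
            unique.setdefault(store['storeId'], store)
    return list(unique.values())

# Two hypothetical result batches whose search areas overlap.
batch_a = [{'storeId': '0726', 'name': 'Clarksville, TN'},
           {'storeId': '0001', 'name': 'Store A'}]
batch_b = [{'storeId': '0726', 'name': 'Clarksville, TN'},
           {'storeId': '0002', 'name': 'Store B'}]

stores = merge_unique([batch_a, batch_b])
print([s['storeId'] for s in stores])  # ['0726', '0001', '0002']
```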
If you want to keep the data as a DataFrame,
first put all items in a list and then convert that list to a DataFrame.
# --- before loop ----

all_items = []
page = 0

# --- loop ----

while True:

    # ... code ...

    for number, item in enumerate(data['stores'], 1):
        print(f'{number:2} | phone: {item["phone"]} | zip: {item["address"]["postalCode"]}')
        all_items.append(item)

# --- after loop ----

import pandas as pd

df = pd.DataFrame(all_items)
print(df)
Because the JSON keeps address as a dictionary {'postalCode': ... , ...},
some columns hold dictionaries as their values.
print(df.iloc[0])
storeId 0726
name Clarksville, TN
phone (931)906-2655
address {'postalCode': '37040', 'county': 'Montgomery'
coordinates {'lat': 36.581677, 'lng': -87.300826}
services {'loadNGo': True, 'propane': True, 'toolRental...
storeContacts [{'name': 'Brenda G.', 'role': 'Manager'}]
storeHours {'monday': {'open': '6:00', 'close': '21:00'},
url /l/Clarksville-TN/TN/Clarksville/37040/726
distance 32.530296
proDeskPhone (931)920-9400
flags {'bopisFlag': True, 'assemblyFlag': True, 'bos...
marketNbr 0019
axGeoCode 00
storeTimeZone CST6CDT
curbsidePickupHours {'monday': {'open': '09:00', 'close': '18:00'}
storeOpenDt 1998-08-13
storeType retail
toolRentalPhone NaN
See the { } in address, services, storeHours, etc.
You may also need to expand these dictionaries into separate columns.
df['address'].apply(pd.Series)
and concat the result with the original df
df2 = pd.concat( [df, df['address'].apply(pd.Series)], axis=1 )
You can do the same with the other columns.
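Another option is to flatten every nested dictionary before building the DataFrame, so each nested key becomes its own column in one pass. A minimal sketch in plain Python; the helper name flatten and the '_' separator are choices made for this example, and lists such as storeContacts are left unchanged:

```python
def flatten(record, sep='_', prefix=''):
    """Return a copy of record with nested dicts merged into prefixed keys."""
    flat = {}
    for key, value in record.items():
        name = f'{prefix}{sep}{key}' if prefix else key
        if isinstance(value, dict):
            # recurse into the nested dict, carrying the parent key as prefix
            flat.update(flatten(value, sep, name))
        else:
            flat[name] = value
    return flat

# A shortened store record with the same shape as the JSON items.
item = {
    'storeId': '0726',
    'address': {'postalCode': '37040', 'county': 'Montgomery'},
    'coordinates': {'lat': 36.581677, 'lng': -87.300826},
}

flat_item = flatten(item)
print(list(flat_item))
```

The flattened records can then be passed to pd.DataFrame, for example pd.DataFrame([flatten(i) for i in all_items]). pandas also provides pd.json_normalize, which does a similar job.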