How do I web scrape more of the underlying data from a website's map location search?

Question

I have successfully used Python to scrape store information from a competitor's website. The website has a map where you can enter a zip code, and it shows all the stores in the area around my current location. The website sends a GET request to pull store data using this link:

Accepted Answer

This URL returns JSON with "currentPage":1, which suggests it uses some kind of pagination.

I added &page=2 and it seems to work.

Page 1: https://www.homedepot.com/StoreSearchServices/v2/storesearch?address=37028&radius=250&pagesize=40&page=1

Page 2: https://www.homedepot.com/StoreSearchServices/v2/storesearch?address=37028&radius=250&pagesize=40&page=2

Page 3: https://www.homedepot.com/StoreSearchServices/v2/storesearch?address=37028&radius=250&pagesize=40&page=3

For testing I used a bigger radius=250 to get JSON with "recordCount":123.

I found that it also works with pagesize=40. For bigger values it sends JSON with an error message.

EDIT:

Minimal working code. The page blocks requests without a User-Agent header.

    import requests

    # the site blocks requests that have no User-Agent header
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
    }

    url = 'https://www.homedepot.com/StoreSearchServices/v2/storesearch'

    payload = {
        'address': 37028,
        'radius': 250,
        'pagesize': 40,
        'page': 1,
    }

    page = 0

    while True:
        page += 1
        print('--- page:', page, '---')

        # request the next page of results
        payload['page'] = page
        response = requests.get(url, params=payload, headers=headers)

        data = response.json()
        print(data['searchReport'])

        # stop when the response contains no stores
        if "stores" not in data:
            break

        for number, item in enumerate(data['stores'], 1):
            print(f'{number:2} | phone: {item["phone"]} | zip: {item["address"]["postalCode"]}')

Result:

    --- page: 1 ---
    {'recordCount': 123, 'currentPage': 1, 'storesPerPage': 40}
     1 | phone: (931)906-2655 | zip: 37040
     2 | phone: (270)442-0817 | zip: 42001
     3 | phone: (615)662-7600 | zip: 37221
     4 | phone: (615)865-9600 | zip: 37115
     5 | phone: (615)228-3317 | zip: 37216
     6 | phone: (615)269-7800 | zip: 37204
     7 | phone: (615)824-2391 | zip: 37075
     8 | phone: (615)370-0730 | zip: 37027
     9 | phone: (615)889-7211 | zip: 37076
    10 | phone: (615)599-4578 | zip: 37064
    etc.
    --- page: 2 ---
    {'recordCount': 123, 'currentPage': 2, 'storesPerPage': 40}
     1 | phone: (662)890-9470 | zip: 38654
     2 | phone: (502)964-1845 | zip: 40219
     3 | phone: (812)941-9641 | zip: 47150
     4 | phone: (812)282-0470 | zip: 47129
     5 | phone: (662)349-6080 | zip: 38637
     6 | phone: (502)899-3706 | zip: 40207
     7 | phone: (662)840-8390 | zip: 38866
     8 | phone: (502)491-3682 | zip: 40220
     9 | phone: (870)268-0619 | zip: 72404
    10 | phone: (256)575-2100 | zip: 35768
    etc.

If you want to keep the data as a DataFrame, first put all items in a list and then convert this list to a DataFrame.

    # --- before loop ---

    all_items = []
    page = 0

    # --- loop ---

    while True:

        # ... code ...

        for number, item in enumerate(data['stores'], 1):
            print(f'{number:2} | phone: {item["phone"]} | zip: {item["address"]["postalCode"]}')
            all_items.append(item)

    # --- after loop ---

    import pandas as pd

    df = pd.DataFrame(all_items)
    print(df)

Because the JSON keeps address as a dictionary {'postalCode': ... , ...}, some columns hold dictionaries.

    print(df.iloc[0])

    storeId                                                             0726
    name                                                     Clarksville, TN
    phone                                                      (931)906-2655
    address                {'postalCode': '37040', 'county': 'Montgomery'...
    coordinates                        {'lat': 36.581677, 'lng': -87.300826}
    services               {'loadNGo': True, 'propane': True, 'toolRental...
    storeContacts                 [{'name': 'Brenda G.', 'role': 'Manager'}]
    storeHours             {'monday': {'open': '6:00', 'close': '21:00'},...
    url                           /l/Clarksville-TN/TN/Clarksville/37040/726
    distance                                                       32.530296
    proDeskPhone                                               (931)920-9400
    flags                  {'bopisFlag': True, 'assemblyFlag': True, 'bos...
    marketNbr                                                           0019
    axGeoCode                                                             00
    storeTimeZone                                                    CST6CDT
    curbsidePickupHours    {'monday': {'open': '09:00', 'close': '18:00'}...
    storeOpenDt                                                   1998-08-13
    storeType                                                         retail
    toolRentalPhone                                                      NaN

See the { } in address, services, storeHours, etc.

You may also need to convert such a column to separate columns

    df['address'].apply(pd.Series)

and concat the result with the original df

    df2 = pd.concat([df, df['address'].apply(pd.Series)], axis=1)

You can do the same with the other columns.
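As a possible alternative to expanding each dictionary column by hand, `pd.json_normalize` can flatten nested dictionaries in one call. This is only a sketch and not part of the code above: the `all_items` list here is a small made-up sample shaped like the records in the output, and real responses contain many more fields.

```python
import pandas as pd

# Hypothetical sample shaped like the store records shown above;
# a real response has many more fields per store.
all_items = [
    {
        "storeId": "0726",
        "name": "Clarksville, TN",
        "address": {"postalCode": "37040", "county": "Montgomery"},
        "coordinates": {"lat": 36.581677, "lng": -87.300826},
        "storeContacts": [{"name": "Brenda G.", "role": "Manager"}],
    },
]

# Nested dictionaries become columns with dotted names,
# e.g. "address.postalCode" and "coordinates.lat".
flat = pd.json_normalize(all_items)

print(flat.columns.tolist())
print(flat.loc[0, "address.postalCode"])
```

Columns that hold lists, such as storeContacts, are left as lists, so they would still need separate handling if you want one row per contact.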

Answer