I am trying to scrape this website using bs4. Using inspect on a particular car ad tile, I figured out what I need to scrape in order to get the title and the link to the car's page.
I am using the find_all() function of the bs4 library, but it does not return the required info for all the cars. It returns info for only about 21, whereas the website clearly shows that there are about 2410 cars.
The relevant code:
    from bs4 import BeautifulSoup as bs
    from urllib.request import Request, urlopen
    import re
    import requests

    url = 'https://www.cardekho.com/used-cars+in+bangalore'
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req).read()
    page_soup = bs(webpage, "html.parser")
    tags = page_soup.find_all("div", "title")
    print(len(tags))

How can I get info on all of the cars present on the page?
P.S. Just one thing to point out: the cars aren't all displayed at once. More car info gets loaded as you scroll down. Could it be because of that? Not sure.
Answer
Ok, I've written up some sample code to show you how it can be done. Although the site has a convenient API that we can leverage, the first page is not available through the API; it is embedded in a script tag in the HTML code, which requires additional processing to extract. After that it is simply a matter of getting the JSON data from the API, parsing it into Python dictionaries, and appending the car entries to a list. The link to the API can be found by inspecting network activity in Chrome or Firefox while scrolling the site.
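The "additional processing" for the embedded first page does not necessarily need Node. As a minimal sketch, assuming the value assigned to window.__INITIAL_STATE__ in the script tag is strict JSON (which may not hold if the site emits JavaScript-only syntax), the standard library's json.JSONDecoder.raw_decode can parse it directly. The helper name extract_assigned_json and the sample string are made up for illustration:

```python
import json
import re


def extract_assigned_json(script_text, variable="window.__INITIAL_STATE__"):
    """Return the value assigned to `variable`, parsed as JSON."""
    # Find "<variable> =" and the position right after the equals sign.
    match = re.search(re.escape(variable) + r"\s*=\s*", script_text)
    if match is None:
        raise ValueError(f"{variable} not found in script text")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (a trailing semicolon, more JavaScript).
    obj, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return obj


# Made-up stand-in for the real script contents, for illustration only.
sample = (
    'window.__INITIAL_STATE__ = {"items": [{"vid": "Car A"}], '
    '"total_count": 41}; window.other = 1;'
)
state = extract_assigned_json(sample)
print(state["total_count"], len(state["items"]))
```

If raw_decode raises json.JSONDecodeError on the real page, the assigned value is not plain JSON and evaluating it with Node, as in the code that follows, is the safer route.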
    from bs4 import BeautifulSoup
    import re
    import json
    from subprocess import check_output
    import requests
    import time
    from tqdm import tqdm  # tqdm is just to implement a progress bar, https://pypi.org/project/tqdm/

    cars = []  # create empty list to which we will append the car dicts from the json data

    url = 'https://www.cardekho.com/used-cars+in+bangalore'
    r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(r.content.decode('utf-8'), "html.parser")

    # find the section with the json data: look for a script tag with the
    # application/ld+json type and take the next tag, which is the one with
    # the data we need (see the page source code)
    s = soup.find('script', {"type": "application/ld+json"}).next_sibling

    # strip the text of unnecessary strings and load the json as a python dict, taken from:
    # https://stackoverflow.com/questions/54991571/extract-json-from-html-script-tag-with-beautifulsoup-in-python/54992015#54992015
    js = 'window = {};\n' + s.text.strip() + ';\nprocess.stdout.write(JSON.stringify(window.__INITIAL_STATE__));'
    with open('temp.js', 'w') as f:  # save the string to a javascript file
        f.write(js)

    # execute the file with node, which returns the json data that json.loads then parses
    data_site = json.loads(check_output(['node', 'temp.js']))
    for i in data_site['items']:  # iterate over the dict and append all cars to the list 'cars'
        cars.append(i)

    # 'pagefrom' in the api call is 20, 40, 60, etc., so create a range and loop over it
    for page in tqdm(range(20, data_site['total_count'], 20)):
        r = requests.get(f"https://www.cardekho.com/api/v1/usedcar/search?&cityId=105&connectoid=&lang_code=en&regionId=0&searchstring=used-cars%2Bin%2Bbangalore&pagefrom={page}&sortby=updated_date&sortorder=asc&mink=0&maxk=200000&dealer_id=&regCityNames=&regStateNames=", headers={'User-Agent': 'Mozilla/5.0'})
        data = r.json()
        for i in data['data']['cars']:  # iterate over the dict and append all cars to the list 'cars'
            cars.append(i)
        time.sleep(5)  # wait a few seconds to avoid overloading the site
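The long query string in the API call can be easier to read and adjust as a dictionary. The following is only a sketch: the parameter names and values are read off the URL in the code above, while the helper names page_offsets and search_params are invented here. urlencode is used just to show the resulting query string.

```python
from urllib.parse import urlencode

API_URL = "https://www.cardekho.com/api/v1/usedcar/search"


def page_offsets(total_count, page_size=20):
    """Values for the `pagefrom` parameter, skipping the embedded first page."""
    return list(range(page_size, total_count, page_size))


def search_params(page_from):
    """Query parameters as they appear in the API URL above."""
    return {
        "cityId": 105,
        "connectoid": "",
        "lang_code": "en",
        "regionId": 0,
        "searchstring": "used-cars+in+bangalore",  # '+' is sent as %2B
        "pagefrom": page_from,
        "sortby": "updated_date",
        "sortorder": "asc",
        "mink": 0,
        "maxk": 200000,
        "dealer_id": "",
        "regCityNames": "",
        "regStateNames": "",
    }


offsets = page_offsets(2410)
print(len(offsets), offsets[:3], offsets[-1])
print(API_URL + "?" + urlencode(search_params(offsets[0])))
```

With requests, the same dictionary can be passed as requests.get(API_URL, params=search_params(page), headers={'User-Agent': 'Mozilla/5.0'}), which encodes the values for you.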
The scraping script leaves cars as a list of dictionaries. The car names can be found under the vid key, and the URLs under the vlink key.
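To get back to what the question asked for (title and link), the two keys can be pulled out of each dictionary. A short sketch, assuming the vlink paths are relative to https://www.cardekho.com (they start with a slash in the output below); the function name titles_and_links and the second sample entry are made up:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.cardekho.com"  # assumed site root for the relative vlink paths


def titles_and_links(cars, base_url=BASE_URL):
    """Return (title, absolute URL) pairs, skipping entries missing either key."""
    pairs = []
    for car in cars:
        title, link = car.get("vid"), car.get("vlink")
        if title and link:
            pairs.append((title, urljoin(base_url, link)))
    return pairs


sample_cars = [
    {
        "vid": "Mahindra XUV500 2011-2015 W6 2WD",
        "vlink": "/used-car-details/used-Mahindra-XUV500-2011-2015-W6-2WD-cars-Bangalore_9509A09F1673FE2566DF59EC54AAC05B.htm",
    },
    {"vid": "Entry without a link"},  # hypothetical incomplete record
]

for title, url in titles_and_links(sample_cars):
    print(title, "->", url)
```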
You can load it into a pandas dataframe to explore the data:
    import pandas as pd
    df = pd.DataFrame(cars)

df.head() will output (I omitted the images column for readability):
loc | myear | bt | ft | km | it | pi | pn | pu | dvn | ic | ucid | sid | ip | oem | model | vid | city | vlink | p_numeric | webp_image | position | pageNo | centralVariantId | isExpiredModel | modelId | isGenuine | is_ftc | seller_location | utype | views | tmGaadiStore | cls | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Koramangala | 2014 | SUV | Diesel | 30,000 | 0 | https://images10.gaadicdn.com/usedcar_image/320x240/used_car_2206305_1614944913.jpg | 9.9 Lakh | Mahindra XUV500 W6 2WD | 13 | 3019084 | 9509A09F1673FE2566DF59EC54AAC05B | 1 | Mahindra | Mahindra XUV500 | Mahindra XUV500 2011-2015 W6 2WD | Bangalore | /used-car-details/used-Mahindra-XUV500-2011-2015-W6-2WD-cars-Bangalore_9509A09F1673FE2566DF59EC54AAC05B.htm | 990000 | https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/used_car_2206305_1614944913.webp | 1 | 1 | 3822 | True | 570 | 0 | 0 | {'address': 'BDA Complex, 100 Feet Rd, 3rd Block, Koramangala 3 Block, Koramangala, Bengaluru, Karnataka 560034, Bangalore', 'lat': 12.931, 'lng': 77.6228} | Dealer | 235 | False | ||
1 | Marathahalli Colony | 2017 | SUV | Petrol | 30,000 | 0 | https://images10.gaadicdn.com/usedcar_image/320x240/used_car_2203506_1614754307.jpeg | 7.85 Lakh | Ford Ecosport 1.5 Petrol Trend BSIV | 14 | 3015331 | 2C0E4C4E543D4792C1C3186B361F718B | 1 | Ford | Ford Ecosport | Ford Ecosport 2015-2021 1.5 Petrol Trend BSIV | Bangalore | /used-car-details/used-Ford-Ecosport-2015-2021-1.5-Petrol-Trend-BSIV-cars-Bangalore_2C0E4C4E543D4792C1C3186B361F718B.htm | 785000 | https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/used_car_2203506_1614754307.webp | 2 | 1 | 6086 | True | 175 | 0 | 0 | {'address': '2, Varthur Rd, Ayyappa Layout, Chandra Layout, Marathahalli, Bengaluru, Karnataka 560037, Marathahalli Colony, Bangalore', 'lat': 12.956727624875453, 'lng': 77.70174980163576} | Dealer | 495 | False | ||
2 | Yelahanka | 2020 | SUV | Diesel | 13,969 | 0 | https://images10.gaadicdn.com/usedcar_image/320x240/usedcar_11_276591614316705_1614316747.jpg | 41 Lakh | Toyota Fortuner 2.8 4WD AT | 12 | 3007934 | BBC13FB62DF6840097AA62DDEA05BB04 | 1 | Toyota | Toyota Fortuner | Toyota Fortuner 2016-2021 2.8 4WD AT | Bangalore | /used-car-details/used-Toyota-Fortuner-2016-2021-2.8-4WD-AT-cars-Bangalore_BBC13FB62DF6840097AA62DDEA05BB04.htm | 4100000 | https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/usedcar_11_276591614316705_1614316747.webp | 3 | 1 | 7618 | True | 364 | 0 | 0 | {'address': 'Sonnappanahalli Kempegowda Intl Airport Road Jala Uttarahalli Hobli, Yelahanka, Bangalore, Karnataka 560064', 'lat': 13.1518821, 'lng': 77.6220694} | Dealer | 516 | False | ||
3 | Byatarayanapura | 2017 | Sedans | Diesel | 18,000 | 0 | https://images10.gaadicdn.com/usedcar_image/320x240/used_car_2202297_1615013237.jpg | 35 Lakh | Mercedes-Benz E-Class E250 CDI Avantgarde | 15 | 3013606 | 4553943A967049D873712AFFA5F65A56 | 1 | Mercedes-Benz | Mercedes-Benz E-Class | Mercedes-Benz E-Class 2009-2012 E250 CDI Avantgarde | Bangalore | /used-car-details/used-Mercedes-Benz-E-Class-2009-2012-E250-CDI-Avantgarde-cars-Bangalore_4553943A967049D873712AFFA5F65A56.htm | 3500000 | https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/used_car_2202297_1615013237.webp | 4 | 1 | 4611 | True | 674 | 0 | 0 | {'address': 'NO 19, Near Traffic Signal, Byatanarayanapura, International Airport Road, Byatarayanapura, Bangalore, Karnataka 560085', 'lat': 13.0669588, 'lng': 77.5928756} | Dealer | 414 | False | ||
4 | nan | 2015 | Sedans | Diesel | 80,000 | 0 | https://stimg.cardekho.com/pwa/img/noimage.svg | 12.5 Lakh | Skoda Octavia Elegance 2.0 TDI AT | 1 | 3002709 | 156E5F2317C0A3A3BF8C03FFC35D404C | 1 | Skoda | Skoda Octavia | Skoda Octavia 2013-2017 Elegance 2.0 TDI AT | Bangalore | /used-car-details/used-Skoda-Octavia-2013-2017-Elegance-2.0-TDI-AT-cars-Bangalore_156E5F2317C0A3A3BF8C03FFC35D404C.htm | 1250000 | 5 | 1 | 3092 | True | 947 | 0 | 0 | {'lat': 0, 'lng': 0} | Individual | 332 | False |
Or, if you wish to expand the dict in seller_location into separate columns, you can load the data with df = pd.json_normalize(cars) instead.
You can save all the data to a CSV file with df.to_csv('output.csv').
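If pandas is not available, the standard library's csv module can write the same list. This is a swapped-in alternative to df.to_csv, not what the answer uses. Records may not all share the same keys, so the sketch collects the union of keys first; the function name write_cars_csv and the trimmed sample records are made up for illustration.

```python
import csv
import io
import json


def write_cars_csv(cars, fileobj):
    """Write a list of car dicts as CSV, tolerating records with missing keys."""
    # Collect every key seen in any record, in first-seen order.
    fieldnames = []
    for car in cars:
        for key in car:
            if key not in fieldnames:
                fieldnames.append(key)
    # restval fills in cells for keys a record does not have.
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames, restval="")
    writer.writeheader()
    for car in cars:
        # Store nested dicts/lists (e.g. seller_location) as JSON text.
        row = {
            key: json.dumps(value) if isinstance(value, (dict, list)) else value
            for key, value in car.items()
        }
        writer.writerow(row)


# Trimmed-down sample records, for illustration only.
sample = [
    {"loc": "Koramangala", "vid": "Mahindra XUV500 2011-2015 W6 2WD",
     "seller_location": {"lat": 12.931, "lng": 77.6228}},
    {"vid": "Skoda Octavia 2013-2017 Elegance 2.0 TDI AT", "utype": "Individual"},
]
buffer = io.StringIO()
write_cars_csv(sample, buffer)
print(buffer.getvalue())
```

For a real file, open it with open('output.csv', 'w', newline='', encoding='utf-8') and pass the file object in place of the buffer.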