
Beautiful Soup find_all() not returning all elements

I am trying to scrape this website using bs4. By inspecting a particular car ad tile, I figured out what I need to scrape to get the title and the link to the car's page.

I am using the find_all() function of the bs4 library, but it does not return the required info for all the cars. It returns info for only about 21 of them, whereas the website clearly shows that there are about 2410 cars.

The relevant code:

from bs4 import BeautifulSoup as bs
from urllib.request import Request, urlopen 
import re
import requests

url = 'https://www.cardekho.com/used-cars+in+bangalore'
req = Request(url , headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

page_soup = bs(webpage,"html.parser")

tags = page_soup.find_all("div","title")

print(len(tags))

How can I get info on all of the cars present on the page?

P.S. I want to point out one thing: not all the cars are displayed at once. More car info gets loaded as you scroll down. Could it be because of that? Not sure.


Answer

OK, I've written some sample code to show how it can be done. The site has a convenient API that we can use, but the first page of results is not available through it; instead, that page is embedded in a script tag in the HTML, which requires additional processing to extract. After that, it is simply a matter of getting the JSON data from the API, parsing it into Python dictionaries and appending the car entries to a list. You can find the link to the API by inspecting the network activity in Chrome or Firefox while scrolling the site.

from bs4 import BeautifulSoup
import json
from subprocess import check_output
import requests
import time
from tqdm import tqdm # tqdm only provides a progress bar, https://pypi.org/project/tqdm/

cars = [] #create empty list to which we will append the car dicts from the json data

url = 'https://www.cardekho.com/used-cars+in+bangalore'
r = requests.get(url , headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.content.decode('utf-8'),"html.parser")
# Find the script tag of type application/ld+json and take its next sibling,
# which is the script tag holding the first page of data (see the page source).
s = soup.find('script', {"type":"application/ld+json"}).next_sibling

# Wrap the script text so that Node defines window, runs the script and prints
# window.__INITIAL_STATE__ as JSON. Taken from:
# https://stackoverflow.com/questions/54991571/extract-json-from-html-script-tag-with-beautifulsoup-in-python/54992015#54992015
js = 'window = {};\n'+s.text.strip()+';\nprocess.stdout.write(JSON.stringify(window.__INITIAL_STATE__));'
with open('temp.js','w') as f: # save the string to a JavaScript file
    f.write(js)

data_site = json.loads(check_output(['node','temp.js'])) # execute the file with node and parse the JSON it prints
for i in data_site['items']: # append every car from the first page to the list 'cars'
    cars.append(i)

for page in tqdm(range(20, data_site['total_count'], 20)): # 'pagefrom' in the API call is 20, 40, 60, etc., so loop over that range
    r = requests.get(f"https://www.cardekho.com/api/v1/usedcar/search?&cityId=105&connectoid=&lang_code=en&regionId=0&searchstring=used-cars%2Bin%2Bbangalore&pagefrom={page}&sortby=updated_date&sortorder=asc&mink=0&maxk=200000&dealer_id=&regCityNames=&regStateNames=", headers={'User-Agent': 'Mozilla/5.0'})
    data = r.json()

    for i in data['data']['cars']: # append every car in this API response to the list 'cars'
        cars.append(i)

    time.sleep(5) # wait a few seconds to avoid overloading the site
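
As an aside, the Node step used for the first page can sometimes be avoided. The following is only a sketch of a different technique, not what the code above does: it assumes the script text contains an assignment of the form window.__INITIAL_STATE__ = {...} whose right-hand side is strict JSON (no functions, undefined or single-quoted strings), which I have not verified for this site. The helper name extract_initial_state and the sample string are hypothetical.

```python
import json
import re


def extract_initial_state(script_text):
    """Return the value assigned to window.__INITIAL_STATE__, or None if absent.

    Assumption: the assigned value is strict JSON. If the page uses
    JavaScript-only syntax there, json will raise a JSONDecodeError.
    """
    match = re.search(r"window\.__INITIAL_STATE__\s*=\s*", script_text)
    if match is None:
        return None
    # raw_decode parses a single JSON value starting at the given index and
    # ignores whatever follows it (a trailing semicolon, more script code).
    state, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return state


# Hypothetical script content, shaped like the keys the code above reads.
sample = (
    'window.__INITIAL_STATE__ = {"items": [{"vid": "Car A"}], '
    '"total_count": 2410}; window.other = 1;'
)
state = extract_initial_state(sample)
print(state["total_count"], len(state["items"]))
```

If that assumption holds, extract_initial_state(s.text) would stand in for writing temp.js and calling node; if it does not, the Node approach above remains the safer choice.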

Once the main loop has finished, cars is a list of dictionaries. The car names can be found in the vid key, and the URLs are in the vlink key. You can load the list into a pandas DataFrame to explore the data:

import pandas as pd
df = pd.DataFrame(cars)

df.head() will output (I omitted the images column for readability):

loc myear bt ft km it pi pn pu dvn ic ucid sid ip oem model vid city vlink p_numeric webp_image position pageNo centralVariantId isExpiredModel modelId isGenuine is_ftc seller_location utype views tmGaadiStore cls
0 Koramangala 2014 SUV Diesel 30,000 0 https://images10.gaadicdn.com/usedcar_image/320x240/used_car_2206305_1614944913.jpg 9.9 Lakh Mahindra XUV500 W6 2WD 13 3019084 9509A09F1673FE2566DF59EC54AAC05B 1 Mahindra Mahindra XUV500 Mahindra XUV500 2011-2015 W6 2WD Bangalore /used-car-details/used-Mahindra-XUV500-2011-2015-W6-2WD-cars-Bangalore_9509A09F1673FE2566DF59EC54AAC05B.htm 990000 https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/used_car_2206305_1614944913.webp 1 1 3822 True 570 0 0 {'address': 'BDA Complex, 100 Feet Rd, 3rd Block, Koramangala 3 Block, Koramangala, Bengaluru, Karnataka 560034, Bangalore', 'lat': 12.931, 'lng': 77.6228} Dealer 235 False
1 Marathahalli Colony 2017 SUV Petrol 30,000 0 https://images10.gaadicdn.com/usedcar_image/320x240/used_car_2203506_1614754307.jpeg 7.85 Lakh Ford Ecosport 1.5 Petrol Trend BSIV 14 3015331 2C0E4C4E543D4792C1C3186B361F718B 1 Ford Ford Ecosport Ford Ecosport 2015-2021 1.5 Petrol Trend BSIV Bangalore /used-car-details/used-Ford-Ecosport-2015-2021-1.5-Petrol-Trend-BSIV-cars-Bangalore_2C0E4C4E543D4792C1C3186B361F718B.htm 785000 https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/used_car_2203506_1614754307.webp 2 1 6086 True 175 0 0 {'address': '2, Varthur Rd, Ayyappa Layout, Chandra Layout, Marathahalli, Bengaluru, Karnataka 560037, Marathahalli Colony, Bangalore', 'lat': 12.956727624875453, 'lng': 77.70174980163576} Dealer 495 False
2 Yelahanka 2020 SUV Diesel 13,969 0 https://images10.gaadicdn.com/usedcar_image/320x240/usedcar_11_276591614316705_1614316747.jpg 41 Lakh Toyota Fortuner 2.8 4WD AT 12 3007934 BBC13FB62DF6840097AA62DDEA05BB04 1 Toyota Toyota Fortuner Toyota Fortuner 2016-2021 2.8 4WD AT Bangalore /used-car-details/used-Toyota-Fortuner-2016-2021-2.8-4WD-AT-cars-Bangalore_BBC13FB62DF6840097AA62DDEA05BB04.htm 4100000 https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/usedcar_11_276591614316705_1614316747.webp 3 1 7618 True 364 0 0 {'address': 'Sonnappanahalli Kempegowda Intl Airport Road Jala Uttarahalli Hobli, Yelahanka, Bangalore, Karnataka 560064', 'lat': 13.1518821, 'lng': 77.6220694} Dealer 516 False
3 Byatarayanapura 2017 Sedans Diesel 18,000 0 https://images10.gaadicdn.com/usedcar_image/320x240/used_car_2202297_1615013237.jpg 35 Lakh Mercedes-Benz E-Class E250 CDI Avantgarde 15 3013606 4553943A967049D873712AFFA5F65A56 1 Mercedes-Benz Mercedes-Benz E-Class Mercedes-Benz E-Class 2009-2012 E250 CDI Avantgarde Bangalore /used-car-details/used-Mercedes-Benz-E-Class-2009-2012-E250-CDI-Avantgarde-cars-Bangalore_4553943A967049D873712AFFA5F65A56.htm 3500000 https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/used_car_2202297_1615013237.webp 4 1 4611 True 674 0 0 {'address': 'NO 19, Near Traffic Signal, Byatanarayanapura, International Airport Road, Byatarayanapura, Bangalore, Karnataka 560085', 'lat': 13.0669588, 'lng': 77.5928756} Dealer 414 False
4 nan 2015 Sedans Diesel 80,000 0 https://stimg.cardekho.com/pwa/img/noimage.svg 12.5 Lakh Skoda Octavia Elegance 2.0 TDI AT 1 3002709 156E5F2317C0A3A3BF8C03FFC35D404C 1 Skoda Skoda Octavia Skoda Octavia 2013-2017 Elegance 2.0 TDI AT Bangalore /used-car-details/used-Skoda-Octavia-2013-2017-Elegance-2.0-TDI-AT-cars-Bangalore_156E5F2317C0A3A3BF8C03FFC35D404C.htm 1250000 5 1 3092 True 947 0 0 {'lat': 0, 'lng': 0} Individual 332 False

Or if you wish to explode the dict in seller_location to columns, you can load it with df = pd.json_normalize(cars).
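
To illustrate what json_normalize does with the nested seller_location dict, here is a small sketch on two hand-made records. The records are abridged stand-ins modelled on the rows shown above (the address is shortened), not real API output, and the name df_flat is only for illustration.

```python
import pandas as pd

# Abridged, hand-made records shaped like the entries in 'cars'.
sample_cars = [
    {
        "vid": "Mahindra XUV500 2011-2015 W6 2WD",
        "seller_location": {
            "address": "Koramangala, Bengaluru",  # shortened for the example
            "lat": 12.931,
            "lng": 77.6228,
        },
    },
    {
        "vid": "Skoda Octavia 2013-2017 Elegance 2.0 TDI AT",
        "seller_location": {"lat": 0, "lng": 0},  # no address in this record
    },
]

# Nested keys become dotted column names such as 'seller_location.lat'.
df_flat = pd.json_normalize(sample_cars)
print(sorted(df_flat.columns))
```

Records that lack a nested key, like the second one with no address, get NaN in that column.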

You can save all data to a csv file: df.to_csv('output.csv')
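
Since the question asks only for each car's title and link, here is a minimal sketch of pulling those two fields out of the cars list without pandas. It assumes that vlink is a path relative to https://www.cardekho.com, as the sample output suggests; the function name titles_and_links and the sample records are hypothetical.

```python
from urllib.parse import urljoin

# Assumption: vlink values are paths relative to the site root.
BASE_URL = "https://www.cardekho.com"


def titles_and_links(cars, base_url=BASE_URL):
    """Return (title, absolute_url) pairs for entries that have both keys."""
    results = []
    for car in cars:
        title = car.get("vid")
        link = car.get("vlink")
        if title and link:
            results.append((title, urljoin(base_url, link)))
    return results


# Hand-made records; the first is modelled on row 0 of the output above.
sample_cars = [
    {
        "vid": "Mahindra XUV500 2011-2015 W6 2WD",
        "vlink": "/used-car-details/used-Mahindra-XUV500-2011-2015-W6-2WD-"
                 "cars-Bangalore_9509A09F1673FE2566DF59EC54AAC05B.htm",
    },
    {"vid": "Entry without a link"},  # skipped because vlink is missing
]

pairs = titles_and_links(sample_cars)
for title, url in pairs:
    print(title, url)
```

Calling titles_and_links(cars) on the full list would give the same kind of pairs for every scraped car.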

User contributions licensed under: CC BY-SA