How to scrape only the links from a webpage – Python



My goal is to get each link on the page.

My code prints the href/link, but it also prints the rest of the tag, which I do not want.

I only want the href.

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import time
import requests
driver = webdriver.Chrome()
productlink=[]
for x in range (1,3):
    driver.get(f'https://meetinglibrary.asco.org/browse-meetings/2021%20Gastrointestinal%20Cancers%20Symposium?page={x}')
    time.sleep(3)
    page_source = driver.page_source
    soup = BeautifulSoup(page_source,'html.parser')
    productlist=soup.find_all('div',class_='session')
    for item in productlist:
        for link in item.find_all('a',class_='session__button ng-star-inserted',href=True):
            print(link)

Answer

href=True only filters for tags that have an href attribute; the results are still Tag objects, so printing one prints the whole element. To get the link itself, call .get("href") on the tag. Since there is only one button in each session div, you can use find instead of find_all. The href values are relative, so don't forget to prepend the base URL. Try the code below:

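from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
productlink = []
baseURL = 'https://meetinglibrary.asco.org'
for x in range(1, 3):
    driver.get(f'https://meetinglibrary.asco.org/browse-meetings/2021%20Gastrointestinal%20Cancers%20Symposium?page={x}')
    time.sleep(3)  # give the JavaScript-rendered page time to load
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # each session card is a <div class="session">
    for item in soup.find_all('div', class_='session'):
        # find() returns the first matching <a> Tag, or None if the card has no button
        button = item.find('a', class_='session__button ng-star-inserted', href=True)
        if button is not None:
            # .get("href") returns only the attribute value, not the whole Tag
            productlink.append(baseURL + button.get("href"))
driver.quit()  # close the browser once both pages are scraped
print(*productlink, sep='\n')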

Output:

https://meetinglibrary.asco.org/session/13455
https://meetinglibrary.asco.org/session/13458
https://meetinglibrary.asco.org/session/13445
https://meetinglibrary.asco.org/session/13450
https://meetinglibrary.asco.org/session/13460
https://meetinglibrary.asco.org/session/13462
https://meetinglibrary.asco.org/session/13464
https://meetinglibrary.asco.org/session/13459
https://meetinglibrary.asco.org/session/13446
https://meetinglibrary.asco.org/session/13451
https://meetinglibrary.asco.org/session/13461
https://meetinglibrary.asco.org/session/13463
https://meetinglibrary.asco.org/session/13465
https://meetinglibrary.asco.org/session/13399
https://meetinglibrary.asco.org/session/13443
https://meetinglibrary.asco.org/session/13444
https://meetinglibrary.asco.org/session/13352
https://meetinglibrary.asco.org/session/13381
https://meetinglibrary.asco.org/session/13383
https://meetinglibrary.asco.org/session/13372
https://meetinglibrary.asco.org/session/13382
https://meetinglibrary.asco.org/session/13447
https://meetinglibrary.asco.org/session/13849
https://meetinglibrary.asco.org/session/13384
https://meetinglibrary.asco.org/session/13389
https://meetinglibrary.asco.org/session/13453
https://meetinglibrary.asco.org/session/13859
https://meetinglibrary.asco.org/session/13391
https://meetinglibrary.asco.org/session/13392
https://meetinglibrary.asco.org/session/13394
....
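
To see the difference between a Tag and its href without launching a browser, here is a minimal sketch that runs the same extraction on a static HTML string. The markup in SAMPLE_HTML and the helper name extract_session_links are made up for illustration and only imitate the structure the answer relies on; the real page may differ. It also uses urljoin from the standard library in place of plain string concatenation, which is a substitution on my part and not what the answer does.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup imitating a page of session cards (an assumption,
# not copied from the real site).
SAMPLE_HTML = """
<div class="session">
  <a class="session__button ng-star-inserted" href="/session/13455">View</a>
</div>
<div class="session">
  <span>No button in this card</span>
</div>
<div class="session">
  <a class="session__button ng-star-inserted" href="/session/13458">View</a>
</div>
"""

BASE_URL = "https://meetinglibrary.asco.org"


def extract_session_links(html, base_url):
    """Return the absolute URL of the first button link in each session div."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for item in soup.find_all("div", class_="session"):
        # find() returns a Tag (the whole <a> element) or None.
        button = item.find("a", class_="session__button", href=True)
        if button is None:
            continue  # skip cards that have no button link
        # button["href"] is just the attribute value, e.g. "/session/13455".
        links.append(urljoin(base_url, button["href"]))
    return links


if __name__ == "__main__":
    for url in extract_session_links(SAMPLE_HTML, BASE_URL):
        print(url)
```

Printing button instead of button["href"] would show the full <a ...>View</a> element, which is the extra output the question complains about. The None check is there because find returns None when a div has no matching link, and calling .get on None raises an AttributeError.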


Source: stackoverflow