I am trying to make a chatbot that can get Bing search results using Python. I've looked at many websites, but they all use old Python 2 code or Google. I am currently in China and cannot access YouTube, Google, or anything else related to Google (I can't use Azure or Microsoft Docs either). I want the results to look like this:
This is the title https://this-is-the-link.com
This is the second title https://this-is-the-second-link.com
Code
import requests
import bs4
import re
import urllib.request
from bs4 import BeautifulSoup

page = urllib.request.urlopen("https://www.bing.com/search?q=programming")
soup = BeautifulSoup(page.read())
links = soup.findAll("a")
for link in links:
    print(link["href"])
And it gives me
/?FORM=Z9FD1
javascript:void(0);
javascript:void(0);
/rewards/dashboard
/rewards/dashboard
javascript:void(0);
/?scope=web&FORM=HDRSC1
/images/search?q=programming&FORM=HDRSC2
/videos/search?q=programming&FORM=HDRSC3
/maps?q=programming&FORM=HDRSC4
/news/search?q=programming&FORM=HDRSC6
/shop?q=programming&FORM=SHOPTB
http://go.microsoft.com/fwlink/?LinkId=521839
http://go.microsoft.com/fwlink/?LinkID=246338
https://go.microsoft.com/fwlink/?linkid=868922
http://go.microsoft.com/fwlink/?LinkID=286759
https://go.microsoft.com/fwlink/?LinkID=617297
Any help would be greatly appreciated (I’m using Python 3.6.9 on Ubuntu)
Answer
Actually, the code you've written works properly; the problem is in the HTTP request headers. By default, urllib uses Python-urllib/{version} as the User-Agent header value, which makes it easy for the website to recognize your request as automatically generated. To avoid this, use a custom value, which you can do by passing a Request object as the first parameter of urlopen():
from urllib.parse import urlencode, urlunparse
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

query = "programming"
url = urlunparse(("https", "www.bing.com", "/search", "", urlencode({"q": query}), ""))

custom_user_agent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"
req = Request(url, headers={"User-Agent": custom_user_agent})
page = urlopen(req)

# Further code I've left unmodified
soup = BeautifulSoup(page.read())
links = soup.findAll("a")
for link in links:
    print(link["href"])
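If you also want the output in the "Title URL" format you described, you can narrow the selection to Bing's organic results instead of printing every anchor tag. A minimal sketch, assuming Bing's current markup where organic results sit in li.b_algo elements with an h2 > a title link (that selector is an assumption and may need adjusting if Bing changes its page structure):

from urllib.parse import urlencode, urlunparse
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

query = "programming"
url = urlunparse(("https", "www.bing.com", "/search", "", urlencode({"q": query}), ""))

# Same custom User-Agent trick as above, so Bing serves the normal results page
custom_user_agent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"
req = Request(url, headers={"User-Agent": custom_user_agent})
page = urlopen(req)

soup = BeautifulSoup(page.read(), "html.parser")

# "li.b_algo h2 a" is an assumed selector for Bing's organic result titles;
# adjust it if the page structure changes.
for anchor in soup.select("li.b_algo h2 a"):
    title = anchor.get_text(strip=True)
    href = anchor.get("href")
    if href:
        print(title, href)

This should print one "title URL" line per organic result, which matches the format you asked for.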
P.S. Take a look at the comment left by @edd under your question.
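P.P.S. Since you already import requests in your code, here is a rough equivalent of the same idea with it; the custom User-Agent goes into the headers parameter and the query string into params:

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"}
response = requests.get("https://www.bing.com/search", params={"q": "programming"}, headers=headers)

soup = BeautifulSoup(response.text, "html.parser")
for link in soup.find_all("a"):
    href = link.get("href")
    if href:
        print(href)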