I am trying to make a chatbot that can get Bing search results using Python. I’ve tried many websites, but they all use old Python 2 code or Google. I am currently in China and cannot access YouTube, Google, or anything else related to Google (I can’t use Azure or Microsoft Docs either). I want the results to look like this:
This is the title https://this-is-the-link.com
This is the second title https://this-is-the-second-link.com
Code
import requests
import bs4
import re
import urllib.request
from bs4 import BeautifulSoup
page = urllib.request.urlopen("https://www.bing.com/search?q=programming")
soup = BeautifulSoup(page.read())
links = soup.findAll("a")
for link in links:
    print(link["href"])
And it gives me
/?FORM=Z9FD1
javascript:void(0);
javascript:void(0);
/rewards/dashboard
/rewards/dashboard
javascript:void(0);
/?scope=web&FORM=HDRSC1
/images/search?q=programming&FORM=HDRSC2
/videos/search?q=programming&FORM=HDRSC3
/maps?q=programming&FORM=HDRSC4
/news/search?q=programming&FORM=HDRSC6
/shop?q=programming&FORM=SHOPTB
http://go.microsoft.com/fwlink/?LinkId=521839
http://go.microsoft.com/fwlink/?LinkID=246338
https://go.microsoft.com/fwlink/?linkid=868922
http://go.microsoft.com/fwlink/?LinkID=286759
https://go.microsoft.com/fwlink/?LinkID=617297
Any help would be greatly appreciated (I’m using Python 3.6.9 on Ubuntu)
Answer
Actually, the code you’ve written works properly; the problem is in the HTTP request headers. By default, urllib uses Python-urllib/{version} as the User-Agent header value, which makes it easy for a website to recognize your request as automatically generated. To avoid this, use a custom value, which can be achieved by passing a Request object as the first parameter of urlopen():
from urllib.parse import urlencode, urlunparse
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
query = "programming"
url = urlunparse(("https", "www.bing.com", "/search", "", urlencode({"q": query}), ""))
custom_user_agent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"
req = Request(url, headers={"User-Agent": custom_user_agent})
page = urlopen(req)
# Further code I've left unmodified
soup = BeautifulSoup(page.read())
links = soup.findAll("a")
for link in links:
    print(link["href"])
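The loop above still prints every anchor on the page, including Bing’s own navigation links. To get output closer to the “title followed by URL” format asked for in the question, you could narrow the selection to the organic results. Below is a minimal sketch that assumes each result sits in an <li class="b_algo"> element with the title link inside an <h2>; that selector is an assumption about Bing’s current markup and may need adjusting:

# Sketch: print title/URL pairs from Bing's organic results.
# The "b_algo" class and the <h2> wrapper are assumptions about Bing's markup.
for result in soup.find_all("li", class_="b_algo"):
    heading = result.find("h2")
    if heading and heading.a:
        print(heading.get_text(strip=True), heading.a.get("href"))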
P.S. Take a look at the comment left by @edd under your question.
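As an aside, since the question already imports requests, the same fix works there too: send the browser-like User-Agent through the headers argument. A minimal sketch, assuming the same query and User-Agent string as above:

import requests
from bs4 import BeautifulSoup

# requests variant: a browser-like User-Agent makes Bing less likely to serve
# the reduced page shown in the question.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"}
response = requests.get("https://www.bing.com/search", params={"q": "programming"}, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")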