I am trying to scrape web results from the website: https://promedmail.org/promed-posts/
I have followed beutifulsoup. mechanical soup and mechanize so far unable to scrape the search results.
import re from mechanize import Browser,urlopen browser = Browser() browser.set_handle_robots(False) browser.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')] browser.open("https://promedmail.org/promed-posts") for form in browser.forms(): if form.attrs['id'] == 'full_search': browser.form = form break browser['search'] = 'US' response = browser.submit() content = response.read()
The content does not show the search results when typed in US. Any idea on what am I doing wrong here?
Advertisement
Answer
As you mention bs4 you can mimic the POST request the page makes. Extract the json item which contains the html the page would have been updated with (containing the results); parse that into BeautifulSoup object then reconstruct the results table as a dataframe:
import requests from bs4 import BeautifulSoup as bs headers = {'user-agent': 'Mozilla/5.0'} data = { 'action': 'get_promed_search_content', 'query[0][name]': 'kwby1', 'query[0][value]': 'summary', 'query[1][name]': 'search', 'query[1][value]': 'US', 'query[2][name]': 'date1', # 'query[2][value]': '', 'query[3][name]': 'date2', # 'query[3][value]': '', 'query[4][name]': 'feed_id', 'query[4][value]': '1' } r = requests.post('https://promedmail.org/wp-admin/admin-ajax.php', headers=headers, data=data).json() soup = bs(r['results'], 'lxml') df = pd.DataFrame([(i.find_next(text=True), i.a.text, f"https://promedmail.org/promed-post/?id={i.a['id'].replace('id','')}") for i in soup.select('li')] , columns = ['Date', 'Title', 'Link']) print(df)