I’m a complete beginner in Python.
I’m trying to write a web scraper for PubMed.
I want my code to print not only the title of each paper but also its DOI (or any other link to the paper), like this:
Title: ABCD ABCD ABCD ABCD [http:// ~~~~]
Title: ABCD ABCD ABCD ABCD [http:// ~~~~]
Title: ABCD ABCD ABCD ABCD [http:// ~~~~]
….
But in the final step, I can’t print each title together with its link.
When I print the titles and the links separately, it works.
I also don’t really understand how to use ‘for’ loops.
I would really appreciate any help with my question.
Thanks.
import requests
from bs4 import BeautifulSoup
from pprint import pprint

search = str(input("Search: "))
arttype = str(input("Is ir Review ? (y/n): "))
perpage = str(input("How many results do you want ? (10/20/50/100/200): "))
sort = str(input("Which options do you want ? (date/match): "))

if arttype == "y":
    arttype_in = "&filter=pubt.review"
else:
    arttype_in = ""

if sort == "data":
    sort2 = "&sort=data"
else:
    sort2 = ""

url = "https://pubmed.ncbi.nlm.nih.gov/?term=" + search + arttype_in + "&format=abstract" + sort2 + "&size=" + perpage

req = requests.get(url)
html = req.text
status = req.status_code

if status != 200:
    print ("")
else:
    print ("Stuck")

soup = BeautifulSoup(html, "html.parser")

contain_amount = soup.find ("div", {"class":"search-results"})
specific_amount = contain_amount.find ("div", {"class":"results-amount"}).text
print("Number of papers: " + str(specific_amount))

list_titles = soup.find_all ("div", {"class":"short-view"})
list_dois = soup.find_all ("a", {"class":"link-item dialog-focus"})

for i in list_dois:
    for j in list_titles:
        titles = j.find ("h1", {"class":"heading-title"}).text
        print ("Title: " + str(titles))
    dois = i.attrs["href"]
    print ("[" + str(dois) + "]")
Answer
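Change the selectors; about half of your code is already correct. Your nested loops run through all the titles again for every link, so a title and its link never end up side by side. Instead, select each result block (div with class results-article) once and read both the title and the link from inside that block: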
import requests
from bs4 import BeautifulSoup

search = input("Search: ")
arttype = input("Is ir Review ? (y/n): ")
perpage = input("How many results do you want ? (10/20/50/100/200): ")
sort = input("Which options do you want ? (date/match): ")

# Optional filter: restrict the results to review articles
if arttype == "y":
    arttype_in = "&filter=pubt.review"
else:
    arttype_in = ""

# The prompt offers "date", so test for "date" (not "data")
if sort == "date":
    sort2 = "&sort=date"
else:
    sort2 = ""

url = "https://pubmed.ncbi.nlm.nih.gov/?term=" + search + arttype_in + "&format=abstract" + sort2 + "&size=" + perpage
print(url)

req = requests.get(url)
html = req.text
status = req.status_code

# Warn if the request did not succeed
if status != 200:
    print("Stuck")

soup = BeautifulSoup(html, "html.parser")

# Each result sits in its own container, so its title and link stay paired
search_divs = soup.find_all("div", class_="results-article")

for div in search_divs:
    print("Title - {}".format(div.find("h1", class_="heading-title").get_text(strip=True)))
    # div.find("a") returns the first <a> inside the block; its href is relative
    print("Link - {}".format("https://pubmed.ncbi.nlm.nih.gov" + div.find("a")["href"]))
    print("---" * 25)

print("Number of papers - {}".format(soup.find("div", class_="results-amount").get_text(strip=True)))
Output:
Search: corona
Is ir Review ? (y/n): n
How many results do you want ? (10/20/50/100/200): 20
Which options do you want ? (date/match): match
https://pubmed.ncbi.nlm.nih.gov/?term=corona&format=abstract&size=20
Title - The history and epidemiology of Middle East respiratory syndrome corona virus
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Multidiscip+Respir+Med%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Personalized protein corona on nanoparticles and its clinical implications
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Biomater+Sci%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Nanoparticle-Protein Interaction: The Significance and Role of Protein Corona
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Adv+Exp+Med+Biol%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Gold nanoparticle should understand protein corona for being a clinical nanomaterial
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22J+Control+Release%22%5Bjour%5D
---------------------------------------------------------------------------
Title - The impact of protein corona on the behavior and targeting capability of nanoparticle-based delivery system
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Int+J+Pharm%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Liposome protein corona characterization as a new approach in nanomedicine
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Anal+Bioanal+Chem%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Shell-corona microgels from double interpenetrating networks
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Soft+Matter%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Protein corona: Opportunities and challenges
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Int+J+Biochem+Cell+Biol%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Biomolecular Corona Dictates Aβ Fibrillation Process
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22ACS+Chem+Neurosci%22%5Bjour%5D
---------------------------------------------------------------------------
Title - A health concern regarding the protein corona, aggregation and disaggregation
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Biochim+Biophys+Acta+Gen+Subj%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Formation and Characterization of Protein Corona Around Nanoparticles: A Review
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22J+Nanosci+Nanotechnol%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Silver nanoparticle protein corona and toxicity: a mini-review
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22J+Nanobiotechnology%22%5Bjour%5D
---------------------------------------------------------------------------
Title - The prevalence and morphology of the corona mortis (Crown of death): A meta-analysis with implications in abdominal wall and pelvic surgery
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Injury%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Possibilities and Limitations of Different Separation Techniques for the Analysis of the Protein Corona
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Angew+Chem+Int+Ed+Engl%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Translating Current Bioanalytical Techniques for Studying Corona Activity
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Trends+Biotechnol%22%5Bjour%5D
---------------------------------------------------------------------------
Title - The Crown and the Scepter: Roles of the Protein Corona in Nanomedicine
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Adv+Mater%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Protein corona - from molecular adsorption to physiological complexity
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Beilstein+J+Nanotechnol%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Understanding the nanoparticle-protein corona complexes using computational and experimental methods
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Int+J+Biochem+Cell+Biol%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Structure of corona radiata and tapetum fibers in ventricular surgery
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22J+Clin+Neurosci%22%5Bjour%5D
---------------------------------------------------------------------------
Title - A protein corona primer for physical chemists
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22J+Chem+Phys%22%5Bjour%5D
---------------------------------------------------------------------------
Number of papers - 954results
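On the ‘for’ part of the question: a loop inside another loop combines every item of one list with every item of the other, which is why titles and links did not line up. The small sketch below is only an illustration with made-up lists (titles, links, and the two helper functions are hypothetical names, not PubMed data); it contrasts nested loops with zip(), which steps through both lists together.

```python
# Hypothetical sample data standing in for scraped titles and links.
titles = ["Paper A", "Paper B", "Paper C"]
links = ["/link-a", "/link-b", "/link-c"]


def nested_pairs(titles, links):
    """Nested loops: every link is combined with every title."""
    lines = []
    for link in links:
        for title in titles:
            lines.append("Title: {} [{}]".format(title, link))
    return lines


def zipped_pairs(titles, links):
    """zip() walks both lists in step: first with first, second with second."""
    lines = []
    for title, link in zip(titles, links):
        lines.append("Title: {} [{}]".format(title, link))
    return lines


# 3 titles x 3 links gives 9 lines with nested loops
print(len(nested_pairs(titles, links)))

# zip() gives one line per title/link pair
for line in zipped_pairs(titles, links):
    print(line)
```

zip() only pairs things correctly when both lists have the same length and order. If some results have no link, the lists drift apart, so looping over each result block and reading the title and link from inside it, as the code above does, is the safer approach.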