Hi there – I am currently working on a little tiny scraper, and I am putting some pieces together. I have a URL which holds records of so-called digital hubs: see https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view
I want to export the 700 records into CSV format, that is, into an Excel spreadsheet. So far so good:
I have made some first experiments, which look pretty nice. See:
# Python program to print all heading tags
import requests
from bs4 import BeautifulSoup

# scraping the content
url_link = 'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view'
request = requests.get(url_link)
soup = BeautifulSoup(request.text, 'lxml')

# creating a list of all common heading tags
heading_tags = ["h1", "h2", "h3", "h4"]
for tag in soup.find_all(heading_tags):
    print(tag.name + ' -> ' + tag.text.strip())
which delivers:
h1 -> Smart Specialisation Platform
h1 -> Digital Innovation Hubs Digital Innovation Hubs
h2 -> Bavarian Robotic Network (BaRoN) Bayerisches Robotik-Netzwerk, BaRoN
h4 -> Contact Data
h4 -> Description
h4 -> Link to national or regional initiatives for digitising industry
h4 -> Market and Services
h4 -> Organization
h4 -> Evolutionary Stage
h4 -> Geographical Scope
h4 -> Funding
h4 -> Partners
h4 -> Technologies
I want to get the whole data set from https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool – so I need to iterate over roughly 700 URLs, see:
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/17865/view
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1416/view
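Since the hub detail pages only differ by the numeric id, the list of URLs can be generated rather than typed out. A minimal sketch – the three ids are just the examples above; collecting the full list of ~700 ids from the overview page is assumed:

```python
# Build the per-hub URLs from their numeric ids.
BASE = "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/{}/view"

hub_ids = [1096, 17865, 1416]  # placeholder: extend with the full list of ids
urls = [BASE.format(hub_id) for hub_id in hub_ids]

print(urls[0])
```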
Well, for the iterating part I think I can go this way to iterate through multiple URLs, which I can define in advance, using Requests and BeautifulSoup. Attached is what I have so far, i.e. trying to put the URLs in a list…
import requests
import bs4

URLs = ["https://example-url-1.com", "https://example-url-2.com"]
for url in URLs:
    result = requests.get(url)  # requests.get() takes a single URL, so loop over the list
    soup = bs4.BeautifulSoup(result.text, "lxml")
    print(soup.find_all('p'))
Well, to be frank: I’m also looking for a way to include an interval delay so as not to spam the server. So I could go like this:
import requests
import bs4
from time import sleep

URLs = ['https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view',
        'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/17865/view',
        'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1416/view']

def get_page(url):
    print('Indexing {0}......'.format(url))
    result = requests.get(url)
    print('Url indexed... now pausing 50 secs before the next one')
    sleep(50)
    return result

results = map(get_page, URLs)

for result in results:
    # soup = bs4.BeautifulSoup(result.text, "lxml")
    soup = bs4.BeautifulSoup(result.text, "html.parser")
    print(soup.find_all('p'))
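One caveat with the map() version: map is lazy in Python 3, so the requests (and the sleeps) only fire once you iterate over the results. A plain loop is easier to reason about. This is a minimal sketch; the 5-second default delay and the injectable fetch/pause hooks are my own additions (handy for testing), not anything the site prescribes:

```python
import time
import requests

def fetch_all(urls, fetch=requests.get, delay=5, pause=time.sleep):
    """Fetch each URL in turn, pausing `delay` seconds between requests."""
    pages = []
    for url in urls:
        print('Fetching {0}...'.format(url))
        pages.append(fetch(url))
        pause(delay)  # be polite to the server
    return pages
```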
And now for the parser part: parse the data, e.g. from here as an example: https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view
import requests
from bs4 import BeautifulSoup

url = 'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view'
r = requests.get(url)
html_as_string = r.text
soup = BeautifulSoup(html_as_string, 'html.parser')
for link in soup.find_all('p'):
    print(link.text)
Result: pretty awesome but unsorted – I want to store all the results in CSV format, that is, in an Excel sheet with the following columns:
Digital Innovation Hubs
h2 -> Bavarian Robotic Network (BaRoN) Bayerisches Robotik-Netzwerk, BaRoN
h4 -> Contact Data
h4 -> Description
h4 -> Link to national or regional initiatives for digitising industry
h4 -> Market and Services
h4 -> Organization
h4 -> Evolutionary Stage
h4 -> Geographical Scope
h4 -> Funding
h4 -> Partners
h4 -> Technologies
see the results:
Click on the following link if you want to propose a change of this HUB
You need an EU Login account for request proposals for editions or creations of new hubs. If you already have an ECAS account, you don’t have to create a new EU Login account. In EU Login, your credentials and personal data remain unchanged. You can still access the same services and applications as before. You just need to use your e-mail address for logging in. If you don’t have an EU Login account please use the following link. you can create one by clicking on the Create an account hyperlink. If you already have a user account for EU Login please login via https://webgate.ec.europa.eu/cas/login Sign in New user? Create an account Coordinator (University) Robotic Competence Center of Technical University of Munich, TUM CC Coordinator website http://www6.in.tum.de/en/home/ Year Established 2017 Location Schleißheimer Str. 90a, 85748, Garching bei München (Germany) Website http://www.robot.bayern Social Media
Contact information Adam Schmidt adam.schmidt@tum.de +49 (0)89 289-18064
Year Established 2017 Location Schleißheimer Str. 90a, 85748, Garching bei München (Germany) Website http://www.robot.bayern Social Media
Contact information Description BaRoN is an initiative bringing together several actors in Bavaria: the TUM Robotics Competence Center founded within the HORSE project, Bavarian Research Alliance (BayFOR), ITZB (Projektträger Bayern) and Bayerische Patentallianz, the latter three being members of the Bavarian Research and Innovation Agency) in order to facilitate the process of robotizing Bavarian manufacturing sector. In its current form it is an informal alliance of established institutions with a vast experience in the field of bringing and facilitating innovation in Bavaria. The mission of the network is to make Bavaria the forerunner of the digitalized and robotized European industry. The mission is realized by offering services ranging from providing the technological expertise, access to the robotic equipment, IPR advice and management, and funding facilitation to various entities of the Bavarian manufacturing ecosystem – start-ups, SMEs, research institutes, universities and other institutions interested in embracing the Industry 4.0 revolution. BaRoN verbindet mehrere Bayerische Akteure mit einem gemeinsamen Ziel – die Robotisierung des Bayerischen produzierenden Gewerbes voranzutreiben. OP Bayern ERDF 2014-2020 Enhancing the competitiveness of SMEs through the creation and the extension of advanced capacities for product and service developments and through internationalisation initiatives Budget 1.4B€
Well, to sum up: I am getting back an awful and unsorted data chunk with BS4 – how to clean it up with Pandas so that I have a clean table with the following columns:
h1 -> Smart Specialisation Platform
h1 -> Digital Innovation Hubs Digital Innovation Hubs
h2 -> Bavarian Robotic Network (BaRoN) Bayerisches Robotik-Netzwerk, BaRoN
h4 -> Contact Data
h4 -> Description
h4 -> Link to national or regional initiatives for digitising industry
h4 -> Market and Services
h4 -> Organization
h4 -> Evolutionary Stage
h4 -> Geographical Scope
h4 -> Funding
h4 -> Partners
h4 -> Technologies
Update: thanks to Tim Roberts I have seen that we have the following combination:
class: hubCard
    class: hubCardTitle
    class: hubCardContent
        class: infoLabel > Description > <p> Text - data - content </p>
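Based on this structure, selecting a card by its title and reading its content could look like the sketch below. The toy HTML here is an assumption pieced together from the class names above (with the card title assumed to sit in an <h4> inside hubCardTitle) – the real page may nest things differently, so the selectors would need checking against the live markup:

```python
from bs4 import BeautifulSoup

# Toy HTML mimicking the assumed card structure.
html = """
<div class="hubCard">
  <div class="hubCardTitle"><h4>Contact Data</h4></div>
  <div class="hubCardContent"><p>Adam Schmidt</p></div>
</div>
<div class="hubCard">
  <div class="hubCardTitle"><h4>Description</h4></div>
  <div class="hubCardContent"><p>BaRoN is an initiative ...</p></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

def card_text(soup, title):
    """Return the text of the hubCardContent whose card title matches `title`."""
    for card in soup.select("div.hubCard"):
        h4 = card.select_one(".hubCardTitle h4")
        if h4 and h4.get_text(strip=True) == title:
            content = card.select_one(".hubCardContent")
            return content.get_text(" ", strip=True)
    return None

print(card_text(soup, "Description"))
```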
With this, we can extend the parsing job step by step. Many thanks to you, Tim!
That said, I am just trying to get the data for the other fields of interest, e.g. the description text, which starts like so:
BaRoN is an initiative bringing together several actors in Bavaria: the TUM Robotics Competence Center founded within the HORSE project, Bavarian Research Alliance (BayFOR), ITZB (Projektträger Bayern)
I applied your ideas, Tim, to the code – but it does not work:
import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = 'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html5lib')

# The name of the hub is in the <h4> tag.
hubname = soup.find('h4').text

# All contact info is within a <div class='hubCard'>.
description = soup.find("div", class_="hubCardContent")

cardinfo = {}

# Grab all the <p> tags inside that div. The infoLabel class marks
# the section header.
for data in description.find_all('p'):
    if 'infoLabel' in data.attrs.get('class', []):
        Description = data.text
        cardinfo[Description] = []
    else:
        cardinfo[Description].append(data.text)

# The contact info is in a <div> inside that div.
#for data in contact.find_all('div', class_='infoMultiValue'):
#    cardinfo['Description'].append(data.text)

print("---")
print(hubname)
print("---")
pprint(cardinfo)
It always gives back the contact information – but not the data that I am looking for, the text of the description. I am doing something wrong…
Answer
Maybe this can give you a start. You HAVE to dig into the HTML to find the key markers for the information you want. I’m sensing that you want the title and the contact information. The title is in an <h2> tag, the only such tag on the page. The contact info is within a <div class='hubCard'> tag, so we can grab that and pull out the pieces.
import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = 'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html5lib')

# The name of the hub is in the <h2> tag.
hubname = soup.find('h2').text

# All contact info is within a <div class='hubCard'>.
contact = soup.find("div", class_="hubCard")

cardinfo = {}

# Grab all the <p> tags inside that div. The infoLabel class marks
# the section header.
for data in contact.find_all('p'):
    if 'infoLabel' in data.attrs.get('class', []):
        title = data.text
        cardinfo[title] = []
    else:
        cardinfo[title].append(data.text)

# The contact info is in a <div> inside that div.
for data in contact.find_all('div', class_='infoMultiValue'):
    cardinfo['Contact information'].append(data.text)

print("---")
print(hubname)
print("---")
pprint(cardinfo)
Output:
---
Bavarian Robotic Network (BaRoN) Bayerisches Robotik-Netzwerk, BaRoN
---
{'Contact information': [' Adam Schmidt',
                         ' adam.schmidt@tum.de',
                         ' +49 (0)89 289-18064'],
 'Coordinator (University)': ['',
                              'Robotic Competence Center of Technical '
                              'University of Munich, TUM CC'],
 'Coordinator website': ['http://www6.in.tum.de/en/home/\n\t\t\t\t\t\n'
                         '\t\t\t\t\t\n'
                         '\t\t\t\t\t'],
 'Location': ['Schleißheimer Str. 90a, 85748, Garching bei München (Germany)'],
 'Social Media': ['\n'
                  '\t\t\t\t\t\t\n'
                  '\t\t\t\t\t\t\t\t\t\t \n'
                  '\t\t\t\t\t\t\t\t\t\t\n'
                  '\t\t\t\t\t\t\t\t\t\t \n'
                  '\t\t\t\t\t\t\t\t\t\t\n'
                  '\t\t\t\t\t\t\t\t\t\t \n'
                  '\t\t\t\t\t\t\t\t\t\t\n'
                  '\t\t\t\t\t\t'],
 'Website': ['http://www.robot.bayern'],
 'Year Established': ['2017']}
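To get from per-hub dicts like the one above to the CSV/Excel sheet asked for in the question, one flat dict per hub can be handed to pandas. A minimal sketch – the row here is hand-filled with values from the example output, and the column names are my own choice; in the real run you would build one such dict per scraped hub from hubname and cardinfo (joining list values into strings):

```python
import pandas as pd

# One flattened dict per hub; in the real scrape, append one per URL.
rows = [
    {"Hub": "Bavarian Robotic Network (BaRoN)",
     "Year Established": "2017",
     "Website": "http://www.robot.bayern"},
]

df = pd.DataFrame(rows)
df.to_csv("hubs.csv", index=False)       # opens fine in Excel
# df.to_excel("hubs.xlsx", index=False)  # alternative; needs openpyxl installed
print(df.shape)
```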