Hi there – I am currently working on a little tiny scraper, and I am putting some pieces together. I have a URL which holds records of so-called digital hubs: see https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view
I want to export the 700 records into CSV format, that is, into an Excel spreadsheet. So far so good:
I have made some first experiments, which look pretty nice. See:
# Python program to print all heading tags
import requests
from bs4 import BeautifulSoup

# scraping the content
url_link = 'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view'
request = requests.get(url_link)
soup = BeautifulSoup(request.text, 'lxml')

# creating a list of all common heading tags
heading_tags = ["h1", "h2", "h3", "h4"]
for tag in soup.find_all(heading_tags):
    print(tag.name + ' -> ' + tag.text.strip())
which delivers:
h1 -> Smart Specialisation Platform
h1 -> Digital Innovation Hubs Digital Innovation Hubs
h2 -> Bavarian Robotic Network (BaRoN) Bayerisches Robotik-Netzwerk, BaRoN
h4 -> Contact Data
h4 -> Description
h4 -> Link to national or regional initiatives for digitising industry
h4 -> Market and Services
h4 -> Organization
h4 -> Evolutionary Stage
h4 -> Geographical Scope
h4 -> Funding
h4 -> Partners
h4 -> Technologies
I want to get the whole data set from https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool – so I need to iterate over roughly 700 URLs, see:
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/17865/view
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1416/view
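Since the hub detail pages only differ by the numeric id, the list of URLs can be generated rather than typed out. A minimal sketch – the three ids are just the examples above; collecting the full list of ~700 ids from the overview page is assumed:

```python
# Build the per-hub URLs from their numeric ids.
BASE = "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/{}/view"

hub_ids = [1096, 17865, 1416]  # placeholder: extend with the full list of ids
urls = [BASE.format(hub_id) for hub_id in hub_ids]

print(urls[0])
```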
Well, for the iterating part I think I can go this way to iterate through multiple URLs, which I can define in advance, using Requests and BeautifulSoup. Attached is what I have so far, i.e. trying to put the URLs in a list…
import requests
import bs4

URLs = ["https://example-url-1.com", "https://example-url-2.com"]
for url in URLs:
    result = requests.get(url)  # requests.get() takes a single URL, so loop over the list
    soup = bs4.BeautifulSoup(result.text, "lxml")
    print(soup.find_all('p'))
Well, to be frank: I’m also looking for a way to include an interval delay so as not to spam the server. So I could go like this:
import requests
import bs4
from time import sleep

URLs = ['https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view',
        'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/17865/view',
        'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1416/view']

def get_page(url):
    print('Indexing {0}......'.format(url))
    result = requests.get(url)
    print('Url indexed... now pausing 50 secs before the next one')
    sleep(50)
    return result

results = map(get_page, URLs)

for result in results:
    # soup = bs4.BeautifulSoup(result.text, "lxml")
    soup = bs4.BeautifulSoup(result.text, "html.parser")
    print(soup.find_all('p'))
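One caveat with the map() version: map is lazy in Python 3, so the requests (and the sleeps) only fire once you iterate over the results. A plain loop is easier to reason about. This is a minimal sketch; the 5-second default delay and the injectable fetch/pause hooks are my own additions (handy for testing), not anything the site prescribes:

```python
import time
import requests

def fetch_all(urls, fetch=requests.get, delay=5, pause=time.sleep):
    """Fetch each URL in turn, pausing `delay` seconds between requests."""
    pages = []
    for url in urls:
        print('Fetching {0}...'.format(url))
        pages.append(fetch(url))
        pause(delay)  # be polite to the server
    return pages
```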
And now for the parser part: parse the data, e.g. from here as an example: https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view
import requests
from bs4 import BeautifulSoup

url = 'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view'
r = requests.get(url)
html_as_string = r.text
soup = BeautifulSoup(html_as_string, 'html.parser')
for link in soup.find_all('p'):
    print(link.text)
Result: pretty awesome but unsorted – I want to store all the results in CSV format, that is, in an Excel sheet with the following columns:
Digital Innovation Hubs
h2 -> Bavarian Robotic Network (BaRoN) Bayerisches Robotik-Netzwerk, BaRoN
h4 -> Contact Data
h4 -> Description
h4 -> Link to national or regional initiatives for digitising industry
h4 -> Market and Services
h4 -> Organization
h4 -> Evolutionary Stage
h4 -> Geographical Scope
h4 -> Funding
h4 -> Partners
h4 -> Technologies
see the results:
Click on the following link if you want to propose a change of this HUB
You need an EU Login account for request proposals for editions or creations of new hubs. If you already have an ECAS account, you don’t have to create a new EU Login account. In EU Login, your credentials and personal data remain unchanged. You can still access the same services and applications as before. You just need to use your e-mail address for logging in. If you don’t have an EU Login account please use the following link. you can create one by clicking on the Create an account hyperlink. If you already have a user account for EU Login please login via https://webgate.ec.europa.eu/cas/login Sign in New user? Create an account Coordinator (University) Robotic Competence Center of Technical University of Munich, TUM CC Coordinator website http://www6.in.tum.de/en/home/ Year Established 2017 Location Schleißheimer Str. 90a, 85748, Garching bei München (Germany) Website http://www.robot.bayern Social Media
Contact information Adam Schmidt adam.schmidt@tum.de +49 (0)89 289-18064
Year Established 2017 Location Schleißheimer Str. 90a, 85748, Garching bei München (Germany) Website http://www.robot.bayern Social Media
Contact information Description BaRoN is an initiative bringing together several actors in Bavaria: the TUM Robotics Competence Center founded within the HORSE project, Bavarian Research Alliance (BayFOR), ITZB (Projektträger Bayern) and Bayerische Patentallianz, the latter three being members of the Bavarian Research and Innovation Agency) in order to facilitate the process of robotizing Bavarian manufacturing sector. In its current form it is an informal alliance of established institutions with a vast experience in the field of bringing and facilitating innovation in Bavaria. The mission of the network is to make Bavaria the forerunner of the digitalized and robotized European industry. The mission is realized by offering services ranging from providing the technological expertise, access to the robotic equipment, IPR advice and management, and funding facilitation to various entities of the Bavarian manufacturing ecosystem – start-ups, SMEs, research institutes, universities and other institutions interested in embracing the Industry 4.0 revolution. BaRoN verbindet mehrere Bayerische Akteure mit einem gemeinsamen Ziel – die Robotisierung des Bayerischen produzierenden Gewerbes voranzutreiben. OP Bayern ERDF 2014-2020 Enhancing the competitiveness of SMEs through the creation and the extension of advanced capacities for product and service developments and through internationalisation initiatives Budget 1.4B€
Well, to sum up: I am getting back an awful and unsorted data chunk with BS4 – how to clean it up with Pandas so that I have a clean table with the following columns:
h1 -> Smart Specialisation Platform
h1 -> Digital Innovation Hubs Digital Innovation Hubs
h2 -> Bavarian Robotic Network (BaRoN) Bayerisches Robotik-Netzwerk, BaRoN
h4 -> Contact Data
h4 -> Description
h4 -> Link to national or regional initiatives for digitising industry
h4 -> Market and Services
h4 -> Organization
h4 -> Evolutionary Stage
h4 -> Geographical Scope
h4 -> Funding
h4 -> Partners
h4 -> Technologies
Update: thanks to Tim Roberts I have seen that we have the following combination:
class: hubCard
    class: hubCardTitle
    class: hubCardContent
        class: infoLabel > Description > <p> Text - data - content </p>
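Based on this structure, selecting a card by its title and reading its content could look like the sketch below. The toy HTML here is an assumption pieced together from the class names above (with the card title assumed to sit in an <h4> inside hubCardTitle) – the real page may nest things differently, so the selectors would need checking against the live markup:

```python
from bs4 import BeautifulSoup

# Toy HTML mimicking the assumed card structure.
html = """
<div class="hubCard">
  <div class="hubCardTitle"><h4>Contact Data</h4></div>
  <div class="hubCardContent"><p>Adam Schmidt</p></div>
</div>
<div class="hubCard">
  <div class="hubCardTitle"><h4>Description</h4></div>
  <div class="hubCardContent"><p>BaRoN is an initiative ...</p></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

def card_text(soup, title):
    """Return the text of the hubCardContent whose card title matches `title`."""
    for card in soup.select("div.hubCard"):
        h4 = card.select_one(".hubCardTitle h4")
        if h4 and h4.get_text(strip=True) == title:
            content = card.select_one(".hubCardContent")
            return content.get_text(" ", strip=True)
    return None

print(card_text(soup, "Description"))
```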
With this, we can extend the parsing job step by step. Many thanks to you, Tim!
That said, I am just trying to get the data for the other fields of interest, e.g. the description text, which starts like so:
BaRoN is an initiative bringing together several actors in Bavaria: the TUM Robotics Competence Center founded within the HORSE project, Bavarian Research Alliance (BayFOR), ITZB (Projektträger Bayern)
I applied your ideas, Tim, to the code – but it does not work:
import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = 'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html5lib')

# The name of the hub is in the <h4> tag.
hubname = soup.find('h4').text

# All contact info is within a <div class='hubCard'>.
description = soup.find("div", class_="hubCardContent")

cardinfo = {}

# Grab all the <p> tags inside that div. The infoLabel class marks
# the section header.
for data in description.find_all('p'):
    if 'infoLabel' in data.attrs.get('class', []):
        Description = data.text
        cardinfo[Description] = []
    else:
        cardinfo[Description].append(data.text)

# The contact info is in a <div> inside that div.
#for data in contact.find_all('div', class_='infoMultiValue'):
#    cardinfo['Description'].append(data.text)

print("---")
print(hubname)
print("---")
pprint(cardinfo)
It always gives back the contact information – but not the data that I am looking for, the text of the description. I am doing something wrong…
Answer
Maybe this can give you a start. You HAVE to dig into the HTML to find the key markers for the information you want. I’m sensing that you want the title and the contact information. The title is in an <h2> tag, the only such tag on the page. The contact info is within a <div class='hubCard'> tag, so we can grab that and pull out the pieces.
import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = 'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html5lib')

# The name of the hub is in the <h2> tag.
hubname = soup.find('h2').text

# All contact info is within a <div class='hubCard'>.
contact = soup.find("div", class_="hubCard")

cardinfo = {}

# Grab all the <p> tags inside that div. The infoLabel class marks
# the section header.
for data in contact.find_all('p'):
    if 'infoLabel' in data.attrs.get('class', []):
        title = data.text
        cardinfo[title] = []
    else:
        cardinfo[title].append(data.text)

# The contact info is in a <div> inside that div.
for data in contact.find_all('div', class_='infoMultiValue'):
    cardinfo['Contact information'].append(data.text)

print("---")
print(hubname)
print("---")
pprint(cardinfo)
Output:
---
Bavarian Robotic Network (BaRoN) Bayerisches Robotik-Netzwerk, BaRoN
---
{'Contact information': [' Adam Schmidt',
                         ' adam.schmidt@tum.de',
                         ' +49 (0)89 289-18064'],
 'Coordinator (University)': ['',
                              'Robotic Competence Center of Technical '
                              'University of Munich, TUM CC'],
 'Coordinator website': ['http://www6.in.tum.de/en/home/\n\t\t\t\t\t\n'
                         '\t\t\t\t\t\n'
                         '\t\t\t\t\t'],
 'Location': ['Schleißheimer Str. 90a, 85748, Garching bei München (Germany)'],
 'Social Media': ['\n'
                  '\t\t\t\t\t\t\n'
                  '\t\t\t\t\t\t\t\t\t\t \n'
                  '\t\t\t\t\t\t\t\t\t\t\n'
                  '\t\t\t\t\t\t\t\t\t\t \n'
                  '\t\t\t\t\t\t\t\t\t\t\n'
                  '\t\t\t\t\t\t\t\t\t\t \n'
                  '\t\t\t\t\t\t\t\t\t\t\n'
                  '\t\t\t\t\t\t'],
 'Website': ['http://www.robot.bayern'],
 'Year Established': ['2017']}
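To get from per-hub dicts like the one above to the CSV/Excel sheet asked for in the question, one flat dict per hub can be handed to pandas. A minimal sketch – the row here is hand-filled with values from the example output, and the column names are my own choice; in the real run you would build one such dict per scraped hub from hubname and cardinfo (joining list values into strings):

```python
import pandas as pd

# One flattened dict per hub; in the real scrape, append one per URL.
rows = [
    {"Hub": "Bavarian Robotic Network (BaRoN)",
     "Year Established": "2017",
     "Website": "http://www.robot.bayern"},
]

df = pd.DataFrame(rows)
df.to_csv("hubs.csv", index=False)       # opens fine in Excel
# df.to_excel("hubs.xlsx", index=False)  # alternative; needs openpyxl installed
print(df.shape)
```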