
Iterate through multiple URLs with BS4 and store the results in CSV format

Hi there, I am currently working on a tiny scraper and am putting some pieces together. I have a URL which holds the record of a so-called digital hub: see https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view

I want to export the 700 records to CSV format, that is, into an Excel spreadsheet. So far so good:

I have made some first experiments, which look pretty nice.

see

JavaScript

which delivers:

JavaScript

I want to get the data set at https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool. For that I need to iterate over 700 URLs, for example:

https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/17865/view
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1416/view

For the iterating part, I think I can go through multiple URLs, which I define in advance, using Requests and BeautifulSoup. Attached is what I have so far, i.e. trying to put the URLs in a list:
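The poster's own snippet is not preserved here, so the following is only a minimal sketch of the idea: build the list of view URLs from hub IDs and loop over it. The names `build_urls` and `scrape_all` are hypothetical, and `fetch` and `parse` are passed in as plain callables so that the loop can be tried offline before it is pointed at the real server.

```python
# URL pattern taken from the example links in the question.
BASE = ("https://s3platform-legacy.jrc.ec.europa.eu/"
        "digital-innovation-hubs-tool/-/dih/{hub_id}/view")


def build_urls(hub_ids):
    """Turn a list of hub IDs into the matching view URLs."""
    return [BASE.format(hub_id=hub_id) for hub_id in hub_ids]


def scrape_all(urls, fetch, parse):
    """Fetch every URL and return one parsed record per page, in input order.

    fetch(url) must return the page HTML as a string,
    parse(html) must return a dict for that page.
    """
    records = []
    for url in urls:
        record = parse(fetch(url))
        record["url"] = url  # remember where the record came from
        records.append(record)
    return records


# The three IDs that appear in the question; the full list of 700 IDs
# has to come from somewhere else (it is not given in the post).
urls = build_urls([1096, 17865, 1416])

# With Requests installed, a real fetch function could look like this:
#   fetch = lambda url: requests.get(url, timeout=30).text
```

Keeping the download step separate from the loop also makes it easy to add a delay or error handling later without touching the parsing code.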

JavaScript

To be frank, I'm also looking for a way to include an interval delay so as not to spam the server. So I could do it like this:
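One possible way to add such a delay, shown only as a sketch: a small generator that hands out the URLs one by one and sleeps for a random interval between them. The name `polite_iter` and the default bounds of 1 to 3 seconds are assumptions, not something from the original post.

```python
import random
import time


def polite_iter(urls, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Yield URLs one at a time, pausing a random interval between them.

    No pause happens before the first URL. The sleep function can be
    swapped out, which is handy for trying the loop without waiting.
    """
    for index, url in enumerate(urls):
        if index > 0:
            sleep(random.uniform(min_delay, max_delay))
        yield url
```

Used as `for url in polite_iter(urls): ...`, every request after the first one is preceded by a pause, and the random interval avoids hitting the server at a fixed rhythm.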

JavaScript

And now for the parser part: parse the data, e.g. from this page as an example: https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view

JavaScript

The result is pretty awesome but unsorted. I want to store all the results in CSV format, that is, in an Excel sheet with the following columns:

Digital Innovation Hubs
h2 -> Bavarian Robotic Network (BaRoN) Bayerisches Robotik-Netzwerk, BaRoN
h4 -> Contact Data
h4 -> Description
h4 -> Link to national or regional initiatives for digitising industry
h4 -> Market and Services
h4 -> Organization
h4 -> Evolutionary Stage
h4 -> Geographical Scope
h4 -> Funding
h4 -> Partners
h4 -> Technologies
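As a rough sketch of the storage step, the column list above can be written with the standard `csv` module, one row per hub. The column name `Title` for the h2 value and the function name `write_hubs_csv` are assumptions; fields missing from a record are left empty and unknown keys are ignored.

```python
import csv
import io

# "Title" stands for the h2 value; the rest are the h4 headings listed above.
COLUMNS = [
    "Title",
    "Contact Data",
    "Description",
    "Link to national or regional initiatives for digitising industry",
    "Market and Services",
    "Organization",
    "Evolutionary Stage",
    "Geographical Scope",
    "Funding",
    "Partners",
    "Technologies",
]


def write_hubs_csv(records, fileobj):
    """Write one CSV row per hub record (a dict keyed by column name)."""
    writer = csv.DictWriter(
        fileobj,
        fieldnames=COLUMNS,
        restval="",             # missing fields become empty cells
        extrasaction="ignore",  # keys that are not columns are dropped
    )
    writer.writeheader()
    writer.writerows(records)


# Offline demo with text taken from the example page in the question.
buffer = io.StringIO()
write_hubs_csv(
    [{
        "Title": "Bavarian Robotic Network (BaRoN)",
        "Description": "BaRoN is an initiative bringing together several actors in Bavaria",
        "url": "ignored, because it is not one of the columns",
    }],
    buffer,
)
```

To write a real file, something like `open("hubs.csv", "w", newline="", encoding="utf-8-sig")` can be passed in place of the buffer; the `utf-8-sig` encoding usually helps Excel display umlauts such as the ones in "Schleißheimer Str." correctly.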

see

JavaScript

see the results:

Click on the following link if you want to propose a change of this HUB

You need an EU Login account for request proposals for editions or creations of new hubs. If you already have an ECAS account, you don’t have to create a new EU Login account. In EU Login, your credentials and personal data remain unchanged. You can still access the same services and applications as before. You just need to use your e-mail address for logging in. If you don’t have an EU Login account please use the following link. you can create one by clicking on the Create an account hyperlink. If you already have a user account for EU Login please login via https://webgate.ec.europa.eu/cas/login Sign in New user? Create an account Coordinator (University) Robotic Competence Center of Technical University of Munich, TUM CC Coordinator website http://www6.in.tum.de/en/home/ Year Established 2017 Location Schleißheimer Str. 90a, 85748, Garching bei München (Germany) Website http://www.robot.bayern Social Media

Contact information Adam Schmidt adam.schmidt@tum.de +49 (0)89 289-18064

Year Established 2017 Location Schleißheimer Str. 90a, 85748, Garching bei München (Germany) Website http://www.robot.bayern Social Media

Contact information Description BaRoN is an initiative bringing together several actors in Bavaria: the TUM Robotics Competence Center founded within the HORSE project, Bavarian Research Alliance (BayFOR), ITZB (Projektträger Bayern) and Bayerische Patentallianz, the latter three being members of the Bavarian Research and Innovation Agency) in order to facilitate the process of robotizing Bavarian manufacturing sector. In its current form it is an informal alliance of established institutions with a vast experience in the field of bringing and facilitating innovation in Bavaria. The mission of the network is to make Bavaria the forerunner of the digitalized and robotized European industry. The mission is realized by offering services ranging from providing the technological expertise, access to the robotic equipment, IPR advice and management, and funding facilitation to various entities of the Bavarian manufacturing ecosystem – start-ups, SMEs, research institutes, universities and other institutions interested in embracing the Industry 4.0 revolution. BaRoN verbindet mehrere Bayerische Akteure mit einem gemeinsamen Ziel – die Robotisierung des Bayerischen produzierenden Gewerbes voranzutreiben. OP Bayern ERDF 2014-2020 Enhancing the competitiveness of SMEs through the creation and the extension of advanced capacities for product and service developments and through internationalisation initiatives Budget 1.4B€

To sum up: I am getting back an awful, unsorted chunk of data with BS4. How can I clean it up with Pandas so that I have a clean table with the following columns?

JavaScript

Update: thanks to Tim Roberts I have seen that we have the following combination:

JavaScript

With this we can extend the parsing job step by step. Many thanks to you, Tim!


That said, I am now trying to get the data for the other fields of interest, e.g. the description text, which starts like so:

BaRoN is an initiative bringing together several actors in Bavaria: the TUM Robotics Competence Center founded within the HORSE project, Bavarian Research Alliance (BayFOR), ITZB (Projektträger Bayern)

I applied your ideas to the code, Tim, but it does not work:

JavaScript

It always gives back the contact information, but not the data that I am looking for, the text of the description. I am doing something wrong…
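Since the failing code is not shown, the cause cannot be pinned down here. One approach that might help, sketched under the assumption that each section's text follows its h4 heading in document order, is to collect all text between one h4 and the next. The function name `sections_by_h4` and the sample HTML are made up for illustration; the real page markup may be nested differently.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def sections_by_h4(html):
    """Map each h4 heading text to the text that follows it, up to the next h4."""
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    for heading in soup.find_all("h4"):
        chunks = []
        for node in heading.next_elements:
            if isinstance(node, Tag) and node.name == "h4":
                break  # the next section starts here
            if isinstance(node, NavigableString):
                # skip the heading's own text
                if any(parent is heading for parent in node.parents):
                    continue
                text = node.strip()
                if text:
                    chunks.append(text)
        sections[heading.get_text(strip=True)] = " ".join(chunks)
    return sections


# Hypothetical markup, only to show the grouping.
sample = (
    "<div><h4>Contact Data</h4><p>Adam Schmidt</p>"
    "<h4>Description</h4><div><p>BaRoN is an <b>initiative</b></p></div>"
    "<h4>Funding</h4></div>"
)
sections = sections_by_h4(sample)
```

With a dict like this, the description would simply be `sections.get("Description", "")`, and the other h4 fields can be pulled out the same way.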


Answer

Maybe this can give you a start. You HAVE to dig into the HTML to find the key markers for the information you want. I’m sensing that you want the title, and the contact information. The title is in an <h2> tag, the only such tag on the page. The contact info is within a <div class='hubCard'> tag, so we can grab that and pull out the pieces.

JavaScript

Output:

JavaScript
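The answer's own code is not preserved above, so here is a minimal sketch of what it describes: take the single `<h2>` as the title and the text of `<div class='hubCard'>` as the contact block. The function name `parse_hub` and the inner structure of the sample card are assumptions.

```python
from bs4 import BeautifulSoup


def parse_hub(html):
    """Extract the hub title (h2) and the hubCard text from one page."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("h2")
    card = soup.find("div", class_="hubCard")
    return {
        # fall back to an empty string if a marker is missing on a page
        "title": title_tag.get_text(" ", strip=True) if title_tag else "",
        "contact": card.get_text(" ", strip=True) if card else "",
    }


# Hypothetical markup; only the h2 and the hubCard class come from the answer.
sample = (
    "<h2>Bavarian Robotic Network (BaRoN)</h2>"
    "<div class='hubCard'><p>Adam Schmidt</p><p>adam.schmidt@tum.de</p></div>"
)
hub = parse_hub(sample)
```

Returning empty strings instead of raising keeps a long run over 700 pages going even when one page lacks a card.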
User contributions licensed under: CC BY-SA