I have an HTML document from which I want to extract the address, but I am unable to. The address is not enclosed in a tag of its own, so as a beginner I cannot select it directly (e.g. with find() or similar). Here is the HTML document:
    <!DOCTYPE html>
    <html lang="en">
    <head>
      <meta charset="UTF-8">
      <title>Title</title>
    </head>
    <body>
      <table class="novip">
        <tr class="novip">
          <td class="novip-portrait-picture" rowspan="5">
            <a class="novip" href="refer.html">URL</a>
          </td>
          <td class="novip-left">
            <a class="novip-firmen-name" href="refer.html" target="_top"> John Doe </a>
          </td>
          <td class="novip-right" rowspan="2">
            <a class="novip" href="refer.html">URL</a>
          </td>
        </tr>
        <tr class="novip">
          <td class="novip-left">
            <span class="novip-left-titel"> Prof. </span>
            <span class="novip-left-fachbezeichnung"> Professor for History </span>
            <br/>
            Rose Avenue 33, 4302843 A City
            <br/>
            Tel: <a>234 23 43244</a>
            <a class="novip-left-make_appointment-button-active">Booking</a>
          </td>
        </tr>
      </table>
    </body>
    </html>
I would like to extract the address: Rose Avenue 33, 4302843 A City.
Here is my attempt, but I cannot narrow it down far enough:
    import requests
    from bs4 import BeautifulSoup

    r = requests.get(url)  # url is the address of the page shown above
    r.encoding = 'utf8'
    html_doc = r.text
    soup = BeautifulSoup(html_doc, features='html5lib')

    tables = soup.find_all("table", {"class": "novip"})
    for table in tables:
        rows = table.findChildren('tr')
        # This returns the specialisation text, not the address that follows it
        address = rows[1].find('span', 'novip-left-fachbezeichnung').text
Answer
The following code builds on your attempt. It uses requests, bs4 (BeautifulSoup) and pandas:
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    url = 'https://www.doktor.ch/gynaekologen/gynaekologen_k_lu.html'

    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')

    dr_list = []
    # Each doctor is rendered as one <table class="novip">
    doctor_cards = soup.select('table.novip')
    for card in doctor_cards:
        try:
            dr_name = card.select_one('a.novip-firmen-name').text.strip()
        except Exception as e:
            dr_name = 'No Name'
        try:
            dr_url = card.select_one('a').get('href')
        except Exception as e:
            dr_url = 'No Url'
        try:
            dr_title = card.select_one('span.novip-left-titel').text.strip()
        except Exception as e:
            dr_title = 'No title'
        try:
            dr_specialisation = card.select_one('span.novip-left-fachbezeichnung').text.strip()
        except Exception as e:
            dr_specialisation = 'No specialisation'
        try:
            # Optional address supplement, followed by the bare address text
            dr_address_span = card.select_one('span.novip-left-adresszusatz')
            dr_address = dr_address_span.text.strip() + ' ' + dr_address_span.next_sibling.strip()
        except Exception as e:
            dr_address_span = 'No address'
            # No supplement: the address is a bare text node one to three
            # siblings after the specialisation span
            if len(card.select_one('span.novip-left-fachbezeichnung').next_sibling.strip()) > 5:
                dr_address = card.select_one('span.novip-left-fachbezeichnung').next_sibling.strip().replace('\n', ' ')
            elif len(card.select_one('span.novip-left-fachbezeichnung').next_sibling.next_sibling) > 5:
                dr_address = card.select_one('span.novip-left-fachbezeichnung').next_sibling.next_sibling.text.strip().replace('\n', ' ')
            else:
                dr_address = card.select_one('span.novip-left-fachbezeichnung').next_sibling.next_sibling.next_sibling.strip().replace('\n', ' ')
        dr_list.append((dr_name, dr_title, dr_specialisation, dr_address))

    df = pd.DataFrame(dr_list, columns = ['Name', 'Title', 'Spec', 'Address'])
    df.to_csv('swiss_docs.csv')
    print(df.head())
This saves the doctors' details to the CSV file swiss_docs.csv. The data looks like this:
       Name              Title     Spec                                            Address
    0  Wey Barbara       Dr. med.  Fachärztin FMH für Gynäkologie u. Geburtshilfe  Hauptstrasse 12, 6033 Buchrain Tel: 041 444 30 80 Terminanfrage Karte
    1  Bohl Urs          Dr. med.  Facharzt FMH für Gynäkologie und Geburtshilfe   Seetalstrasse 11, 6020 Emmenbrücke
    2  Füchsel Glenn     Dr. med.  Facharzt für Gynäkologie und Geburtshilfe       docstation Gesundheitszentrum Emmen Mooshüslistrasse 6, 6032 Emmen
    3  Dal Pian Désirée  Dr. med.  Fachärztin FMH für Gynäkologie u. Geburtshilfe  Frauenpraxis Zero Plus Am Mattenhof 4a, 6010 Kriens
    4  Gilke Ursula      Dr. med.  Fachärztin für Gynäkologie u. Geburtshilfe      Schachenstrasse 5, 6010 Kriens
    5  Amann Stefanie    Dr. med.  Fachärztin FMH Gynäkologie u. Geburtshilfe      Frauenpraxis am See Alpenstrasse 1, 6004 Luzern
    6  Ballabio Nadja    Dr. med.  Fachärztin FMH Gynäkologie und Geburtshilfe     gyn-zentrum ag Haldenstrasse 11, 6006 Luzern
    [...]
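The repeated try/except blocks in the code above could probably be folded into one small helper. The following is only a sketch: select_text is a hypothetical name, not part of bs4, and the card here is a tiny stand-in modelled on the question's sample HTML rather than the live page.

```python
from bs4 import BeautifulSoup


def select_text(card, selector, default):
    """Return the stripped text of the first match for `selector`, else `default`.

    Hypothetical helper (not a bs4 function).
    """
    match = card.select_one(selector)
    return match.text.strip() if match is not None else default


# Tiny stand-in card modelled on the question's sample HTML (an assumption,
# not the real page). It deliberately has no specialisation span.
card = BeautifulSoup(
    '<table class="novip"><tr><td>'
    '<a class="novip-firmen-name" href="refer.html"> John Doe </a>'
    '<span class="novip-left-titel"> Prof. </span>'
    "</td></tr></table>",
    "html.parser",
)

dr_name = select_text(card, "a.novip-firmen-name", "No Name")
dr_title = select_text(card, "span.novip-left-titel", "No title")
dr_specialisation = select_text(
    card, "span.novip-left-fachbezeichnung", "No specialisation"
)
print(dr_name, dr_title, dr_specialisation, sep=" | ")
# John Doe | Prof. | No specialisation
```

Checking for None explicitly, instead of catching every exception, keeps unrelated errors visible while still supplying a default for missing fields.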
There are more elegant solutions out there. Have a look at the bs4 documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/
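One possibly tidier way to get at the address itself: since it is a bare text node that follows the specialisation span, you can walk the span's next_siblings and take the first non-blank string. The sketch below runs against a trimmed copy of the sample HTML from the question. first_text_after is a hypothetical helper name, and the assumption that the address is always the first bare text after that span may not hold for every card on the real page.

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed copy of the second table row from the question's sample document.
SAMPLE_HTML = """
<table class="novip">
  <tr class="novip">
    <td class="novip-left">
      <span class="novip-left-titel"> Prof. </span>
      <span class="novip-left-fachbezeichnung"> Professor for History </span>
      <br/>
      Rose Avenue 33, 4302843 A City
      <br/>
      Tel: <a>234 23 43244</a>
    </td>
  </tr>
</table>
"""


def first_text_after(tag):
    """Return the first non-blank bare text node following `tag`, or None.

    Hypothetical helper (not a bs4 function).
    """
    for sibling in tag.next_siblings:
        # Skip tags such as <br/>; only bare strings can hold the address.
        if isinstance(sibling, NavigableString) and sibling.strip():
            # Collapse line breaks and runs of spaces into single spaces.
            return " ".join(sibling.split())
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
span = soup.select_one("span.novip-left-fachbezeichnung")
address = first_text_after(span)
print(address)
# Rose Avenue 33, 4302843 A City
```

Iterating over next_siblings avoids hard-coding how many whitespace nodes and br tags sit between the span and the address.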