I have an HTML document from which I want to extract the address, but I'm unable to. Here is the HTML document. The address is plain text that is not enclosed in its own tag, and as a beginner I can't reach it with find() or similar.
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
<table class="novip">
    <tr class="novip">
        <td class="novip-portrait-picture"
            rowspan="5">
            <a class="novip" href="refer.html">URL</a>
        </td>
        <td class="novip-left">
            <a class="novip-firmen-name"
               href="refer.html"
               target="_top">
                John Doe
            </a>
        </td>
        <td class="novip-right"
            rowspan="2">
            <a class="novip" href="refer.html">URL</a>
        </td>
    </tr>
    <tr class="novip">
        <td class="novip-left">
            <span class="novip-left-titel">
                Prof.
            </span>
            <span class="novip-left-fachbezeichnung">
                Professor for History
            </span>
            <br/>
            Rose Avenue 33, 4302843 A City
            <br/>
            Tel: <a>234 23 43244</a>

            <a class="novip-left-make_appointment-button-active">Booking</a>

        </td>
    </tr>

</table>

</body>
</html>
I would like to extract the address Rose Avenue 33, 4302843 A City.
Here is my attempt, but I cannot narrow it down enough.
import requests
from bs4 import BeautifulSoup

# url is assumed to hold the address of the page to scrape
r = requests.get(url)
r.encoding = 'utf8'
html_doc = r.text
soup = BeautifulSoup(html_doc, features='html5lib')
table = []

tables = soup.find_all("table", {"class": "novip"})

for table in tables:
    rows = table.findChildren('tr')

    # This returns the text of the specialisation span, not the address after it
    address = rows[1].find('span', 'novip-left-fachbezeichnung').text
Answer
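The following code approximates your attempt. The address is a text node that follows the span with class novip-left-fachbezeichnung, so it can be reached through next_sibling. The code is based on bs4 (BeautifulSoup), pandas and requests: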
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.doktor.ch/gynaekologen/gynaekologen_k_lu.html'

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
dr_list = []
# Each doctor is rendered as one <table class="novip">
doctor_cards = soup.select('table.novip')
for card in doctor_cards:
    try:
        dr_name = card.select_one('a.novip-firmen-name').text.strip()
    except Exception:
        dr_name = 'No Name'
    try:
        dr_url = card.select_one('a').get('href')
    except Exception:
        dr_url = 'No Url'
    try:
        dr_title = card.select_one('span.novip-left-titel').text.strip()
    except Exception:
        dr_title = 'No title'
    try:
        dr_specialisation = card.select_one('span.novip-left-fachbezeichnung').text.strip()
    except Exception:
        dr_specialisation = 'No specialisation'
    try:
        # Some cards carry an extra address line in its own span
        dr_address_span = card.select_one('span.novip-left-adresszusatz')
        dr_address = dr_address_span.text.strip() + ' ' + dr_address_span.next_sibling.strip()
    except Exception:
        dr_address_span = 'No address'
        # Otherwise the address is a bare text node one to three siblings after the specialisation span
        if len(card.select_one('span.novip-left-fachbezeichnung').next_sibling.strip()) > 5:
            dr_address = card.select_one('span.novip-left-fachbezeichnung').next_sibling.strip().replace('\n', ' ')
        elif len(card.select_one('span.novip-left-fachbezeichnung').next_sibling.next_sibling) > 5:
            dr_address = card.select_one('span.novip-left-fachbezeichnung').next_sibling.next_sibling.text.strip().replace('\n', ' ')
        else:
            dr_address = card.select_one('span.novip-left-fachbezeichnung').next_sibling.next_sibling.next_sibling.strip().replace('\n', ' ')

    dr_list.append((dr_name, dr_title, dr_specialisation, dr_address))
df = pd.DataFrame(dr_list, columns = ['Name', 'Title', 'Spec', 'Address'])
df.to_csv('swiss_docs.csv')
print(df.head())
This will save a CSV file with the doctors' details, looking like this:
   Name              Title     Spec                                            Address
0  Wey Barbara       Dr. med.  Fachärztin FMH für Gynäkologie u. Geburtshilfe  Hauptstrasse 12, 6033 Buchrain Tel: 041 444 30 80 Terminanfrage Karte
1  Bohl Urs          Dr. med.  Facharzt FMH für Gynäkologie und Geburtshilfe   Seetalstrasse 11, 6020 Emmenbrücke
2  Füchsel Glenn     Dr. med.  Facharzt für Gynäkologie und Geburtshilfe       docstation Gesundheitszentrum Emmen Mooshüslistrasse 6, 6032 Emmen
3  Dal Pian Désirée  Dr. med.  Fachärztin FMH für Gynäkologie u. Geburtshilfe  Frauenpraxis Zero Plus Am Mattenhof 4a, 6010 Kriens
4  Gilke Ursula      Dr. med.  Fachärztin für Gynäkologie u. Geburtshilfe      Schachenstrasse 5, 6010 Kriens
5  Amann Stefanie    Dr. med.  Fachärztin FMH Gynäkologie u. Geburtshilfe      Frauenpraxis am See Alpenstrasse 1, 6004 Luzern
6  Ballabio Nadja    Dr. med.  Fachärztin FMH Gynäkologie und Geburtshilfe     gyn-zentrum ag Haldenstrasse 11, 6006 Luzern
[ ]
There are better, more elegant solutions out there. Have a look at the bs4 documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/
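As one possible simplification, here is a minimal sketch that runs against a trimmed copy of the HTML from the question instead of the live site. It walks the siblings that follow the specialisation span and returns the first non-empty bare text node. The helper name extract_address and the SAMPLE_HTML constant are made up for this illustration, and the sketch assumes the address is always the first plain-text sibling after that span, which may not hold for every card on the real page.

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed copy of the markup from the question (an assumption: real pages may differ).
SAMPLE_HTML = """
<table class="novip">
  <tr class="novip">
    <td class="novip-left">
      <a class="novip-firmen-name" href="refer.html" target="_top">John Doe</a>
    </td>
  </tr>
  <tr class="novip">
    <td class="novip-left">
      <span class="novip-left-titel">Prof.</span>
      <span class="novip-left-fachbezeichnung">Professor for History</span>
      <br/>
      Rose Avenue 33, 4302843 A City
      <br/>
      Tel: <a>234 23 43244</a>
      <a class="novip-left-make_appointment-button-active">Booking</a>
    </td>
  </tr>
</table>
"""


def extract_address(card):
    """Hypothetical helper: first non-empty text node after the specialisation span."""
    span = card.select_one("span.novip-left-fachbezeichnung")
    if span is None:
        return None
    for sibling in span.next_siblings:
        # Tags such as <br/> are skipped; only plain strings are considered.
        if isinstance(sibling, NavigableString):
            text = " ".join(sibling.split())  # collapse newlines and indentation
            if text:
                return text
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
addresses = [extract_address(card) for card in soup.select("table.novip")]
print(addresses)  # ['Rose Avenue 33, 4302843 A City']
```

Iterating over next_siblings avoids counting how many next_sibling hops are needed, since whitespace strings and br tags are simply skipped.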