I have a tag that looks like this:

    <div class="small text-gray mb-2">
      <div>
        Pierre M
        <!-- -->
        ,
        <!-- -->
        08/18/2018
        <!-- -->
        <div class="d-inline-block px-0_25 text-white bg-primary-darker rounded">
          updated
          <!-- -->
          03/11/2021
        </div>
      </div>
      <div>Long Range 4dr Sedan (electric DD)</div>
    </div>

I would like to get only the reviewer's name ("Pierre M") and the date ("08/18/2018").
I tried this code:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup()
    data = []

    for e in content_list:
        data.append({
            'reviewer-name': e.select_one('div').text,
            'reviewe-date': e.select_one('div').text,
        })

But it takes all of the text inside that tag, so I get:

    'reviewe-date': 'John Schreiber, 10/06/2018 updated 10/08/2019Long Range 4dr Sedan (electric DD)',
    'reviewer-name': 'John Schreiber, 10/06/2018 updated 10/08/2019Long Range 4dr Sedan (electric DD)'

Answer
You could use find_all(text=True, recursive=False) on the inner <div> to get only its direct text nodes, which in your specific case skips the nested "updated" <div>:

    for e in soup.select('div.small'):
        data.append({
            'reviewer-name': ''.join(e.div.find_all(text=True, recursive=False)).split(',')[0].strip(),
            'reviewe-date': ''.join(e.div.find_all(text=True, recursive=False)).split(',')[-1].strip(),
        })

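To make the idea above easier to try out, here is a minimal, self-contained sketch using the markup from the question. The `html` variable and the `direct_text` helper are illustrative names, not part of the original answer. It also assumes the built-in `'html.parser'` and uses `string=True`, which is the newer name (Beautiful Soup 4.4.0 and later) for the `text=True` argument.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question, condensed onto fewer lines.
html = '''
<div class="small text-gray mb-2">
  <div>Pierre M<!-- -->, <!-- -->08/18/2018<!-- -->
    <div class="d-inline-block px-0_25 text-white bg-primary-darker rounded">updated<!-- -->03/11/2021</div>
  </div>
  <div>Long Range 4dr Sedan (electric DD)</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')


def direct_text(tag):
    """Hypothetical helper: join only the strings that are direct children of `tag`.

    recursive=False keeps the search from descending into the nested
    "updated" <div>, so its text is left out.
    """
    return ''.join(tag.find_all(string=True, recursive=False))


data = []
for e in soup.select('div.small'):
    # Everything before the first comma is the name, everything after is the date.
    name, _, date = direct_text(e.div).partition(',')
    data.append({
        'reviewer-name': name.strip(),
        'reviewe-date': date.strip(),
    })

print(data)
```

Calling the helper once and splitting with `partition` avoids running the same `find_all` twice per review, which is the only real difference from the snippet above.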
An alternative is to check for the child <div> that contains "updated", save its text if needed, and decompose() it from the DOM (the walrus operator requires Python 3.8 or later; on older versions use a standard if statement):

    for e in soup.select('div.small'):
        if (u := e.select_one('div.rounded')):
            updated = u.text.split('updated')[-1].strip()
            u.decompose()
        else:
            updated = None
        data.append({
            'reviewer-name': e.div.text.split(',')[0].strip(),
            'reviewe-date': e.div.text.split(',')[-1].strip(),
            'reviewe-updated': updated
        })

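For Python versions before 3.8, the same logic can be sketched with a plain assignment followed by an if statement. This is only one possible spelling; the second review in the sample markup (reviewer "Jane D") is made up here to exercise the branch where no "updated" badge exists, and `'html.parser'` is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# First review is from the question; the second one is a made-up sample
# without an "updated" badge, added only to show the else branch.
html = '''
<div class="small text-gray mb-2">
  <div>Pierre M<!-- -->, <!-- -->08/18/2018<!-- -->
    <div class="d-inline-block px-0_25 text-white bg-primary-darker rounded">updated<!-- -->03/11/2021</div>
  </div>
  <div>Long Range 4dr Sedan (electric DD)</div>
</div>
<div class="small text-gray mb-2">
  <div>Jane D<!-- -->, <!-- -->01/02/2020</div>
  <div>Long Range 4dr Sedan (electric DD)</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
data = []

for e in soup.select('div.small'):
    # Plain assignment plus "is not None" check instead of the walrus operator.
    u = e.select_one('div.rounded')
    if u is not None:
        updated = u.text.split('updated')[-1].strip()
        u.decompose()  # remove the badge so it no longer appears in e.div.text
    else:
        updated = None

    parts = e.div.text.split(',')
    data.append({
        'reviewer-name': parts[0].strip(),
        'reviewe-date': parts[-1].strip(),
        'reviewe-updated': updated,
    })

print(data)
```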
Example

    from bs4 import BeautifulSoup

    html = '''
    <div class="small text-gray mb-2">
      <div>
        Pierre M
        <!-- -->
        ,
        <!-- -->
        08/18/2018
        <!-- -->
        <div class="d-inline-block px-0_25 text-white bg-primary-darker rounded">
          updated
          <!-- -->
          03/11/2021
        </div>
      </div>
      <div>Long Range 4dr Sedan (electric DD)</div>
    </div>
    '''

    # Name the parser explicitly to avoid the "no parser specified" warning.
    soup = BeautifulSoup(html, 'html.parser')
    data = []

    for e in soup.select('div.small'):
        # If an "updated" badge exists, keep its date and remove it from the tree.
        if (u := e.select_one('div.rounded')):
            updated = u.text.split('updated')[-1].strip()
            u.decompose()
        else:
            updated = None
        data.append({
            'reviewer-name': e.div.text.split(',')[0].strip(),
            'reviewe-date': e.div.text.split(',')[-1].strip(),
            'reviewe-updated': updated
        })

    print(data)

Output

    [{'reviewer-name': 'Pierre M', 'reviewe-date': '08/18/2018', 'reviewe-updated': '03/11/2021'}]