I have a tag that looks like this:
<div class="small text-gray mb-2">
  <div>
    Pierre M <!-- --> , <!-- --> 08/18/2018 <!-- -->
    <div class="d-inline-block px-0_25 text-white bg-primary-darker rounded">
      updated <!-- --> 03/11/2021
    </div>
  </div>
  <div>Long Range 4dr Sedan (electric DD)</div>
</div>
I would like to get only the reviewer's name (“Pierre M”) and the date (“08/18/2018”).
I tried this code:
from bs4 import BeautifulSoup

soup = BeautifulSoup()
data = []
for e in content_list:
    data.append({
        'reviewer-name': e.select_one('div').text,
        'reviewe-date': e.select_one('div').text,
    })
But this takes all of the text inside that tag, so I get:
'reviewe-date': 'John Schreiber, 10/06/2018 updated 10/08/2019Long Range 4dr Sedan (electric DD)', 'reviewer-name': 'John Schreiber, 10/06/2018 updated 10/08/2019Long Range 4dr Sedan (electric DD)'
Answer
You could use find_all(text=True, recursive=False) to get only the text that sits directly inside the first child <div>, which in your specific case is the first section of text:
for e in soup.select('div.small'):
    data.append({
        'reviewer-name': ''.join(e.div.find_all(text=True, recursive=False)).split(',')[0].strip(),
        'reviewe-date': ''.join(e.div.find_all(text=True, recursive=False)).split(',')[-1].strip(),
    })
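To try this first approach on its own, here is a minimal self-contained sketch. It makes a few assumptions that are not in the snippet above: the helper name direct_text is made up for illustration, string=True is used as the newer spelling of text=True in recent Beautiful Soup versions, the built-in 'html.parser' is chosen as the parser, and the comment nodes are filtered out explicitly (in this markup they only contain whitespace, so the result is the same either way).

```python
from bs4 import BeautifulSoup, Comment

html = '''
<div class="small text-gray mb-2">
  <div>
    Pierre M <!-- --> , <!-- --> 08/18/2018 <!-- -->
    <div class="d-inline-block px-0_25 text-white bg-primary-darker rounded">
      updated <!-- --> 03/11/2021
    </div>
  </div>
  <div>Long Range 4dr Sedan (electric DD)</div>
</div>
'''


def direct_text(tag):
    """Join the text nodes that are direct children of `tag` (hypothetical helper)."""
    parts = tag.find_all(string=True, recursive=False)
    # Comment is a text-node subclass, so drop it explicitly.
    return ''.join(p for p in parts if not isinstance(p, Comment))


soup = BeautifulSoup(html, 'html.parser')
data = []
for e in soup.select('div.small'):
    # Text of the nested "updated" <div> is excluded because recursive=False.
    name, _, date = direct_text(e.div).partition(',')
    data.append({
        'reviewer-name': name.strip(),
        'reviewe-date': date.strip(),
    })

print(data)
```

Calling the helper once and splitting with partition(',') avoids building the joined string twice per review.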
An alternative would be to check for the child <div> that contains "updated", save its text if needed, and decompose() it from the DOM (the walrus operator needs Python 3.8 or later; otherwise use a standard if statement):
for e in soup.select('div.small'):
    if (u := e.select_one('div.rounded')):
        updated = u.text.split('updated')[-1].strip()
        u.decompose()
    else:
        updated = None
    data.append({
        'reviewer-name': e.div.text.split(',')[0].strip(),
        'reviewe-date': e.div.text.split(',')[-1].strip(),
        'reviewe-updated': updated
    })
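On a Python version older than 3.8, the same idea can be written without the walrus operator. This is only a sketch under a few assumptions: it uses the built-in 'html.parser', and the second review ("Jane D") is made up here purely to show what happens when there is no "updated" badge.

```python
from bs4 import BeautifulSoup

html = '''
<div class="small text-gray mb-2">
  <div>
    Pierre M <!-- --> , <!-- --> 08/18/2018 <!-- -->
    <div class="d-inline-block px-0_25 text-white bg-primary-darker rounded">
      updated <!-- --> 03/11/2021
    </div>
  </div>
  <div>Long Range 4dr Sedan (electric DD)</div>
</div>
<!-- hypothetical review without an "updated" badge -->
<div class="small text-gray mb-2">
  <div> Jane D <!-- --> , <!-- --> 01/02/2020 </div>
  <div>Long Range 4dr Sedan (electric DD)</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
data = []
for e in soup.select('div.small'):
    # Plain assignment plus an if statement instead of `if (u := ...)`.
    u = e.select_one('div.rounded')
    if u is not None:
        updated = u.get_text().split('updated')[-1].strip()
        u.decompose()  # remove the badge so its text no longer shows up in e.div
    else:
        updated = None
    name, _, date = e.div.get_text().partition(',')
    data.append({
        'reviewer-name': name.strip(),
        'reviewe-date': date.strip(),
        'reviewe-updated': updated,
    })

print(data)
```

The explicit "is not None" check is used because select_one returns None when nothing matches.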
Example
from bs4 import BeautifulSoup

html = '''
<div class="small text-gray mb-2">
  <div>
    Pierre M <!-- --> , <!-- --> 08/18/2018 <!-- -->
    <div class="d-inline-block px-0_25 text-white bg-primary-darker rounded">
      updated <!-- --> 03/11/2021
    </div>
  </div>
  <div>Long Range 4dr Sedan (electric DD)</div>
</div>
'''

# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')
data = []
for e in soup.select('div.small'):
    # Save the "updated" date, then remove that <div> from the tree
    if (u := e.select_one('div.rounded')):
        updated = u.text.split('updated')[-1].strip()
        u.decompose()
    else:
        updated = None
    data.append({
        'reviewer-name': e.div.text.split(',')[0].strip(),
        'reviewe-date': e.div.text.split(',')[-1].strip(),
        'reviewe-updated': updated
    })

print(data)
Output
[{'reviewer-name': 'Pierre M', 'reviewe-date': '08/18/2018', 'reviewe-updated': '03/11/2021'}]