
Using decompose() to remove an empty tag

I am trying to search for emails in HTML elements. If no email is found in one element, I want the code to search another element, and if nothing is found there either, to set the email to "N/A".

I am new to writing code and am doing this as a training exercise for a project.

Here is the HTML that I am trying to break down and extract the emails from:

<div class="Profile-sidebar">
   <div class="Profile-header">
      <div class="Profile-userDetails">
         <p class="Profile-line"><a class="Profile"> Search Location No.1</a></p>
      </div>
   </div>
   <div class="UserInfo" style="">
      <div class="UserInfo">
         <div class="UserInfo-Header">
            <h5 class="UserInfo-Title">About</h5>
         </div>
         <div class="UserInfo-column">
            <p class="UserInfo-bioHeader">About</p>
            <div class="UserInfo"><span>Search Location No.2</span></div>
         </div>
      </div>
   </div>
</div>

Here is my Python code. I create an empty list and extract the text from the bio, then search each tag for emails; if a tag is empty, it is decomposed:

email_list = []
bio = soup.find('div', {'class': 'UserInfo'}).text
for my_tag in soup.find_all(class_="UserInfo"):
    EMAIL_REGEX = "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+"
    emails = re.findall(EMAIL_REGEX, my_tag.text)
    if not my_tag.text:  # if tag is empty
        my_tag.decompose()
    print(emails)

This is the output I get from print(emails) when no emails are found in the loop, and it is what I am trying to get rid of:

[]
[]
[]

My question:

The HTML I am parsing has similar class names on tags of the same type. How can I search in one element with a specific class and, if nothing is found, search in another element with a different class, and in the end get N/A instead of [] [] []?


Answer

Rather than going through the classes one by one, why not scan the whole HTML from top to bottom regardless of class and, whenever you find an email, store it in a dictionary keyed by the class of the element? You can then look up emails in the dictionary in whatever order of classes you want to check.

import re

# Escape the dot before the domain suffix so it matches a literal "."
EMAIL_REGEX = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"

def applyRegex(element):
    # Filter for find_all: keep only tags whose text contains an email
    return bool(re.search(EMAIL_REGEX, element.text))


final_dict = {}
# Note: element.text includes the text of nested tags, so the parents of a
# matching tag are returned as well
email_elements = soup.find_all(applyRegex)

for element in email_elements:
    emailsFound = re.findall(EMAIL_REGEX, element.text)
    if element.has_attr('class'):
        # element['class'] is a list, so join it into a hashable string key
        classname = ' '.join(element['class'])
        final_dict[classname] = emailsFound

if final_dict:
    # do whatever you want to do with the dictionary of <class>:<emails>
    print(final_dict)
else:
    print("N/A")
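
To check the classes in a preferred order, the lookup could be sketched roughly as follows. This assumes final_dict maps each class name to a list of the emails found under it; the helper name pick_email, the priority list and the sample address are hypothetical and only there for illustration.

```python
def pick_email(emails_by_class, priority, default="N/A"):
    """Return the first email stored under the highest-priority class.

    emails_by_class: dict mapping a class name to a list of emails (assumed shape)
    priority: class names in the order they should be checked
    """
    for classname in priority:
        emails = emails_by_class.get(classname)
        if emails:  # skips missing classes and empty lists
            return emails[0]
    return default


# Hypothetical contents of final_dict after scanning a page
final_dict = {"UserInfo": ["jane@example.com"]}

# Check the profile line first, then fall back to the bio
priority = ["Profile-line", "UserInfo"]

print(pick_email(final_dict, priority))  # jane@example.com
print(pick_email({}, priority))          # N/A
```

Because the fallback value is returned only after every class in the list has been tried, you get a single "N/A" instead of one empty list per tag.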
User contributions licensed under: CC BY-SA