Python

I have the following table:

<table width="100%" border="0" cellspacing="2" cellpadding="0">
                      <tbody><tr> 
                        <td class="labelplain">ANGARA, EDGARDO J.<br>ENRILE, JUAN PONCE<br>MAGSAYSAY JR., RAMON B.<br>ROXAS, MAR<br>GORDON, RICHARD "DICK" J.<br>FLAVIER, JUAN M.<br>MADRIGAL, M. A.<br>ARROYO, JOKER P.<br>RECTO, RALPH G.<br></td>
                      </tr>
                    </tbody></table>

I can navigate to this part of the HTML using the code below:

import re
# Find the label cell, then move to the next cell, which holds the names.
# The literal "(" must be escaped, otherwise re.compile() raises an error.
coauthor = soup.find('td', text=re.compile(r'Co-author\(s'), attrs={'class': 'labelplain'}).find_next('td')

I am able to get the text using the code below:

for br in coauthor.find_all('br'):
    # The text node directly before each <br> is one name
    firstcoauthor = br.previous_sibling
    print(firstcoauthor)

The output I want is all of the text joined into a single line, with the names separated by semicolons (;), like so: ANGARA, EDGARDO J.;ENRILE, JUAN PONCE;MAGSAYSAY JR., RAMON B.;ROXAS, MAR;GORDON, RICHARD "DICK" J.;FLAVIER, JUAN M.;MADRIGAL, M. A.;ARROYO, JOKER P.;RECTO, RALPH G.

But the code above gives me a result like below:

ANGARA, EDGARDO J.
ENRILE, JUAN PONCE
MAGSAYSAY JR., RAMON B.
ROXAS, MAR
GORDON, RICHARD "DICK" J.
FLAVIER, JUAN M.
MADRIGAL, M. A.
ARROYO, JOKER P.
RECTO, RALPH G.

I’ve tried the replace function but to no avail.

print(firstcoauthor.replace("\n", ";"))

and

print(firstcoauthor.replace("\r\n", ";"))

I even tried escaping \r\n and \n like so:

print(firstcoauthor.replace("\\n", ";"))

How do I address my use case?

Answer

I think it is much simpler to get that result by passing the separator argument to get_text():

soup.find('td').get_text(';')

Based on your example you will get:

ANGARA, EDGARDO J.;ENRILE, JUAN PONCE;MAGSAYSAY JR., RAMON B.;ROXAS, MAR;GORDON, RICHARD "DICK" J.;FLAVIER, JUAN M.;MADRIGAL, M. A.;ARROYO, JOKER P.;RECTO, RALPH G.
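For completeness, here is a minimal end-to-end sketch of that call, using the table from the question. It assumes the built-in html.parser backend, and it selects the cell by its class rather than through the "Co-author(s)" label, because that label cell is not part of the posted HTML. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# The table from the question, pasted in as a string
html = '''<table width="100%" border="0" cellspacing="2" cellpadding="0">
<tbody><tr>
<td class="labelplain">ANGARA, EDGARDO J.<br>ENRILE, JUAN PONCE<br>MAGSAYSAY JR., RAMON B.<br>ROXAS, MAR<br>GORDON, RICHARD "DICK" J.<br>FLAVIER, JUAN M.<br>MADRIGAL, M. A.<br>ARROYO, JOKER P.<br>RECTO, RALPH G.<br></td>
</tr>
</tbody></table>'''

soup = BeautifulSoup(html, 'html.parser')

# Assumption: the names live in the first cell with class "labelplain"
cell = soup.find('td', class_='labelplain')

# get_text() joins every text node inside the cell with the given separator
names = cell.get_text(';')
print(names)
```

The <br> tags contain no text of their own, so they act purely as boundaries between the text nodes that get_text() joins.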

EDIT

Based on the behaviour mentioned in your comment (extra semicolons), I suspect that the structure of the real element differs from the one in the question and contains extra breaks or whitespace.

In that case, I would change the strategy and recommend one of the following:

add the strip parameter to get_text():

soup.find('td').get_text(';', strip=True)

or call join() on stripped_strings, which does almost the same:

';'.join(soup.find('td').stripped_strings)
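To see why stripping matters, the sketch below runs all three variants on a small made-up fragment (not the real page) that has indentation and a doubled <br>. It assumes html.parser; the fragment and variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: indentation and an extra <br> create
# whitespace-only text nodes between the tags
html = '''<td class="labelplain">
    ANGARA, EDGARDO J.<br>ENRILE, JUAN PONCE<br>
    <br>
</td>'''

cell = BeautifulSoup(html, 'html.parser').find('td')

# Without strip, whitespace-only nodes are joined too,
# which produces stray semicolons and embedded newlines
plain = cell.get_text(';')

# With strip=True, each string is stripped and empty ones are skipped
stripped = cell.get_text(';', strip=True)

# stripped_strings yields the same stripped, non-empty strings
joined = ';'.join(cell.stripped_strings)

print(repr(plain))
print(stripped)
print(joined)
```

Printing repr(plain) makes the leftover whitespace visible, which is usually the quickest way to confirm that extra separators come from the markup and not from the names themselves.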

Example HTML

Extra <br> tags, spaces and line breaks have been added to the HTML.

html = '''
<table width="100%" border="0" cellspacing="2" cellpadding="0">
    <tbody><tr>
    
    <br>
           <td class="labelplain">
           ANGARA, EDGARDO J.<br>ENRILE, JUAN PONCE<br>MAGSAYSAY JR., RAMON B.<br>ROXAS, MAR<br>GORDON, RICHARD "DICK" J.<br>FLAVIER, JUAN M.<br>MADRIGAL, M. A.<br>ARROYO, JOKER P.<br>RECTO, RALPH G.<br> 
           
           <br>
           </td>
           </tr>
</tbody></table>'''

Output

ANGARA, EDGARDO J.;ENRILE, JUAN PONCE;MAGSAYSAY JR., RAMON B.;ROXAS, MAR;GORDON, RICHARD "DICK" J.;FLAVIER, JUAN M.;MADRIGAL, M. A.;ARROYO, JOKER P.;RECTO, RALPH G.
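As a side note on the loop from the question: the replace() attempts have nothing to replace, because each name is a separate string without a newline in it. The line breaks in the printed result come from calling print() once per name. If you prefer to keep the loop, a rough sketch is to collect the names in a list and join them once at the end. The shortened HTML and the names list below are illustrative only, and html.parser is assumed.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened, hypothetical version of the cell from the question
html = '<td class="labelplain">ANGARA, EDGARDO J.<br>ENRILE, JUAN PONCE<br>ROXAS, MAR<br></td>'
coauthor = BeautifulSoup(html, 'html.parser').find('td')

names = []
for br in coauthor.find_all('br'):
    previous = br.previous_sibling
    # Keep only text nodes that contain more than whitespace
    if isinstance(previous, NavigableString) and previous.strip():
        names.append(previous.strip())

# Join once, after the loop, instead of printing inside it
result = ';'.join(names)
print(result)
```

This does by hand what get_text(';', strip=True) already does, so the one-liner is usually the better choice.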

Python beautifulsoup – get all text separated by break tag
