
Combine two tables, one with header only, another with table values for bs4

I want to combine two <table> elements: one holds only the header, the other holds only the values. The first table has a populated <thead> and an empty <tbody>; the second has an empty <thead> and the actual rows in its <tbody>.

HTML code

html = """<div style="border: 1px solid #000;">
<div style="background-color:#005297;">
<table id="CCCCCT" class="BBBBBt" style="width: calc(100% - 16px)">
<thead>
<tr>
<td><span class="AAAAA">DD </span> EE</td><td>FF</td><td>GG</td><td>HH</td><td>II</td>
</tr>
</thead>
<tbody></tbody>
</table>
</div>
<table id="CCCCC" class="BBBBB">
<thead>
<tr>
<td></td><td></td><td></td><td></td><td></td>
</tr>
</thead>
<tbody>
<tr class="JJJJJ&quot;"><td><div>1111111</div></td><td>M</td><td>4444444</td><td><div>77777<i 
class="PPPPPP"></i> 10101010101</div></td><td><span class="">13131313131aa</span></td></tr>
<tr class="KKKKK"><td><div>2222222</div></td><td>N</td><td>5555555</td><td><div>88888<i 
class="PPPPPP"></i> 1111111111</div></td><td><span class="QQQQQ">1414141414141aa</span></td> 
</tr>
<tr class="LLLLL"><td><div>3333333</div></td><td>O</td><td>6666666</td><td><div>999999<i 
class="PPPPPP"></i> 1212121212121</div></td><td><span class="">15151515151aa</span></td></tr>
</tbody>
</table>
</div>"""

Python Code

from bs4 import BeautifulSoup
import pandas as pd
import re

soup = BeautifulSoup(html,'html.parser')

table = soup.find('div', attrs={'style':re.compile("^border:.*$")})
df_list = pd.read_html(str(table))
df_list

Execution Result

[Empty DataFrame
 Columns: [DD EE, FF, GG, HH, II]
 Index: [],
 Unnamed: 0 Unnamed: 1  Unnamed: 2            Unnamed: 3       Unnamed: 4
 0     1111111          M     4444444     77777 10101010101    13131313131aa
 1     2222222          N     5555555      88888 1111111111  1414141414141aa
 2     3333333          O     6666666  999999 1212121212121    15151515151aa]

Expected Result (5 columns)

         DD EE        FF        GG               HH              II
 0     1111111          M     4444444     77777 10101010101    13131313131aa
 1     2222222          N     5555555      88888 1111111111  1414141414141aa
 2     3333333          O     6666666  999999 1212121212121    15151515151aa


Answer

import pandas as pd

# pd.read_html can read a URL directly, since fetching the page is already implemented under the hood
df = pd.read_html("URL DIRECTLY")
df[1].columns = df[0].columns
print(df[1])

Output:

     DD EE FF       GG                    HH               II
0  1111111  M  4444444     77777 10101010101    13131313131aa
1  2222222  N  5555555      88888 1111111111  1414141414141aa
2  3333333  O  6666666  999999 1212121212121    15151515151aa

Or applying to your example directly:

import pandas as pd

html = """<div style="border: 1px solid #000;">
    <div style="background-color:#005297;">
        <table id="CCCCCT" class="BBBBBt" style="width: calc(100% - 16px)">
            <thead>
                <tr>
                    <td><span class="AAAAA">DD </span> EE</td>
                    <td>FF</td>
                    <td>GG</td>
                    <td>HH</td>
                    <td>II</td>
                </tr>
            </thead>
            <tbody></tbody>
        </table>
    </div>
    <table id="CCCCC" class="BBBBB">
        <thead>
            <tr>
                <td></td>
                <td></td>
                <td></td>
                <td></td>
                <td></td>
            </tr>
        </thead>
        <tbody>
            <tr class="JJJJJ&quot;">
                <td>
                    <div>1111111</div>
                </td>
                <td>M</td>
                <td>4444444</td>
                <td>
                    <div>77777<i class="PPPPPP"></i> 10101010101</div>
                </td>
                <td><span class="">13131313131aa</span></td>
            </tr>
            <tr class="KKKKK">
                <td>
                    <div>2222222</div>
                </td>
                <td>N</td>
                <td>5555555</td>
                <td>
                    <div>88888<i class="PPPPPP"></i> 1111111111</div>
                </td>
                <td><span class="QQQQQ">1414141414141aa</span></td>
            </tr>
            <tr class="LLLLL">
                <td>
                    <div>3333333</div>
                </td>
                <td>O</td>
                <td>6666666</td>
                <td>
                    <div>999999<i class="PPPPPP"></i> 1212121212121</div>
                </td>
                <td><span class="">15151515151aa</span></td>
            </tr>
        </tbody>
    </table>
</div>"""


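# Newer pandas versions expect a file-like object rather than a literal HTML string.
from io import StringIO

df = pd.read_html(StringIO(html))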
df[1].columns = df[0].columns
print(df[1])

This produces the same output.
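
If you would rather stay with BeautifulSoup, as in the question, a rough sketch is to read the column names from the header-only table and the rows from the body-only table, then build the DataFrame yourself. The HTML here is a trimmed copy of the question's markup, and cell_text is a hypothetical helper introduced only for this illustration. Note that this route keeps every value as a string, whereas pd.read_html infers numeric types.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Trimmed copy of the question's markup: one header-only table, one body-only table.
html = """<div>
<table id="CCCCCT"><thead><tr>
<td><span class="AAAAA">DD </span> EE</td><td>FF</td><td>GG</td><td>HH</td><td>II</td>
</tr></thead><tbody></tbody></table>
<table id="CCCCC">
<thead><tr><td></td><td></td><td></td><td></td><td></td></tr></thead>
<tbody>
<tr><td><div>1111111</div></td><td>M</td><td>4444444</td>
<td><div>77777<i class="PPPPPP"></i> 10101010101</div></td>
<td><span class="">13131313131aa</span></td></tr>
<tr><td><div>2222222</div></td><td>N</td><td>5555555</td>
<td><div>88888<i class="PPPPPP"></i> 1111111111</div></td>
<td><span class="QQQQQ">1414141414141aa</span></td></tr>
<tr><td><div>3333333</div></td><td>O</td><td>6666666</td>
<td><div>999999<i class="PPPPPP"></i> 1212121212121</div></td>
<td><span class="">15151515151aa</span></td></tr>
</tbody></table>
</div>"""


def cell_text(cell):
    """Hypothetical helper: the cell's text with whitespace runs collapsed to one space."""
    return " ".join(cell.get_text().split())


soup = BeautifulSoup(html, "html.parser")
header_table = soup.find("table", id="CCCCCT")  # header only
body_table = soup.find("table", id="CCCCC")     # values only

# Column names come from the first table's <thead>.
columns = [cell_text(td) for td in header_table.thead.find_all("td")]

# Data rows come from the second table's <tbody>.
rows = [
    [cell_text(td) for td in tr.find_all("td")]
    for tr in body_table.tbody.find_all("tr")
]

combined = pd.DataFrame(rows, columns=columns)
print(combined)
```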

You can also pass the attrs argument to pd.read_html to select specific tables, according to your needs.
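
As a minimal sketch of the attrs idea, assuming the table ids from the question (CCCCCT for the header, CCCCC for the values), each table can be selected by id instead of by its position in the returned list. The HTML is again a trimmed copy of the question's markup, and pd.read_html needs lxml or bs4 with html5lib installed to parse it.

```python
from io import StringIO

import pandas as pd

# Trimmed copy of the question's markup: one header-only table, one body-only table.
html = """<div>
<table id="CCCCCT"><thead><tr>
<td><span class="AAAAA">DD </span> EE</td><td>FF</td><td>GG</td><td>HH</td><td>II</td>
</tr></thead><tbody></tbody></table>
<table id="CCCCC">
<thead><tr><td></td><td></td><td></td><td></td><td></td></tr></thead>
<tbody>
<tr><td><div>1111111</div></td><td>M</td><td>4444444</td>
<td><div>77777<i class="PPPPPP"></i> 10101010101</div></td>
<td><span class="">13131313131aa</span></td></tr>
<tr><td><div>2222222</div></td><td>N</td><td>5555555</td>
<td><div>88888<i class="PPPPPP"></i> 1111111111</div></td>
<td><span class="QQQQQ">1414141414141aa</span></td></tr>
<tr><td><div>3333333</div></td><td>O</td><td>6666666</td>
<td><div>999999<i class="PPPPPP"></i> 1212121212121</div></td>
<td><span class="">15151515151aa</span></td></tr>
</tbody></table>
</div>"""

# attrs matches the id exactly, so "CCCCC" does not also pick up "CCCCCT".
header_df = pd.read_html(StringIO(html), attrs={"id": "CCCCCT"})[0]
body_df = pd.read_html(StringIO(html), attrs={"id": "CCCCC"})[0]

# Reuse the header table's column names on the table that holds the values.
body_df.columns = header_df.columns
print(body_df)
```

Selecting by id keeps the code working if the page later gains extra tables that would shift the positions in the list.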
