I want to combine two <table> elements: one holds the header and the other holds the table values. The first table consists of a <table> with a populated <thead> and an empty <tbody> (header information only). The second table consists of a <table> with no values in its <thead> and a populated <tbody> (table values only).
HTML code
html = """<div style="border: 1px solid #000;"> <div style="background-color:#005297;"> <table id="CCCCCT" class="BBBBBt" style="width: calc(100% - 16px)"> <thead> <tr> <td><span class="AAAAA">DD </span> EE</td><td>FF</td><td>GG</td><td>HH</td><td>II</td> </tr> </thead> <tbody></tbody> </table> </div> <table id="CCCCC" class="BBBBB"> <thead> <tr> <td></td><td></td><td></td><td></td><td></td> </tr> </thead> <tbody> <tr class="JJJJJ""><td><div>1111111</div></td><td>M</td><td>4444444</td><td><div>77777<i class="PPPPPP"></i> 10101010101</div></td><td><span class="">13131313131aa</span></td></tr> <tr class="KKKKK"><td><div>2222222</div></td><td>N</td><td>5555555</td><td><div>88888<i class="PPPPPP"></i> 1111111111</div></td><td><span class="QQQQQ">1414141414141aa</span></td> </tr> <tr class="LLLLL"><td><div>3333333</div></td><td>O</td><td>6666666</td><td><div>999999<i class="PPPPPP"></i> 1212121212121</div></td><td><span class="">15151515151aa</span></td></tr> </tbody> </table> </div>"""
Python Code
from bs4 import BeautifulSoup
import pandas as pd
import re

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('div', attrs={'style': re.compile("^border:.*$")})
df_list = pd.read_html(str(table))
df_list
Execution Result
[Empty DataFrame
Columns: [DD EE, FF, GG, HH, II]
Index: [],
   Unnamed: 0  Unnamed: 1  Unnamed: 2            Unnamed: 3       Unnamed: 4
0     1111111           M     4444444     77777 10101010101    13131313131aa
1     2222222           N     5555555      88888 1111111111  1414141414141aa
2     3333333           O     6666666  999999 1212121212121    15151515151aa]
Expected Result (5 columns)
     DD EE  FF       GG                    HH               II
0  1111111   M  4444444     77777 10101010101    13131313131aa
1  2222222   N  5555555      88888 1111111111  1414141414141aa
2  3333333   O  6666666  999999 1212121212121    15151515151aa
Answer
import pandas as pd

# pd.read_html can read a URL directly; fetching the page is handled under the hood
df = pd.read_html("URL DIRECTLY")

# Reuse the header-only table's column names for the data table
df[1].columns = df[0].columns
print(df[1])
Output:
     DD EE  FF       GG                    HH               II
0  1111111   M  4444444     77777 10101010101    13131313131aa
1  2222222   N  5555555      88888 1111111111  1414141414141aa
2  3333333   O  6666666  999999 1212121212121    15151515151aa
Or applying to your example directly:
from io import StringIO

import pandas as pd

html = """<div style="border: 1px solid #000;">
    <div style="background-color:#005297;">
        <table id="CCCCCT" class="BBBBBt" style="width: calc(100% - 16px)">
            <thead>
                <tr>
                    <td><span class="AAAAA">DD </span> EE</td>
                    <td>FF</td>
                    <td>GG</td>
                    <td>HH</td>
                    <td>II</td>
                </tr>
            </thead>
            <tbody></tbody>
        </table>
    </div>
    <table id="CCCCC" class="BBBBB">
        <thead>
            <tr>
                <td></td>
                <td></td>
                <td></td>
                <td></td>
                <td></td>
            </tr>
        </thead>
        <tbody>
            <tr class="JJJJJ"">
                <td> <div>1111111</div> </td>
                <td>M</td>
                <td>4444444</td>
                <td> <div>77777<i class="PPPPPP"></i> 10101010101</div> </td>
                <td><span class="">13131313131aa</span></td>
            </tr>
            <tr class="KKKKK">
                <td> <div>2222222</div> </td>
                <td>N</td>
                <td>5555555</td>
                <td> <div>88888<i class="PPPPPP"></i> 1111111111</div> </td>
                <td><span class="QQQQQ">1414141414141aa</span></td>
            </tr>
            <tr class="LLLLL">
                <td> <div>3333333</div> </td>
                <td>O</td>
                <td>6666666</td>
                <td> <div>999999<i class="PPPPPP"></i> 1212121212121</div> </td>
                <td><span class="">15151515151aa</span></td>
            </tr>
        </tbody>
    </table>"""

# pandas 2.1+ deprecates passing a literal HTML string to read_html;
# wrapping it in StringIO works on both older and newer versions
df = pd.read_html(StringIO(html))

# Reuse the header-only table's column names for the data table
df[1].columns = df[0].columns
print(df[1])
This produces the same output.
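Note that assigning one frame's columns to another only works when both tables have the same number of columns; otherwise pandas raises a length-mismatch error. A minimal sketch of a guarded version is below. The helper name apply_header and the small sample frames are hypothetical, made up for illustration, and are not part of pandas.

```python
import pandas as pd


def apply_header(header_df, body_df):
    """Return a copy of body_df labelled with header_df's column names.

    Hypothetical helper for illustration; not a pandas API.
    """
    if len(header_df.columns) != len(body_df.columns):
        raise ValueError(
            f"column count mismatch: header has {len(header_df.columns)}, "
            f"body has {len(body_df.columns)}"
        )
    out = body_df.copy()  # leave the original frame untouched
    out.columns = list(header_df.columns)
    return out


# Stand-ins for df[0] (header only) and df[1] (values only)
header_df = pd.DataFrame(columns=["DD EE", "FF", "GG"])
body_df = pd.DataFrame(
    [[1111111, "M", 4444444], [2222222, "N", 5555555]],
    columns=["Unnamed: 0", "Unnamed: 1", "Unnamed: 2"],
)

combined = apply_header(header_df, body_df)
print(combined)
```

Working on a copy keeps the original parsed table available in case the header turns out to belong to a different table.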
Feel free to pass the attrs argument to pd.read_html to select specific tables, according to your needs.
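As a rough illustration of the attrs idea, the sketch below picks each table by its id instead of relying on its position in the returned list. It assumes a trimmed-down copy of the question's HTML (three columns, two rows) and that an HTML parser such as lxml is installed; it is one possible approach, not the only one.

```python
from io import StringIO

import pandas as pd

# Trimmed-down version of the HTML from the question (assumption: 3 columns, 2 rows)
html = """
<div>
  <table id="CCCCCT" class="BBBBBt">
    <thead>
      <tr><td><span class="AAAAA">DD </span> EE</td><td>FF</td><td>GG</td></tr>
    </thead>
    <tbody></tbody>
  </table>
  <table id="CCCCC" class="BBBBB">
    <thead>
      <tr><td></td><td></td><td></td></tr>
    </thead>
    <tbody>
      <tr><td><div>1111111</div></td><td>M</td><td>4444444</td></tr>
      <tr><td><div>2222222</div></td><td>N</td><td>5555555</td></tr>
    </tbody>
  </table>
</div>
"""

# attrs restricts the search to <table> elements with matching attributes
header = pd.read_html(StringIO(html), attrs={"id": "CCCCCT"})[0]
body = pd.read_html(StringIO(html), attrs={"id": "CCCCC"})[0]

# Label the data table with the header table's column names
body.columns = header.columns
print(body)
```

Selecting by id avoids depending on the order of tables in the page, which can change if the site adds another table.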