
BeautifulSoup: extracting a list of td values from a table

I’m stuck on a BeautifulSoup problem that I think is simple but can’t seem to solve. I want to loop over the rows of the following table and extract each td into a list:

<table class="tabla-clasificacion-home marratua tablageneral tabla-actas">
<thead>
<tr>
<th scope="col">Team</th>
<th scope="col">Name</th>
<th scope="col">Number</th>
<th scope="col">Tipo</th>
<th scope="col">Motivo</th>
<th scope="col">Minute</th>
<th scope="col">Bloque</th>
</tr>
</thead>
<tbody>
<tr>
<td>Barcelona</td>
<td>Player 1</td>
<td>16</td>
<td>Tarjeta Amarilla</td>
<td>Derribar a un contrario en la disputa del balón</td>
<td>88</td>
<td>Segundo tiempo</td>
</tr> <tr>
<td>Real Madrid</td>
<td>Player 2</td>
<td>8</td>
<td>Tarjeta Amarilla</td>
<td>Sujetar a un adversario impidiendo su avance.</td>
<td>12</td>
<td>Primer tiempo</td>
</tr>
</tbody>
</table>

I need to build a dictionary from some of the cells in each tr so that I can create a dataframe later. The list should contain:

  • Team: Barcelona
  • Name: Player 1
  • Number: 16
  • Minute: 88
  • Team: Real Madrid
  • Name: Player 2
  • Number: 8
  • Minute: 12

As you can see, there are some tds that I don’t need, and I’d like to skip them in my final df.

I’ve tried this code (a simplified example), but it doesn’t work because I always get the first td of the row, which is the team name:

tabla = amonestaciones.find('table', class_='tabla-clasificacion-home marratua tablageneral tabla-actas')

rows = tabla.find_all('tr')

for row in rows:
    team = row.find('td')
    name = row.findNext('td')
    lista = {
        "Team": team,
        "Name": name
    }

This is the output I get (I’d also like to strip the HTML tags, but if I try .text or .get_text() I get the error ‘NoneType’ object has no attribute ‘text’):

{'Team': <td>Real Madrid</td>, 'Name': <td>Real Madrid</td>}

I sense that I’m very close to the solution, but I’m stuck and can’t move forward. Thanks in advance for your help!


Answer

If you feel like learning something new, you don’t even need to call bs4 yourself (pandas still uses it under the hood when you pass flavor="bs4"). All you need is pandas, which gives you a dataframe out of the box, to get this:

-  -----------  --------  --  ----------------  -----------------------------------------------  --  --------------
0  Barcelona    Player 1  16  Tarjeta Amarilla  Derribar a un contrario en la disputa del balón  88  Segundo tiempo
1  Real Madrid  Player 2   8  Tarjeta Amarilla  Sujetar a un adversario impidiendo su avance.    12  Primer tiempo
-  -----------  --------  --  ----------------  -----------------------------------------------  --  --------------

With this:

from io import StringIO

import pandas as pd
from tabulate import tabulate

sample_html = """
<table class="tabla-clasificacion-home marratua tablageneral tabla-actas">
<thead>
<tr>
<th scope="col">Team</th>
<th scope="col">Name</th>
<th scope="col">Number</th>
<th scope="col">Tipo</th>
<th scope="col">Motivo</th>
<th scope="col">Minute</th>
<th scope="col">Bloque</th>
</tr>
</thead>
<tbody>
<tr>
<td>Barcelona</td>
<td>Player 1</td>
<td>16</td>
<td>Tarjeta Amarilla</td>
<td>Derribar a un contrario en la disputa del balón</td>
<td>88</td>
<td>Segundo tiempo</td>
</tr> <tr>
<td>Real Madrid</td>
<td>Player 2</td>
<td>8</td>
<td>Tarjeta Amarilla</td>
<td>Sujetar a un adversario impidiendo su avance.</td>
<td>12</td>
<td>Primer tiempo</td>
</tr>
</tbody>
</table>
"""

# read_html returns a list of DataFrames, one per <table> in the markup.
# Wrapping the string in StringIO avoids the literal-HTML deprecation in recent pandas.
tables = pd.read_html(StringIO(sample_html), flavor="bs4")
df = pd.concat(tables)
print(tabulate(df))
df.to_csv("your_table.csv", index=False)

The code also dumps your table to a .csv file named your_table.csv.
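
Since the question only asks for Team, Name, Number and Minute, the unwanted columns can be dropped once the table is parsed. This is a minimal sketch, assuming the header names shown in the question's table; the short `html` string is a trimmed stand-in for the real page, and `wanted` and `records` are illustrative names, not part of the original code:

```python
from io import StringIO

import pandas as pd

# Trimmed stand-in for the table in the question (same header names).
html = """
<table>
<thead><tr><th>Team</th><th>Name</th><th>Number</th><th>Tipo</th>
<th>Motivo</th><th>Minute</th><th>Bloque</th></tr></thead>
<tbody>
<tr><td>Barcelona</td><td>Player 1</td><td>16</td><td>Tarjeta Amarilla</td>
<td>Derribar a un contrario</td><td>88</td><td>Segundo tiempo</td></tr>
<tr><td>Real Madrid</td><td>Player 2</td><td>8</td><td>Tarjeta Amarilla</td>
<td>Sujetar a un adversario</td><td>12</td><td>Primer tiempo</td></tr>
</tbody>
</table>
"""

# read_html returns a list of DataFrames; this snippet holds a single table.
# The default parser is used here; flavor="bs4" can be passed as in the answer.
df = pd.read_html(StringIO(html))[0]

# Keep only the columns the question asks for, in that order.
wanted = ["Team", "Name", "Number", "Minute"]
df = df[wanted]

# One dictionary per row, in case a list of dicts is needed as well.
records = df.to_dict("records")
print(records)
```

Selecting by header name rather than by position means the code should keep working if the site reorders the columns, as long as the header text stays the same.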

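
If you would rather stay with BeautifulSoup, the loop from the question can be adapted. `row.findNext('td')` searches forward from the row itself, so it returns the same first cell as `row.find('td')`, and the header row contains no `td` at all, which is where the ‘NoneType’ error comes from. This is a minimal sketch, assuming the column order shown in the question's table; the short `html` string stands in for the real page and `records` is an illustrative name:

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the table in the question (same column order).
html = """
<table class="tabla-clasificacion-home marratua tablageneral tabla-actas">
<thead><tr><th>Team</th><th>Name</th><th>Number</th><th>Tipo</th>
<th>Motivo</th><th>Minute</th><th>Bloque</th></tr></thead>
<tbody>
<tr><td>Barcelona</td><td>Player 1</td><td>16</td><td>Tarjeta Amarilla</td>
<td>Derribar a un contrario</td><td>88</td><td>Segundo tiempo</td></tr> <tr>
<td>Real Madrid</td><td>Player 2</td><td>8</td><td>Tarjeta Amarilla</td>
<td>Sujetar a un adversario</td><td>12</td><td>Primer tiempo</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# A single class name in class_ matches any element that carries that class.
tabla = soup.find("table", class_="tabla-actas")

records = []
for row in tabla.find_all("tr"):
    cells = row.find_all("td")
    if not cells:
        # Header row: it only has <th> cells, so there is nothing to extract.
        continue
    # get_text(strip=True) drops the tags and the surrounding whitespace.
    team, name, number, _tipo, _motivo, minute, _bloque = (
        cell.get_text(strip=True) for cell in cells
    )
    records.append({"Team": team, "Name": name, "Number": number, "Minute": minute})

print(records)
```

Passing `records` to `pd.DataFrame` then gives the dataframe. All values are strings at this point, so Number and Minute would need converting if they are to be treated as integers.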

User contributions licensed under: CC BY-SA