How to parse HTML table that is inside div and not table in Python

I am trying to parse the table from this website. I started with just the Username column, and with help I got on Stack Overflow I was able to extract the usernames with the following code:

from bs4 import BeautifulSoup

with open("Top 50 TikTok users sorted by Followers - Socialblade TikTok Stats _ TikTok Statistics.html", "r", encoding="utf-8") as file:
    soup = BeautifulSoup(file.read(), "html.parser")

tiktok = []
for tag in soup.select("div div:nth-of-type(n+5) > div > a"):
    tiktok.append(tag.text)

which gives me

['addison rae',
 'Bella Poarch',
 'Zach King',
 'TikTok',
 'Spencer X',
 'Will Smith',
 'Loren Gray',
 'dixie',
 'Michael Le',
 'Jason Derulo',
 'Riyaz',
.
.
.

My ultimate goal is to populate the entire table with the columns [Rank, Grade, Username, Uploads, Followers, Following, Likes].

I have read a few articles on Parsing HTML Tables in Python with BeautifulSoup and pandas, but those approaches didn't work because the data is not marked up as a table in the source. What are some alternatives for getting this into a table in Python?

Answer

You can use the following code to load the HTML from a file into BeautifulSoup and then parse the table into a DataFrame:

import pandas as pd
from bs4 import BeautifulSoup

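# Use a context manager so the file is closed, and read it as UTF-8.
with open("page.html", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")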

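data = []
# Select the row <div>s: their inline style contains "fafafa" or "f8f8f8".
for div in soup.select('div[style*="fafafa"], div[style*="f8f8f8"]'):
    data.append(
        [
            d.get_text(strip=True)
            # The first 8 direct child <div>s of a row are its cells.
            for d in div.find_all("div", recursive=False)[:8]
        ]
    )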


df = pd.DataFrame(
    data,
    columns=[
        "Rank",
        "Grade",
        "Username",
        "Uploads",
        "Followers",
        "Following",
        "Likes",
        "Interactions",
    ],
)
print(df)
df.to_csv("data.csv", index=False)

Prints:

    Rank Grade           Username Uploads    Followers Following          Likes Interactions
0    1st   A++    charli d’amelio   1,755  113,600,000     1,210  9,200,000,000           --
1    2nd   A++        addison rae   1,411   79,900,000     2,454  5,100,000,000           --
2    3rd   A++       Bella Poarch     282   63,600,000       588  1,400,000,000           --
3    4th   A++          Zach King     277   58,800,000        41    723,400,000           --
4    5th   A++             TikTok     139   52,900,000       495    250,300,000           91
5    6th   A++          Spencer X   1,250   52,700,000     7,206  1,300,000,000           --
6    7th   A++         Will Smith      73   52,500,000        23    314,400,000           --
7    8th   A++         Loren Gray   2,805   52,100,000       221  2,800,000,000           --
8    9th   A++              dixie     120   51,200,000     1,267  2,900,000,000           --
9   10th   A++         Michael Le   1,158   47,400,000        93  1,300,000,000           --
10  11th    A+       Jason Derulo     675   44,900,000        12  1,000,000,000           --
11  12th    A+              Riyaz   2,056   44,100,000        43  2,100,000,000           --
12  13th    A+  Kimberly Loaiza ✨   1,150   41,000,000       123  2,200,000,000           --
13  14th    A+       Brent Rivera     955   37,800,000       272  1,200,000,000           --
14  15th    A+           cznburak   1,301   37,300,000         1    688,700,000           --
15  16th    A+           The Rock      42   36,200,000         1    200,300,000           --
16  17th    A+      James Charles     238   36,200,000       148    881,400,000           --
17  18th    A+          BabyAriel   2,365   35,300,000       326  1,900,000,000           --
18  19th    A+          JoJo Siwa   1,206   33,500,000       346  1,100,000,000           --
19  20th    A+              avani   5,347   33,300,000     5,003  2,400,000,000           --
20  21st    A+          GIL CROES     693   32,900,000       454    803,200,000           --
21  22nd    A+      Faisal shaikh     461   32,200,000        --  2,000,000,000           --
22  23rd    A+                BTS      39   32,000,000        --    557,100,000          255
23  24th    A+           LILHUDDY   4,187   30,500,000     8,652  1,600,000,000           --
24  25th    A+       Stokes Twins     548   30,100,000        21    781,000,000           --
25  26th    A+                Joe   1,487   29,800,000     8,402  1,200,000,000           --
26  27th    A+               ROD🥴   1,792   29,500,000       536  1,700,000,000           --
27  28th    A+            𝙳𝚘𝚖𝚒𝚗𝚒𝚔     899   29,400,000       216  1,700,000,000           --
28  29th    A+       Kylie Jenner      69   29,400,000        14    318,800,000           --
29  30th    A+         Junya/じゅんや   2,823   29,000,000     1,934    533,800,000       12,200
30  31st    A+                 YZ     816   28,900,000       563    554,700,000           --
31  32nd    A+      Arishfa Khan🦁   2,026   28,600,000        27  1,100,000,000           --
32  33rd    A+   Lucas and Marcus   1,248   28,500,000       158    806,500,000           --
33  34th    A+    jannat_zubair29   1,054   28,200,000         6    746,300,000           47
34  35th    A+     Nisha Guragain   1,751   28,000,000        33    756,300,000           --
35  36th    A+       Selena Gomez      40   27,800,000        17     82,300,000           --
36  37th    A+            Kris HC   1,049   27,800,000     1,405  1,200,000,000           --
37  38th    A+        flighthouse   4,200   27,600,000       488  2,300,000,000           --
38  39th    A+         wigofellas   1,251   27,500,000       812    707,200,000           --
39  40th    A+   Savannah LaBrant   1,860   27,300,000       155  1,400,000,000           --
40  41st    A+          noah beck   1,395   26,900,000     2,297  1,700,000,000           --
41  42nd    A+         Liza Koshy     155   26,700,000       104    321,900,000           --
42  43rd    A+   Kirya Kolesnikov   1,338   26,400,000        78    543,200,000           --
43  44th    A+        Awez Darbar   2,708   26,100,000       208  1,100,000,000           --
44  45th    A+       Carlos Feria   2,522   25,700,000       138  1,200,000,000           --
45  46th    A+       Kira Kosarin     837   25,700,000       401    447,000,000           --
46  47th    A+     Naim Darrechi🏆   2,634   25,300,000       527  2,200,000,000           --
47  48th    A+      Josh Richards   1,899   24,900,000     9,847  1,600,000,000           --
48  49th    A+             Q Park     231   24,800,000         3    294,100,000           --
49  50th    A+       TikTok_India     186   24,500,000       191     40,100,000           --

It also saves the table to data.csv.
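
Note that every column in the printed frame holds strings: the counts contain thousands separators and missing values appear as "--". If numeric columns are needed for sorting or arithmetic, one possible approach is sketched below. The helper name `to_number` and the two-row sample frame are made up for illustration (the values are copied from the output above); the real `df` could be passed through the same helper.

```python
import pandas as pd

# Tiny stand-in for the scraped frame; values copied from the printed output.
df = pd.DataFrame(
    {
        "Username": ["TikTok", "Faisal shaikh"],
        "Followers": ["52,900,000", "32,200,000"],
        "Following": ["495", "--"],
    }
)


def to_number(column: pd.Series) -> pd.Series:
    """Strip thousands separators; text that is not a number (e.g. "--") becomes NaN."""
    cleaned = column.str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")


# Hypothetical choice of columns to convert; extend the list as needed.
for name in ["Followers", "Following"]:
    df[name] = to_number(df[name])

print(df.dtypes)
```

Using `errors="coerce"` is a design choice here: it keeps the row and marks the missing count as NaN instead of raising on "--".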


EDIT: To also get the username from each profile URL:

import pandas as pd
from bs4 import BeautifulSoup

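# Use a context manager so the file is closed, and read it as UTF-8.
with open("page.html", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")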

data = []
for div in soup.select('div[style*="fafafa"], div[style*="f8f8f8"]'):
    data.append(
        [
            d.get_text(strip=True)
            for d in div.find_all("div", recursive=False)[:8]
        ]
        # The last path segment of the row's first link is the URL username.
        + [div.a["href"].split("/")[-1]]
    )


df = pd.DataFrame(
    data,
    columns=[
        "Rank",
        "Grade",
        "Username",
        "Uploads",
        "Followers",
        "Following",
        "Likes",
        "Interactions",
        "URL username",
    ],
)

print(df)
df.to_csv("data.csv", index=False)
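
To try the selector logic without the saved page, here is a minimal sketch that runs on an inline HTML fragment. The markup, the `example.com` URLs, and the handles `user_one` and `user_two` are invented to mimic the layout the selectors assume (one outer div per row, identified by its inline background colour); the real page may differ. The sketch also guards against two cases the code above does not handle: a row without a link and a URL with a trailing slash.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one outer <div> per row, one child <div> per cell.
html = """
<div style="background: #fafafa;">
  <div>2nd</div><div>A++</div>
  <div><a href="https://example.com/tiktok/user/user_one">addison rae</a></div>
  <div>1,411</div>
</div>
<div style="background: #f8f8f8;">
  <div>3rd</div><div>A++</div>
  <div><a href="https://example.com/tiktok/user/user_two/">Bella Poarch</a></div>
  <div>282</div>
</div>
<div style="background: #ffffff;"><div>not a data row</div></div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for div in soup.select('div[style*="fafafa"], div[style*="f8f8f8"]'):
    # Direct children only, so nested <div>s inside a cell are not counted.
    cells = [d.get_text(strip=True) for d in div.find_all("div", recursive=False)]
    link = div.find("a", href=True)
    # rstrip("/") avoids an empty last segment when the URL ends with a slash.
    slug = link["href"].rstrip("/").split("/")[-1] if link else None
    rows.append(cells + [slug])

for row in rows:
    print(row)
```

The third div is skipped because its style contains neither colour code, which is the same mechanism that keeps page chrome out of the result.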
User contributions licensed under: CC BY-SA