I am trying to parse the table from this website. I started with just the Username
column and, with the help I got on Stack Overflow, I was able to extract its contents
with the following code:
    from bs4 import BeautifulSoup

    # Read the saved page as one string; str(file.readlines()) would parse the repr of a list
    with open("Top 50 TikTok users sorted by Followers - Socialblade TikTok Stats _ TikTok Statistics.html", "r", encoding="utf-8") as file:
        soup = BeautifulSoup(file.read(), "html.parser")

    # Collect the link text (the username) from each matching row
    tiktok = []
    for tag in soup.select("div div:nth-of-type(n+5) > div > a"):
        tiktok.append(tag.text)

which gives me
    ['addison rae',
     'Bella Poarch',
     'Zach King',
     'TikTok',
     'Spencer X',
     'Will Smith',
     'Loren Gray',
     'dixie',
     'Michael Le',
     'Jason Derulo',
     'Riyaz',
     .
     .
     .

My ultimate goal is to populate the entire table with the columns [Rank, Grade, Username, Uploads, Followers, Following, Likes].
I have read a few articles on Parsing HTML Tables in Python with BeautifulSoup and pandas,
but those approaches didn't work because the data is not defined as a table in the page source. What are some alternatives for getting this into a table in Python?
Answer
You can use this code to load the HTML from the file into a soup and then parse the table into a DataFrame:
    import pandas as pd
    from bs4 import BeautifulSoup

    # Read the saved page explicitly as UTF-8 so emoji in usernames survive
    with open("page.html", "r", encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    data = []
    # Each table row is a <div> whose inline style contains one of two background colours
    for div in soup.select('div[style*="fafafa"], div[style*="f8f8f8"]'):
        data.append(
            [
                d.get_text(strip=True)
                for d in div.find_all("div", recursive=False)[:8]
            ]
        )

    df = pd.DataFrame(
        data,
        columns=[
            "Rank",
            "Grade",
            "Username",
            "Uploads",
            "Followers",
            "Following",
            "Likes",
            "Interactions",
        ],
    )
    print(df)
    df.to_csv("data.csv", index=False)

Prints:
        Rank  Grade           Username  Uploads    Followers  Following          Likes  Interactions
     0   1st    A++    charli d’amelio    1,755  113,600,000      1,210  9,200,000,000            --
     1   2nd    A++        addison rae    1,411   79,900,000      2,454  5,100,000,000            --
     2   3rd    A++       Bella Poarch      282   63,600,000        588  1,400,000,000            --
     3   4th    A++          Zach King      277   58,800,000         41    723,400,000            --
     4   5th    A++             TikTok      139   52,900,000        495    250,300,000            91
     5   6th    A++          Spencer X    1,250   52,700,000      7,206  1,300,000,000            --
     6   7th    A++         Will Smith       73   52,500,000         23    314,400,000            --
     7   8th    A++         Loren Gray    2,805   52,100,000        221  2,800,000,000            --
     8   9th    A++              dixie      120   51,200,000      1,267  2,900,000,000            --
     9  10th    A++         Michael Le    1,158   47,400,000         93  1,300,000,000            --
    10  11th     A+       Jason Derulo      675   44,900,000         12  1,000,000,000            --
    11  12th     A+              Riyaz    2,056   44,100,000         43  2,100,000,000            --
    12  13th     A+  Kimberly Loaiza ✨    1,150   41,000,000        123  2,200,000,000            --
    13  14th     A+       Brent Rivera      955   37,800,000        272  1,200,000,000            --
    14  15th     A+           cznburak    1,301   37,300,000          1    688,700,000            --
    15  16th     A+           The Rock       42   36,200,000          1    200,300,000            --
    16  17th     A+      James Charles      238   36,200,000        148    881,400,000            --
    17  18th     A+          BabyAriel    2,365   35,300,000        326  1,900,000,000            --
    18  19th     A+          JoJo Siwa    1,206   33,500,000        346  1,100,000,000            --
    19  20th     A+              avani    5,347   33,300,000      5,003  2,400,000,000            --
    20  21st     A+          GIL CROES      693   32,900,000        454    803,200,000            --
    21  22nd     A+      Faisal shaikh      461   32,200,000         --  2,000,000,000            --
    22  23rd     A+                BTS       39   32,000,000         --    557,100,000           255
    23  24th     A+           LILHUDDY    4,187   30,500,000      8,652  1,600,000,000            --
    24  25th     A+       Stokes Twins      548   30,100,000         21    781,000,000            --
    25  26th     A+                Joe    1,487   29,800,000      8,402  1,200,000,000            --
    26  27th     A+               ROD🥴    1,792   29,500,000        536  1,700,000,000            --
    27  28th     A+            𝙳𝚘𝚖𝚒𝚗𝚒𝚔      899   29,400,000        216  1,700,000,000            --
    28  29th     A+       Kylie Jenner       69   29,400,000         14    318,800,000            --
    29  30th     A+         Junya/じゅんや    2,823   29,000,000      1,934    533,800,000        12,200
    30  31st     A+                 YZ      816   28,900,000        563    554,700,000            --
    31  32nd     A+      Arishfa Khan🦁    2,026   28,600,000         27  1,100,000,000            --
    32  33rd     A+   Lucas and Marcus    1,248   28,500,000        158    806,500,000            --
    33  34th     A+    jannat_zubair29    1,054   28,200,000          6    746,300,000            47
    34  35th     A+     Nisha Guragain    1,751   28,000,000         33    756,300,000            --
    35  36th     A+       Selena Gomez       40   27,800,000         17     82,300,000            --
    36  37th     A+            Kris HC    1,049   27,800,000      1,405  1,200,000,000            --
    37  38th     A+        flighthouse    4,200   27,600,000        488  2,300,000,000            --
    38  39th     A+         wigofellas    1,251   27,500,000        812    707,200,000            --
    39  40th     A+   Savannah LaBrant    1,860   27,300,000        155  1,400,000,000            --
    40  41st     A+          noah beck    1,395   26,900,000      2,297  1,700,000,000            --
    41  42nd     A+         Liza Koshy      155   26,700,000        104    321,900,000            --
    42  43rd     A+   Kirya Kolesnikov    1,338   26,400,000         78    543,200,000            --
    43  44th     A+        Awez Darbar    2,708   26,100,000        208  1,100,000,000            --
    44  45th     A+       Carlos Feria    2,522   25,700,000        138  1,200,000,000            --
    45  46th     A+       Kira Kosarin      837   25,700,000        401    447,000,000            --
    46  47th     A+      Naim Darrechi🏆    2,634   25,300,000        527  2,200,000,000            --
    47  48th     A+      Josh Richards    1,899   24,900,000      9,847  1,600,000,000            --
    48  49th     A+             Q Park      231   24,800,000          3    294,100,000            --
    49  50th     A+       TikTok_India      186   24,500,000        191     40,100,000            --

It also saves the table as data.csv, which can be opened in a spreadsheet program such as LibreOffice.
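
Every cell in the table above is still a string ("113,600,000", "--"), because `get_text` returns text. If numeric values are needed, a small helper along these lines could convert them. This is only a sketch: `parse_count` is a hypothetical name, and treating "--" as a missing value is an assumption about what the placeholder means.

```python
from typing import Optional


def parse_count(text: str) -> Optional[int]:
    """Turn a string such as '113,600,000' into an int.

    Assumption: the '--' placeholder (and an empty cell) means "no value",
    so it is mapped to None rather than to zero.
    """
    cleaned = text.replace(",", "").strip()
    if cleaned in ("", "--"):
        return None
    return int(cleaned)


print(parse_count("113,600,000"))
print(parse_count("--"))
```

With the DataFrame from the answer, the helper could be applied per column, for example `df["Followers"] = df["Followers"].map(parse_count)`, before writing the CSV.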
EDIT: To also get the username that appears in each profile URL:
    import pandas as pd
    from bs4 import BeautifulSoup

    # Read the saved page explicitly as UTF-8 so emoji in usernames survive
    with open("page.html", "r", encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    data = []
    for div in soup.select('div[style*="fafafa"], div[style*="f8f8f8"]'):
        data.append(
            [
                d.get_text(strip=True)
                for d in div.find_all("div", recursive=False)[:8]
            ]
            # The last path segment of the row's first link is the URL username
            + [div.a["href"].split("/")[-1]]
        )

    df = pd.DataFrame(
        data,
        columns=[
            "Rank",
            "Grade",
            "Username",
            "Uploads",
            "Followers",
            "Following",
            "Likes",
            "Interactions",
            "URL username",
        ],
    )

    print(df)
    df.to_csv("data.csv", index=False)
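
Both snippets rely on the same idea: the rows are plain `<div>` elements that can be recognised by a colour fragment in their inline `style` attribute, and the cells are their direct child `<div>`s. The sketch below shows that mechanism on a tiny made-up fragment. The markup, names, and links are hypothetical and only imitate the structure the selector implies; the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one header row that must be skipped and two data rows
# with alternating background colours, as the selector in the answer implies.
html = """
<div style="background: #e0e0e0;"><div>Rank</div><div>Username</div></div>
<div style="background: #fafafa;"><div>1st</div><div><a href="/user/alice">Alice</a></div></div>
<div style="background: #f8f8f8;"><div>2nd</div><div><a href="/user/bob">Bob</a></div></div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
# [style*="..."] matches any element whose style attribute contains the substring
for row in soup.select('div[style*="fafafa"], div[style*="f8f8f8"]'):
    # recursive=False keeps only the direct child <div>s, i.e. the cells
    cells = [cell.get_text(strip=True) for cell in row.find_all("div", recursive=False)]
    rows.append(cells)

print(rows)
```

The header row is skipped because its style contains neither colour fragment, and the inner cell `<div>`s are not matched as rows because they carry no `style` attribute at all.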