I am trying to parse the table from this website. I started with just the Username
column, and with the help I got on Stack Overflow I was able to get its contents
with the following code:
    from bs4 import BeautifulSoup

    with open("Top 50 TikTok users sorted by Followers - Socialblade TikTok Stats _ TikTok Statistics.html", "r", encoding="utf-8") as file:
        # read() returns the whole file as one string, which is what the parser expects
        soup = BeautifulSoup(file.read(), "html.parser")

    tiktok = []
    # each matching <a> holds one username
    for tag in soup.select("div div:nth-of-type(n+5) > div > a"):
        tiktok.append(tag.text)
which gives me
['addison rae', 'Bella Poarch', 'Zach King', 'TikTok', 'Spencer X', 'Will Smith', 'Loren Gray', 'dixie', 'Michael Le', 'Jason Derulo', 'Riyaz', . . .
My ultimate goal is to extract the entire table, with the columns [Rank, Grade, Username, Uploads, Followers, Following, Likes].
I have read a few articles on parsing HTML tables in Python with BeautifulSoup and pandas,
but they didn't work because the data is not defined as a table in the source. What are some alternatives for getting this into a table in Python?
Answer
You can use this code to load the HTML from a file into a soup and then parse the table into a DataFrame:
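    import pandas as pd
    from bs4 import BeautifulSoup

    # Load the saved page; the context manager closes the file afterwards.
    with open("page.html", "r", encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    data = []
    # The rows are plain <div>s, selected by their alternating background colours.
    for div in soup.select('div[style*="fafafa"], div[style*="f8f8f8"]'):
        # The first eight direct child <div>s are the cells of one row.
        data.append(
            [
                d.get_text(strip=True)
                for d in div.find_all("div", recursive=False)[:8]
            ]
        )

    df = pd.DataFrame(
        data,
        columns=[
            "Rank",
            "Grade",
            "Username",
            "Uploads",
            "Followers",
            "Following",
            "Likes",
            "Interactions",
        ],
    )
    print(df)
    df.to_csv("data.csv", index=False)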
Prints:
        Rank  Grade            Username  Uploads    Followers  Following          Likes  Interactions
     0   1st    A++     charli d’amelio    1,755  113,600,000      1,210  9,200,000,000            --
     1   2nd    A++         addison rae    1,411   79,900,000      2,454  5,100,000,000            --
     2   3rd    A++        Bella Poarch      282   63,600,000        588  1,400,000,000            --
     3   4th    A++           Zach King      277   58,800,000         41    723,400,000            --
     4   5th    A++              TikTok      139   52,900,000        495    250,300,000            91
     5   6th    A++           Spencer X    1,250   52,700,000      7,206  1,300,000,000            --
     6   7th    A++          Will Smith       73   52,500,000         23    314,400,000            --
     7   8th    A++          Loren Gray    2,805   52,100,000        221  2,800,000,000            --
     8   9th    A++               dixie      120   51,200,000      1,267  2,900,000,000            --
     9  10th    A++          Michael Le    1,158   47,400,000         93  1,300,000,000            --
    10  11th     A+        Jason Derulo      675   44,900,000         12  1,000,000,000            --
    11  12th     A+               Riyaz    2,056   44,100,000         43  2,100,000,000            --
    12  13th     A+   Kimberly Loaiza ✨    1,150   41,000,000        123  2,200,000,000            --
    13  14th     A+        Brent Rivera      955   37,800,000        272  1,200,000,000            --
    14  15th     A+            cznburak    1,301   37,300,000          1    688,700,000            --
    15  16th     A+            The Rock       42   36,200,000          1    200,300,000            --
    16  17th     A+       James Charles      238   36,200,000        148    881,400,000            --
    17  18th     A+           BabyAriel    2,365   35,300,000        326  1,900,000,000            --
    18  19th     A+           JoJo Siwa    1,206   33,500,000        346  1,100,000,000            --
    19  20th     A+               avani    5,347   33,300,000      5,003  2,400,000,000            --
    20  21st     A+           GIL CROES      693   32,900,000        454    803,200,000            --
    21  22nd     A+       Faisal shaikh      461   32,200,000         --  2,000,000,000            --
    22  23rd     A+                 BTS       39   32,000,000         --    557,100,000           255
    23  24th     A+            LILHUDDY    4,187   30,500,000      8,652  1,600,000,000            --
    24  25th     A+        Stokes Twins      548   30,100,000         21    781,000,000            --
    25  26th     A+                 Joe    1,487   29,800,000      8,402  1,200,000,000            --
    26  27th     A+                ROD🥴    1,792   29,500,000        536  1,700,000,000            --
    27  28th     A+             𝙳𝚘𝚖𝚒𝚗𝚒𝚔      899   29,400,000        216  1,700,000,000            --
    28  29th     A+        Kylie Jenner       69   29,400,000         14    318,800,000            --
    29  30th     A+          Junya/じゅんや    2,823   29,000,000      1,934    533,800,000        12,200
    30  31st     A+                  YZ      816   28,900,000        563    554,700,000            --
    31  32nd     A+       Arishfa Khan🦁    2,026   28,600,000         27  1,100,000,000            --
    32  33rd     A+    Lucas and Marcus    1,248   28,500,000        158    806,500,000            --
    33  34th     A+     jannat_zubair29    1,054   28,200,000          6    746,300,000            47
    34  35th     A+      Nisha Guragain    1,751   28,000,000         33    756,300,000            --
    35  36th     A+        Selena Gomez       40   27,800,000         17     82,300,000            --
    36  37th     A+             Kris HC    1,049   27,800,000      1,405  1,200,000,000            --
    37  38th     A+         flighthouse    4,200   27,600,000        488  2,300,000,000            --
    38  39th     A+          wigofellas    1,251   27,500,000        812    707,200,000            --
    39  40th     A+    Savannah LaBrant    1,860   27,300,000        155  1,400,000,000            --
    40  41st     A+           noah beck    1,395   26,900,000      2,297  1,700,000,000            --
    41  42nd     A+          Liza Koshy      155   26,700,000        104    321,900,000            --
    42  43rd     A+    Kirya Kolesnikov    1,338   26,400,000         78    543,200,000            --
    43  44th     A+         Awez Darbar    2,708   26,100,000        208  1,100,000,000            --
    44  45th     A+        Carlos Feria    2,522   25,700,000        138  1,200,000,000            --
    45  46th     A+        Kira Kosarin      837   25,700,000        401    447,000,000            --
    46  47th     A+      Naim Darrechi🏆    2,634   25,300,000        527  2,200,000,000            --
    47  48th     A+       Josh Richards    1,899   24,900,000      9,847  1,600,000,000            --
    48  49th     A+              Q Park      231   24,800,000          3    294,100,000            --
    49  50th     A+        TikTok_India      186   24,500,000        191     40,100,000            --
It also saves the table to data.csv, which can be opened in a spreadsheet program such as LibreOffice.
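Note that every value in the DataFrame is a string, such as "113,600,000", and missing entries show up as "--". If you need numeric columns, one possible follow-up is sketched below. This is an addition on top of the answer, not part of it: the helper name `to_number` is made up, and the two-row sample frame only copies a few values from the printed output so that the snippet runs on its own.

```python
import pandas as pd

# Two sample rows copied from the printed output; the real df has all 50 rows.
df = pd.DataFrame(
    {
        "Username": ["TikTok", "Faisal shaikh"],
        "Followers": ["52,900,000", "32,200,000"],
        "Following": ["495", "--"],
    }
)


def to_number(column: pd.Series) -> pd.Series:
    """Strip thousands separators; text that is not a number (e.g. "--") becomes NaN."""
    cleaned = column.str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")


for name in ["Followers", "Following"]:
    df[name] = to_number(df[name])

print(df)
```

A column that contains "--" ends up as a floating-point column, because NaN is used for the missing value.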
EDIT: To also get the username from the profile URL:
    import pandas as pd
    from bs4 import BeautifulSoup

    # Load the saved page; the context manager closes the file afterwards.
    with open("page.html", "r", encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    data = []
    for div in soup.select('div[style*="fafafa"], div[style*="f8f8f8"]'):
        data.append(
            [
                d.get_text(strip=True)
                for d in div.find_all("div", recursive=False)[:8]
            ]
            # The last path segment of the row's first link is the URL username.
            + [div.a["href"].split("/")[-1]]
        )

    df = pd.DataFrame(
        data,
        columns=[
            "Rank",
            "Grade",
            "Username",
            "Uploads",
            "Followers",
            "Following",
            "Likes",
            "Interactions",
            "URL username",
        ],
    )
    print(df)
    df.to_csv("data.csv", index=False)
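To see why the selector and the `href` split work, here is a small self-contained sketch that runs the same logic on an inline HTML string instead of the saved page. The markup, the example.com links, and the handles in them are made up for illustration and only approximate the real page. The selector, `find_all(..., recursive=False)`, and the `href` handling are taken from the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: rows are plain <div>s told apart by their background colour.
html = """
<div style="background: #fafafa;">
  <div>1st</div><div>A++</div>
  <div><a href="https://example.com/tiktok/user/charlidamelio">charli d'amelio</a></div>
</div>
<div style="background: #f8f8f8;">
  <div>2nd</div><div>A++</div>
  <div><a href="https://example.com/tiktok/user/addisonre">addison rae</a></div>
</div>
<div style="background: #ffffff;"><div>not a table row</div></div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
# [style*="..."] matches elements whose style attribute contains the given substring.
for div in soup.select('div[style*="fafafa"], div[style*="f8f8f8"]'):
    # recursive=False keeps only the direct child <div>s, i.e. the cells of the row.
    cells = [d.get_text(strip=True) for d in div.find_all("div", recursive=False)]
    # div.a is the first <a> inside the row; its last path segment is the handle.
    cells.append(div.a["href"].split("/")[-1])
    rows.append(cells)

print(rows)
```

The third `<div>` is skipped because its style contains neither colour, which is the same mechanism that separates the table rows from the rest of the page.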