I have been trying to parse out text without any tags. Wanted to build a little scraping tool for myself to help find good DND games to play on Roll20 (I was going to take this data and attach it to a table within each link for the final goal).
The URL I am parsing out info is here: Roll20 Link
I had an idea to try to parse out the text and then put each new line into a list of its own and grab the elements needed. I wanted to grab the info on the game, current players, and current open slots. Here is the code I have done so far. Any suggestions on what I might need to do to scrape this particular data?
Here is my code:
```python
import requests
from bs4 import BeautifulSoup
import time

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36'}
url = r'https://app.roll20.net/lfg/search//?page=0&days=thursday,friday&dayhours=1652932800,1653019200&frequency=onceweekly,biweekly,monthly&timeofday=&timeofday_seconds=&language=English&avpref=Any&gametype=Any&newplayer=false&yesmaturecontent=false&nopaytoplay=false&playingstructured=dnd_next&sortby=relevance&for_event=&roll20con='

r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
time.sleep(2)

games = soup.find_all('tr', {'class': 'lfglisting'})
game_urls = []
for item in games:
    # item_title = item.find('a', {'class': 'lfglistingname'}).text
    # item_url = 'https://app.roll20.net' + item.find('a', {'class': 'lfglistingname'})['href']
    current_players = item.get_text("\n", strip=True)
    print(current_players)
    # try:
    #     item_game = item.find('strong', {'class': 'label label-success'}).text
    # except:
    #     item_game = 'Role-Playing Game'
    # try:
    #     item_pay = item.find('strong', {'class': 'label label-danger'}).text
    # except:
    #     item_pay = 'Free to Play'
    # try:
    #     item_welcome = item.find('strong', {'class': 'label label-info'}).text
    # except:
    #     item_welcome = 'Experts Only'
    # print(f"Game: {item_title}. URL: {item_url}. Notes on Game: {item_game}, {item_pay}, {item_welcome}")
    # game_urls.append(item_url)
# print(game_urls)
```
Answer
I started off by looking at the source code of the page and searching for a known string (like part of a game description).
It seems every description is inside a `<td class='gminfo'>`, but its parent element, the `<tr>`, is more interesting, as it contains all the desired data. Notice that all of these `<tr>` tags have something in common: the `data-listingid` attribute. So let's get all of those:
```python
for x in soup.select('tr[data-listingid]'):
    print(x.text.strip())
```
Then we start parsing, with regex:
```python
import re

def print_data(dct):
    for item, amount in dct.items():
        print(f"{item} {'-'*(30 - len(item))} {amount}")

soup = BeautifulSoup(r.text, 'html.parser')
listings = soup.select('tr[data-listingid]')
listings_count = len(listings)
print(f"Expecting {listings_count} listings")

parsed_listings = []
for listing in listings:
    game = listing.text.strip()
    try:
        name = re.search("\n{6}(.*)", game).group(1)
        info = re.search("\n{3} (.*)", game).group(1) + "..."
        current_players = re.search("(.*) Current Players", game).group(1)
        open_slots = re.search("\((.*) Open Slots", game).group(1)
        game = {"Name": name, "Info": info,
                "Current_Players": current_players, "Open_Slots": open_slots}
        parsed_listings.append(game)
        print_data(game)
        print("\n=======\n")
    except Exception as e:
        # print(e)
        pass

print(f"parsed {len(parsed_listings)} of {listings_count} total")
```
Gives:
```
Expecting 30 listings
Name -------------------------- Curse of Strahd - Grim Hollow/High RP
Info -------------------------- Take this opportunity to play the most popular D&D module ever made with an expert DM who cares about your backstory and wants to...
Current_Players --------------- 1
Open_Slots -------------------- 5

=======

Name -------------------------- The Dragon of Icespire Peak (Monday)
Info -------------------------- Dragon of Icespire Peak is the introductory adventure for the 5th Edition Starter Set, designed for PC levels 1 – 6. It is a...
Current_Players --------------- 1
Open_Slots -------------------- 6

=======

Name -------------------------- Necropolis
Info -------------------------- What ancient horrors lie slumbering in a newly discovered tomb deep in Egypt's Valley of the Kings? Are you allowing local superstitions and the...
Current_Players --------------- 1
Open_Slots -------------------- 4

=======

Name -------------------------- Weekly One-shots (Monday)
Info -------------------------- My car for my primary means of income (Uber) has died and I'm **urgently** trying to raise funds to replace it. If you'd like...
Current_Players --------------- 1
Open_Slots -------------------- 7

=======

Name -------------------------- dragonball z
Info -------------------------- hello all those to whom love dragonball z! i have never DM before but i am willing to give it a chance. im trying...
Current_Players --------------- 1
Open_Slots -------------------- 3

=======

Name -------------------------- Weekly One-shots (Monday)
Info -------------------------- My car for my primary means of income (Uber) has died and I'm **urgently** trying to raise funds to replace it. If you'd like...
Current_Players --------------- 1
Open_Slots -------------------- 7

=======

Name -------------------------- Larula's Tomb
Info -------------------------- 3 Hour, Level 3 One Shot. Gritty, old school feel. Death possible. Backup characters provided. Roll 3d6 straight for stats. Roll for HP. The...
Current_Players --------------- 1
Open_Slots -------------------- 6

=======

Name -------------------------- Vast Stories of Erstonia
Info -------------------------- Vast Stories of Erstonia is a D&D 5e group devoted to playing a series of oneshots provided by the DM. The adventures will be...
Current_Players --------------- 1
Open_Slots -------------------- 4

=======

Name -------------------------- Beasts of Fortune 2
Info -------------------------- The Beasts of Fortune seeks adventures seeking fame, fortune, honor, or just a reason to smack some heads, come one come all to join...
Current_Players --------------- 1
Open_Slots -------------------- 20

=======
...

parsed 22 of 30 total
```
This is by no means a perfect solution; the parsing isn't perfect at all, but it should get you going.
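If the newline-counting regex proves too brittle, one alternative is to pull fields out of the cells directly. This is only a sketch against a hand-written sample row: the `data-listingid` attribute and the `td.gminfo` cell are the ones observed above, but the exact cell layout of the live page may differ, so verify the selectors against the real markup.

```python
from bs4 import BeautifulSoup

# Hand-written sample mirroring the structure described above; the real rows
# have more columns, but each carries data-listingid and a td.gminfo cell.
sample = """
<table>
  <tr data-listingid="123">
    <td class="gminfo">Curse of Strahd - play the most popular module...</td>
  </tr>
  <tr data-listingid="456">
    <td class="gminfo">Necropolis - what ancient horrors lie slumbering...</td>
  </tr>
</table>
"""

soup = BeautifulSoup(sample, "html.parser")
listings = []
for row in soup.select("tr[data-listingid]"):
    info_cell = row.select_one("td.gminfo")  # the description lives here
    listings.append({
        "id": row["data-listingid"],
        "info": info_cell.get_text(strip=True) if info_cell else "",
    })

print(listings)
```

Selecting by attribute and class this way survives cosmetic whitespace changes that would break the `\n{6}` style patterns.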
Of course, run this over each page number you want (the `/?page=0` in the URL).
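The pagination loop can be sketched by rewriting the `page` query parameter. The `search_url` below is a shortened stand-in for the full search URL from the question; the assumption that pages simply increment from 0 should be checked against the site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

def page_url(base_url, page):
    """Return base_url with its 'page' query parameter set to `page`."""
    parts = urlsplit(base_url)
    # keep_blank_values preserves empty params like timeofday=
    query = parse_qs(parts.query, keep_blank_values=True)
    query["page"] = [str(page)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))

# shortened stand-in for the question's full search URL
search_url = 'https://app.roll20.net/lfg/search//?page=0&days=thursday,friday&language=English'
for page in range(3):
    url = page_url(search_url, page)
    print(url)
    # r = requests.get(url, headers=headers)  # then parse each page as above
```

Rebuilding the query string this way beats string concatenation because it re-encodes every parameter consistently, whatever order they appear in.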
If you want the full description of a listing, you're going to have to GET it separately, specifically the URL in the Read More `<a>` tag. But then you can't use `listing.text`, as it strips that tag (and its link) away.
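Extracting that per-listing URL could look like the sketch below. The `lfglistingname` class is taken from the question's own code, and the sample row and its `href` are hypothetical, so confirm both against the live markup before relying on them.

```python
from bs4 import BeautifulSoup

# Hypothetical sample row; the a.lfglistingname selector comes from the
# question's code, and the href value here is made up for illustration.
sample = ('<tr data-listingid="1"><td>'
          '<a class="lfglistingname" href="/lfg/listing/123/some-game">Some Game</a>'
          '</td></tr>')

row = BeautifulSoup(sample, "html.parser").select_one("tr[data-listingid]")
link = row.select_one("a.lfglistingname")       # keep the tag, not just .text
full_url = "https://app.roll20.net" + link["href"]
print(full_url)

# detail = requests.get(full_url, headers=headers)  # then parse the full
# description out of detail.text with another BeautifulSoup pass
```

The key point is to read the `href` attribute off the tag object itself, rather than flattening the row with `.text` first.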
Also, this isn’t legal advice or anything, but I wouldn’t be surprised if this is against their site policy, so be wary.