
Parsing out text without a tag

I have been trying to parse out text that isn't wrapped in any tag. I wanted to build a little scraping tool for myself to help find good DND games to play on Roll20 (my end goal is to put this data into a table alongside each game's link).

The URL I am parsing info from is here: Roll20 Link

I had an idea to try to parse out the text and then put each new line into a list of its own and grab the elements needed. I wanted to grab the info on the game, current players, and current open slots. Here is the code I have done so far. Any suggestions on what I might need to do to scrape this particular data?

Here is my code:


Answer

I started off by looking at the page's source code and searching for a known string (like part of a game description). It seems every description is inside a <td class='gminfo'>, but its parent element, the <tr>, is more interesting, as it contains all the desired data. Notice that all of these <tr> tags have something in common: the data-listingid attribute.

So let's get all of those.
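As a rough sketch of that step, the snippet below collects the text of every <tr> that carries a data-listingid attribute. It uses the standard library's HTMLParser as a stand-in, not necessarily the library this answer originally used. The sample markup, the ListingCollector class and the collect_listings helper are all made up for illustration; only the data-listingid attribute and the gminfo class come from the page.

```python
from html.parser import HTMLParser

# Hypothetical markup: only the data-listingid attribute and the gminfo
# class come from the answer; the rest is invented for the example.
SAMPLE_HTML = """
<table>
  <tr class="header"><td>Not a listing</td></tr>
  <tr data-listingid="101">
    <td class="gminfo">Curse of Strahd, beginners welcome</td>
    <td>Players: 3 / 5</td>
  </tr>
  <tr data-listingid="102">
    <td class="gminfo">Homebrew one-shot</td>
    <td>Players: 1 / 4</td>
  </tr>
</table>
"""


class ListingCollector(HTMLParser):
    """Collect the text of every <tr> that has a data-listingid attribute."""

    def __init__(self):
        super().__init__()
        self.listings = {}     # listing id -> list of text fragments
        self._current = None   # id of the listing row being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # Rows without the attribute leave _current as None.
            self._current = dict(attrs).get("data-listingid")
            if self._current is not None:
                self.listings[self._current] = []

    def handle_endtag(self, tag):
        if tag == "tr":
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if self._current is not None and text:
            self.listings[self._current].append(text)


def collect_listings(html):
    """Return a dict mapping each listing id to its text fragments."""
    parser = ListingCollector()
    parser.feed(html)
    return parser.listings


listings = collect_listings(SAMPLE_HTML)
for listing_id, parts in listings.items():
    print(listing_id, parts)
```

Keying on the attribute rather than on a class name means header rows and other filler rows are skipped without any extra checks.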


Then we start parsing, with regex.
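To give an idea of what the regex step can look like, here is a minimal sketch. The listing text and the "Players: 3 / 5" wording are assumptions made up for the example, so the pattern will almost certainly need adjusting to whatever the real rows contain. PLAYERS_RE and parse_listing are hypothetical names.

```python
import re

# Hypothetical listing text; the real page's wording may differ.
listing_text = "Curse of Strahd, beginners welcome  Players: 3 / 5"

# Assumed layout: "Players: <current> / <total>".
PLAYERS_RE = re.compile(r"Players:\s*(\d+)\s*/\s*(\d+)")


def parse_listing(text):
    """Split a listing's text into description, player count and open slots."""
    match = PLAYERS_RE.search(text)
    if match is None:
        return None  # the text did not follow the assumed layout
    current, total = int(match.group(1)), int(match.group(2))
    return {
        # Everything before the player count is treated as the description.
        "description": text[:match.start()].strip(),
        "players": current,
        "open_slots": total - current,
    }


print(parse_listing(listing_text))
```

Returning None on a miss makes it easy to spot rows where the pattern needs tweaking instead of silently producing wrong numbers.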


Gives:


This is by no means a perfect solution, and the parsing is rough, but it should get you going.

Of course, run this over each page number you want (the /?page=0 in the URL). If you want the full description of a listing, you'll have to GET it separately, using the link in its Read More <a> tag. You can't rely on listing.text for that, though, because it strips the tag away.
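A small sketch of both ideas, again with the standard library only: building one URL per page number, and pulling the href out of a Read More link by looking at the tag itself instead of the stripped text. BASE_URL is a placeholder rather than the real address, the row markup is invented, and page_urls and ReadMoreLinks are hypothetical helpers. Actually fetching each URL is left out.

```python
from html.parser import HTMLParser

# Placeholder, not the real listing URL.
BASE_URL = "https://example.com/listings/"


def page_urls(base, pages):
    """Build one URL per page number, starting at page 0."""
    return [f"{base}?page={n}" for n in range(pages)]


class ReadMoreLinks(HTMLParser):
    """Collect the href of every <a> whose text contains 'Read More'."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href and "Read More" in data:
            self.links.append(self._href)

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


# Hypothetical row markup, invented for the example.
SAMPLE_ROW = (
    '<tr data-listingid="101"><td class="gminfo">Short blurb... '
    '<a href="/listing/101">Read More</a></td></tr>'
)

finder = ReadMoreLinks()
finder.feed(SAMPLE_ROW)
print(page_urls(BASE_URL, 3))
print(finder.links)
```

Each collected href would then be requested on its own to get the full description.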

Also, this isn’t legal advice or anything, but I wouldn’t be surprised if this is against their site policy, so be wary.

User contributions licensed under: CC BY-SA