
Better way of capturing multiple same tags?

I’m trying to create a scraper that collects download links. I wanted to use regex, but that would be a nightmare for me, so I found this library called BeautifulSoup. I’m trying to capture the URLs inside the <p> children of <div class="article-content">, where each <h3> is the name for the URLs that follow it. I don’t want to combine all the URLs into one list; instead I use a dictionary whose key is the name (the <h3> text) and whose value is the list of URLs. Enough talk, here is the code:

import requests
from bs4 import BeautifulSoup

def scrape():
    resp = requests.get('https://www.animeout.xyz/love-live-nijigasaki-gakuen-school-idol-doukoukai-1080p-300mb720p-150mbepisode-1/')
    soup = BeautifulSoup(resp.text,'html.parser')
    contents = soup.find('div',class_='article-content')
    output = {}
    for tag in contents.children:
        if tag.name == 'h3':
            name = tag.text
            links = []
            for sibling in tag.next_siblings:
                if sibling.name == 'p':
                    for link in sibling.find_all('a',text='Direct Download'):
                        links.append(link.get('href'))
                if sibling.name == 'h3':
                    output.update({name:links})
                    break

So far I’ve only managed to capture one key. Is there a Pythonic way to do this?


Answer

You might want to try this:

import json
import re

import requests
from bs4 import BeautifulSoup


def scrape(source_url):
    soup = BeautifulSoup(
        requests.get(source_url).text,
        'html.parser',
    )
    headers = [
        h.getText() for h in soup.find_all("h3") if "Direct" in h.getText()
    ]
    links = [
        anchor["href"] for anchor
        in soup.find_all(lambda t: t.name == "a" and "Direct" in t.text)
    ]
    return {
        header: [
            link for link in links
            if re.search(r"\d{3,4}p", header).group(0) in link
        ] for header in headers
    }


data = scrape("https://www.animeout.xyz/love-live-nijigasaki-gakuen-school-idol-doukoukai-1080p-300mb720p-150mbepisode-1/")

print(json.dumps(data, indent=2))


The reason you end up with only one key is that dictionary keys have to be unique, but the names of the links are not. Replace the name with something unique, for example an index number, or the series title combined with the resolution.
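If you would rather keep your original approach of walking the div’s children, a minimal sketch (the HTML structure is assumed from your description: each <h3> followed by <p> tags containing "Direct Download" anchors) could track the most recent <h3> in a single pass, which also keeps the final group instead of dropping it when no further <h3> follows:

```python
from bs4 import BeautifulSoup


def group_links(html):
    """Map each <h3> heading to the 'Direct Download' hrefs that follow it."""
    soup = BeautifulSoup(html, 'html.parser')
    contents = soup.find('div', class_='article-content')
    output = {}
    name = None
    for tag in contents.children:
        if tag.name == 'h3':
            # Start a new group. Note: if two <h3> tags have identical
            # text, the later one overwrites the earlier group, which is
            # exactly the duplicate-key problem described above.
            name = tag.text
            output[name] = []
        elif tag.name == 'p' and name is not None:
            for link in tag.find_all('a', string='Direct Download'):
                output[name].append(link.get('href'))
    return output
```

You would call it with the fetched page, e.g. `group_links(requests.get(source_url).text)`. A single forward pass avoids the nested `next_siblings` loop entirely, so there is no "break on the next <h3>" condition that the last section can fail to reach.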

Sample output:

{
  "Love Live! Nijigasaki Gakuen School Idol Doukoukai (main) Direct Download Links (300MB – 1080p)(Encoded)": [
    "http://nimbus.animeout.com/series/00RAPIDBOT/Love Live Nijigasaki Gakuen School Idol Doukoukai/[AnimeOut] Love Live Nijigasaki Gakuen School Idol Doukoukai - 01 [1080pp][1080pp][Erai-raws][RapidBot].mkv",
    "http://nimbus.animeout.com/series/00RAPIDBOT/Love Live Nijigasaki Gakuen School Idol Doukoukai/[AnimeOut] Love Live Nijigasaki Gakuen School Idol Doukoukai - 01 [v2][1080pp][1080pp][Erai-raws][RapidBot].mkv",
    "http://nimbus.animeout.com/series/00RAPIDBOT/Love Live Nijigasaki Gakuen School Idol Doukoukai/[AnimeOut] Love Live Nijigasaki Gakuen School Idol Doukoukai - 01 [1080pp][1080pp][Erai-raws][RapidBot].mkv",

and so on ...