I'm trying to create a scraper that collects download links. I considered using regex, but that would be a nightmare, so I'm using the BeautifulSoup library instead. I want to capture the URLs inside the <p> tags that are children of div class="article-content". Each <h3> tag is the name for the URLs that follow it. I don't want to combine all the URLs into one list; instead, I use a dictionary where the key is the name (the <h3> text) and the value is the list of URLs. Here is the code:
import requests
from bs4 import BeautifulSoup

def scrape():
    resp = requests.get('https://www.animeout.xyz/love-live-nijigasaki-gakuen-school-idol-doukoukai-1080p-300mb720p-150mbepisode-1/')
    soup = BeautifulSoup(resp.text, 'html.parser')
    contents = soup.find('div', class_='article-content')
    output = {}
    for tag in contents.children:
        if tag.name == 'h3':
            name = tag.text
            links = []
            for sibling in tag.next_siblings:
                if sibling.name == 'p':
                    for link in sibling.find_all('a', text='Direct Download'):
                        links.append(link.get('href'))
                if sibling.name == 'h3':
                    output.update({name: links})
                    break
So far I have only managed to capture one key. Is there a Pythonic way to do this?
Answer
You might want to try this:
import json
import re

import requests
from bs4 import BeautifulSoup


def scrape(source_url):
    soup = BeautifulSoup(
        requests.get(source_url).text,
        'html.parser',
    )
    # Headings that introduce a group of direct download links.
    headers = [
        h.getText() for h in soup.find_all("h3")
        if "Direct" in h.getText()
    ]
    # Every anchor whose text mentions "Direct".
    links = [
        anchor["href"] for anchor
        in soup.find_all(lambda t: t.name == "a" and "Direct" in t.text)
    ]
    # Assign each link to the heading whose resolution (e.g. 1080p) it contains.
    return {
        header: [
            link for link in links
            if re.search(r"\d{3,4}p", header).group(0) in link
        ]
        for header in headers
    }


data = scrape("https://www.animeout.xyz/love-live-nijigasaki-gakuen-school-idol-doukoukai-1080p-300mb720p-150mbepisode-1/")
print(json.dumps(data, indent=2))
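To make the matching step easier to follow, here is a minimal offline sketch of how a resolution pattern such as \d{3,4}p can tie links to a heading. The heading strings, the example.com URLs, and the helper names resolution_of and links_for are made up for illustration; they are not taken from the page. Unlike the code above, this version returns an empty list when a heading has no resolution instead of raising an error.

```python
import re

# Matches resolution tokens such as "720p" or "1080p".
RESOLUTION = re.compile(r"\d{3,4}p")


def resolution_of(header):
    """Return the first resolution token in header, or None if there is none."""
    match = RESOLUTION.search(header)
    return match.group(0) if match else None


def links_for(header, links):
    """Keep only the links that contain the header's resolution token."""
    resolution = resolution_of(header)
    if resolution is None:
        return []
    return [link for link in links if resolution in link]


# Hypothetical sample data, shaped loosely like the headings and links on the page.
headers = [
    "Direct Download Links (300MB - 1080p)",
    "Direct Download Links (150MB - 720p)",
]
links = [
    "http://example.com/show-01-[1080p].mkv",
    "http://example.com/show-01-[720p].mkv",
]

grouped = {header: links_for(header, links) for header in headers}
```

Note that "300MB" is not matched, because the digits must be followed directly by the letter p.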
The reason you only get one key is that dictionary keys have to be unique, but the names of the links are not. Replace the key with something unique, for example an index number or the series title combined with the resolution.
Sample output:
{
  "Love Live! Nijigasaki Gakuen School Idol Doukoukai (main) Direct Download Links (300MB \u2013 1080p)(Encoded)": [
    "http://nimbus.animeout.com/series/00RAPIDBOT/Love Live Nijigasaki Gakuen School Idol Doukoukai/[AnimeOut] Love Live Nijigasaki Gakuen School Idol Doukoukai - 01 [1080pp][1080pp][Erai-raws][RapidBot].mkv",
    "http://nimbus.animeout.com/series/00RAPIDBOT/Love Live Nijigasaki Gakuen School Idol Doukoukai/[AnimeOut] Love Live Nijigasaki Gakuen School Idol Doukoukai - 01 [v2][1080pp][1080pp][Erai-raws][RapidBot].mkv",
    "http://nimbus.animeout.com/series/00RAPIDBOT/Love Live Nijigasaki Gakuen School Idol Doukoukai/[AnimeOut] Love Live Nijigasaki Gakuen School Idol Doukoukai - 01 [1080pp][1080pp][Erai-raws][RapidBot].mkv",
    and so on ...
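The point about unique keys can be shown without any scraping. The following is only a sketch: the (heading, links) pairs, the example.com URLs, and the helper name group_with_unique_keys are invented for illustration. It compares a plain dict, where a repeated heading overwrites the earlier entry, with keys made unique by prefixing an index number.

```python
def group_with_unique_keys(sections):
    """Build a dict from (heading, links) pairs, prefixing each key with its index."""
    grouped = {}
    for index, (heading, links) in enumerate(sections, start=1):
        key = f"{index}. {heading}"  # the index keeps repeated headings apart
        grouped[key] = list(links)
    return grouped


# Hypothetical sections that share the same heading text.
sections = [
    ("Direct Download", ["http://example.com/a-1080p.mkv"]),
    ("Direct Download", ["http://example.com/a-720p.mkv"]),
]

# A plain dict keeps only the last value stored under a repeated key.
naive = dict(sections)

# Index-prefixed keys preserve both groups.
unique = group_with_unique_keys(sections)
```

Here naive ends up with a single entry, while unique keeps one entry per heading.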