Hi, please, how can I get the name of a dataset on Kaggle using Beautiful Soup, Selenium, or Scrapy? I tested this code but it returns nothing: see the picture (inspect element from Kaggle). Answer: using Selenium. Output: dataset snapshot
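A minimal sketch of the Beautiful Soup side of the question. The markup below is hypothetical (the real Kaggle page is rendered by JavaScript, so a plain HTTP fetch often contains no such tag at all, which is a common reason the original code returned nothing; Selenium drives a real browser and sees the rendered DOM instead):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet of the kind of markup a dataset page might contain.
HTML = """
<div class="dataset-header">
  <h1>Titanic - Machine Learning from Disaster</h1>
</div>
"""

def dataset_name(html):
    """Return the text of the first <h1>, or None if the tag is absent."""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1")
    return h1.get_text(strip=True) if h1 else None

print(dataset_name(HTML))
```

If `dataset_name` returns None on the live page, the content is being rendered client-side and Selenium (or Kaggle's official API) is the more reliable route.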
Tag: scrapy
Python Scrapy -> Use a scrapy spider as a function
So I have the following Scrapy spider in spiders.py. But the key aspect is that I want to call this spider as a function from another file, instead of using scrapy crawl quotes in the console. Where can I read more on this, or is this possible at all? I checked through the Scrapy documentation, but I didn't find
Passing a table name to a pipeline in Scrapy (Python)
I have different spiders that scrape similar values, and I want to store the scraped values in different sqlite3 tables. I can do this by using a different pipeline for each spider but, since the only thing that changes is the table name, would it be possible to somehow pass the table name from the spider to the pipeline? This
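One common pattern is to declare the table name as a class attribute on each spider and have a single shared pipeline read it. A sketch (the `table_name` attribute, column names, and database file are assumptions; since the table name is interpolated into SQL, it must come from trusted spider code, not from scraped data):

```python
import sqlite3

class SqlitePipeline:
    """One pipeline for all spiders; each spider sets `table_name`."""

    def open_spider(self, spider):
        # Read the table name from the spider; "items" is just a fallback.
        self.table = getattr(spider, "table_name", "items")
        self.conn = sqlite3.connect("scraped.db")
        self.conn.execute(
            f"CREATE TABLE IF NOT EXISTS {self.table} (title TEXT, price REAL)"
        )

    def process_item(self, item, spider):
        self.conn.execute(
            f"INSERT INTO {self.table} VALUES (?, ?)",
            (item["title"], item["price"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```

Each spider then only needs e.g. `table_name = "books"` alongside its `name`, and ITEM_PIPELINES lists this one pipeline for every spider.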
How to remove unwanted text when retrieving the title of a page using Python
Hi all, I have written a Python program to retrieve the title of a page. It works fine, but with some pages it also receives some unwanted text. How do I avoid that? Here is my program, and here is my output; instead of this, I am supposed to receive only this line. Please help me with some ideas; all other websites are
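The "unwanted text" is often either stray whitespace around the title or a site-name suffix appended after a separator. A sketch that handles both; the separator characters here are assumptions and should be adjusted for the site in question:

```python
import re

def clean_title(raw_title):
    """Normalize a page title: collapse whitespace, drop a site suffix."""
    # Collapse newlines/runs of spaces that sometimes surround <title> text.
    title = re.sub(r"\s+", " ", raw_title).strip()
    # Keep only the part before the first " | ", " - ", or " – " separator.
    return re.split(r"\s[|\-\u2013]\s", title)[0]

print(clean_title("  My Article Title | Example News  "))
```

For the opposite case (the wanted line comes last), index with `[-1]` instead of `[0]`.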
What is this Scrapy error: ReactorNotRestartable?
I do not understand why my spider won't run. I tested the CSS selector separately, so I do not think it is the parsing method. Traceback message: ReactorNotRestartable. Answer: with urls = "https://www.espn.com/college-football/team/_/id/52" and for url in urls:, you're going through the characters of "urls"; change it to a list. Also, you don't have a "parse_front" function, if you just didn't add it
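A sketch of the string-versus-list gotcha the answer points at. Separately, ReactorNotRestartable itself means Twisted's reactor was started a second time in one process: create and start CrawlerProcess once, never inside a loop over URLs.

```python
# A bare string is iterable, but iterating it yields single characters:
urls = "https://www.espn.com/college-football/team/_/id/52"
assert [u for u in urls][0] == "h"   # first "URL" is the letter "h"

# Wrapping the URL(s) in a list iterates whole URLs, as intended:
urls = ["https://www.espn.com/college-football/team/_/id/52"]
assert [u for u in urls] == urls
```

With the string version, Scrapy would be asked to request "h", "t", "t", "p", ..., which fails immediately; the list version feeds real URLs to the spider.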
Why is Scrapy not following all rules / running all callbacks?
I have two spiders inheriting from a parent spider class as follows: The parse_tournament_page callback for the Rule in the first spider works fine. However, the second spider only runs the parse_tournament callback from the first Rule, despite the fact that the second Rule is the same as in the first spider and is operating on the same page. I'm clearly missing
Yielding values from consecutive parallel parse functions via meta in Scrapy
In my Scrapy code I'm trying to yield the following figures from parliament's website, where all the members of parliament (MPs) are listed. Opening the links for each MP, I'm making parallel requests to get the figures I'm trying to count. I intend to yield each of the three figures below together with the name and the party of the
How to loop over multiple pages of a website using Scrapy
Hello everybody out there! I have been working with BeautifulSoup for my scraping projects. Currently, I'm learning Scrapy. I wrote code in BeautifulSoup to loop over multiple pages of a single website using for loops; I looped over 10 pages and fetched the URLs of blog posts from those pages using the code below. I want to do the
During recursive scraping in Scrapy, how do I extract info from multiple nodes of a parent URL and the associated child URLs together?
The parent URL has multiple nodes (quotes), and each parent node has a child URL (author info). I am having trouble linking the quote to the author info, due to the asynchronous nature of Scrapy. How can I fix this issue? Here's the code so far; I added a # <— comment for easy spotting. Please note that, in order to allow duplication, I added DUPEFILTER_CLASS =
Scrapy can’t find items
I am currently still learning Scrapy and trying to work with pipelines and ItemLoader. However, I currently have the problem that the spider reports that Item.py does not exist. What exactly am I doing wrong, and why am I not getting any data from the spider into my pipeline? Running the spider without importing the items works fine. The Pipeline