I want to scrape data from a website concurrently, but I found that the following program is NOT executed concurrently.
import asyncio

import requests
from bs4 import BeautifulSoup


async def return_soup(url):
    r = requests.get(url)
    r.encoding = "utf-8"
    soup = BeautifulSoup(r.text, "html.parser")

    future = asyncio.Future()
    future.set_result(soup)
    return future


async def parseURL_async(url):
    print("Started to download {0}".format(url))
    soup = await return_soup(url)
    print("Finished downloading {0}".format(url))
    return soup


loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)

t = [parseURL_async(url_1), parseURL_async(url_2)]
loop.run_until_complete(asyncio.gather(*t))
This program starts to download the second page only after the first one finishes. If my understanding is correct, the await keyword in await return_soup(url) waits for the function to complete, and while waiting it yields control back to the event loop, which should allow the loop to start the second download. Once the function finally finishes executing, the future instance inside it receives the result value.
But why does this not work concurrently? What am I missing here?
Answer
Using asyncio is different from using threads in that you cannot add it to an existing code base to make it concurrent. Specifically, code that runs in the asyncio event loop must not block: all blocking calls must be replaced with non-blocking versions that yield control to the event loop. In your case, requests.get blocks, and that defeats the concurrency asyncio is meant to provide.
To avoid this problem, you need to use an HTTP library that is written with asyncio in mind, such as aiohttp.
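As a rough illustration, here is a minimal sketch of how your snippet might look with aiohttp, assuming aiohttp is installed and url_1 and url_2 are defined as in your code:

import asyncio

import aiohttp
from bs4 import BeautifulSoup


async def return_soup(session, url):
    # session.get is non-blocking: awaiting it suspends this coroutine and
    # hands control back to the event loop so the other download can run.
    async with session.get(url) as r:
        text = await r.text(encoding="utf-8")
    return BeautifulSoup(text, "html.parser")


async def parseURL_async(session, url):
    print("Started to download {0}".format(url))
    soup = await return_soup(session, url)
    print("Finished downloading {0}".format(url))
    return soup


async def main(urls):
    # Share one session for all requests; gather schedules both coroutines
    # so the downloads overlap.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(parseURL_async(session, u) for u in urls))


soups = asyncio.run(main([url_1, url_2]))

Because the awaits now suspend instead of blocking, both parseURL_async calls can be in flight at the same time, and the "Started"/"Finished" messages should interleave.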