
The tasks from asyncio.gather do not run concurrently

I want to scrape data from a website concurrently, but I found that the following program is NOT executed concurrently.

import asyncio

import requests
from bs4 import BeautifulSoup

async def return_soup(url):
    # requests.get is a regular synchronous call
    r = requests.get(url)
    r.encoding = "utf-8"
    soup = BeautifulSoup(r.text, "html.parser")

    # wrap the parsed soup in an already-completed Future
    future = asyncio.Future()
    future.set_result(soup)
    return future

async def parseURL_async(url):
    print("Started to download {0}".format(url))
    soup = await return_soup(url)
    print("Finished downloading {0}".format(url))

    return soup

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
t = [parseURL_async(url_1), parseURL_async(url_2)]
loop.run_until_complete(asyncio.gather(*t))

However, this program starts downloading the second page only after the first one finishes. If my understanding is correct, the await keyword in await return_soup(url) waits for the function to complete, and while waiting it hands control back to the event loop, which should let the loop start the second download.

And once the function finally finishes executing, the future instance inside it receives the result value.

But why does this not work concurrently? What am I missing here?


Answer

Using asyncio is different from using threads in that you cannot add it to an existing code base to make it concurrent. Specifically, code that runs in the asyncio event loop must not block – all blocking calls must be replaced with non-blocking versions that yield control to the event loop. In your case, requests.get blocks and defeats the concurrency that asyncio provides.
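
To see the difference, here is a minimal, self-contained sketch (not part of your code) that runs two tasks with asyncio.gather, once with a blocking time.sleep and once with a non-blocking asyncio.sleep; the timings are approximate:

import asyncio
import time

async def blocking_task(name):
    # time.sleep blocks the event loop, so the two tasks run one after the other (~2 s total)
    time.sleep(1)
    print("finished", name)

async def non_blocking_task(name):
    # asyncio.sleep yields to the event loop, so the two tasks overlap (~1 s total)
    await asyncio.sleep(1)
    print("finished", name)

async def main():
    start = time.perf_counter()
    await asyncio.gather(blocking_task("a"), blocking_task("b"))
    print("blocking took", round(time.perf_counter() - start, 1), "s")

    start = time.perf_counter()
    await asyncio.gather(non_blocking_task("a"), non_blocking_task("b"))
    print("non-blocking took", round(time.perf_counter() - start, 1), "s")

asyncio.run(main())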

To avoid this problem, you need to use an HTTP library that is written with asyncio in mind, such as aiohttp.
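
As a rough sketch (assuming Python 3.7+, and that url_1 and url_2 are the same placeholders as in your question), the code above could look like this with aiohttp:

import asyncio

import aiohttp
from bs4 import BeautifulSoup

async def return_soup(session, url):
    # session.get suspends instead of blocking, so other downloads can proceed
    async with session.get(url) as response:
        text = await response.text(encoding="utf-8")
    return BeautifulSoup(text, "html.parser")

async def parseURL_async(session, url):
    print("Started to download {0}".format(url))
    soup = await return_soup(session, url)
    print("Finished downloading {0}".format(url))
    return soup

async def main():
    # a single ClientSession is shared by both downloads
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            parseURL_async(session, url_1),
            parseURL_async(session, url_2),
        )

soups = asyncio.run(main())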
