
Scraping Crunchbase to extract corporate news

I’m trying to scrape the news and signals tab from Crunchbase, and having no joy.

Having consulted prior threads on Stack Overflow, I have been using this code, which has worked well for all other tabs (taking Duolingo as an example):

import requests

website2 = "https://www.crunchbase.com/organization/duolingo/signals_and_news"

headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "en-US,en;q=0.5", "Accept-Encoding": "gzip, deflate", "DNT": "1", "Connection": "close", "Upgrade-Insecure-Requests": "1"}
response2 = requests.get(website2, headers=headers)

print(response2.content)

I suspect it’s something to do with how Crunchbase has coded the news section, which probably requires a tweak to my headers, but I’m not sure what I need to do.

I’d be really grateful if anyone can help. Many thanks!


Answer

It seems the news articles are loaded dynamically in the background by JavaScript, so they are not present in the HTML that your GET request returns.

If you watch the network tab of your browser's web inspector while the page loads, you can see a POST request being made to https://www.crunchbase.com/v4/data/searches/activities, and that request returns JSON data for the news articles.

You have to replicate this request in your scraper code:

import requests

headers = {
    'accept': 'application/json, text/plain, */*',
    'x-requested-with': 'XMLHttpRequest',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) QtWebEngine/5.15.2 Chrome/87.0.4280.144 Safari/537.36',
    'content-type': 'application/json',
    'accept-language': 'en-US,en;q=0.9',
}

data = {
    'field_ids': [
        'activity_properties',
        'entity_def_id',
        'identifier',
        'activity_date',
        'activity_entities',
    ],
    'limit': 10,
    'order': [],
    'query': [{
        'field_id': 'activity_entities',
        'operator_id': 'includes',
        'type': 'predicate',
        # this value is the company page id; it can be found in the HTML of the original URL
        'values': ['c999a7f8-6a98-144a-e29f-05fb6df60f73'],
    }],
}

# json= serializes the payload as a JSON body, matching the content-type header above
response = requests.post('https://www.crunchbase.com/v4/data/searches/activities', headers=headers, json=data)
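
To reuse this for other companies, the payload can be built by a small function, and the company page id can be looked up in the page HTML. The following is only a rough sketch: `find_uuid_candidates` and `build_activity_query` are hypothetical helper names, and the idea that the company id is the most frequently repeated UUID-shaped string in the HTML is an assumption, not something Crunchbase documents. Check the candidates by hand before trusting them.

```python
import re
from collections import Counter

# Matches the usual 8-4-4-4-12 hexadecimal UUID layout.
UUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def find_uuid_candidates(html):
    """Return UUID-shaped strings found in html, most frequent first.

    Assumption: the company page id is one of these. Which one is correct
    has to be verified against the request seen in the web inspector.
    """
    counts = Counter(match.lower() for match in UUID_PATTERN.findall(html))
    return [uuid for uuid, _ in counts.most_common()]


def build_activity_query(company_id, limit=10):
    """Build the same search payload as above for a given company page id."""
    return {
        "field_ids": [
            "activity_properties",
            "entity_def_id",
            "identifier",
            "activity_date",
            "activity_entities",
        ],
        "limit": limit,
        "order": [],
        "query": [{
            "field_id": "activity_entities",
            "operator_id": "includes",
            "type": "predicate",
            "values": [company_id],
        }],
    }
```

The result of `build_activity_query(...)` can then be sent with `requests.post(..., headers=headers, json=payload)`. Since the headers declare a JSON content type, passing the payload through `json=` is probably what the endpoint expects.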

For more on reverse engineering websites with this method, see my full blog post: https://scrapecrow.com/reverse-engineering-intro.html

User contributions licensed under: CC BY-SA