Webscraping Dynamic Website to Pull Recent News Article URLs




I am attempting to pull investing news articles from a dynamic website using Python. I have tried a couple of tutorials that worked for static websites, but I have had issues pulling the URL to a specific article. The code I am working with is as follows:

    from requests_html import HTMLSession
    session = HTMLSession()
    
    r = session.get('https://www.institutionalinvestor.com/search?'
    'term=&' # eventually, the term would include the words I am actively searching for
    'filters=%7B"dates":%5B"last%20week"%5D%7D') # filter to the last week, this would eventually be for the last 24 hours only

    r.html.absolute_links

Which gets me the links within the page as a set:

{'https://www.institutionalinvestor.com/Login', 'https://www.institutionalinvestor.com/display-advertising', 'http://www.ttivanguard.com/', 'https://www.riaintel.com/', 'http://interactive.institutionalinvestor.com/executive-IR-research-em/about-586KX-2742AB.html', 'https://twitter.com/iimag', 'https://myaccount.institutionalinvestor.com/Orders/SelectPackage.html', 'https://www.institutionalinvestor.com/', 'https://www.institutionalinvestor.com/Corner-Office', 'https://www.institutionalinvestor.com/Management', 'http://iimemberships.com/', 'http://www.iiconferences.com/', 'https://www.institutionalinvestor.com/Register', 'https://www.institutionalinvestor.com/cookies', 'https://www.institutionalinvestor.com/Careers', 'https://www.institutionalinvestor.com/Custom-Research', 'https://www.institutionalinvestor.com/Portfolio', 'https://www.euromoneyplc.com/modern-slavery-act-transparency-statement', 'https://www.institutionalinvestor.com/research', 'https://www.institutionalinvestor.com/Masthead', 'https://www.institutionalinvestor.com/about-thought-leadership', 'https://www.institutionalinvestor.com/Investors', 'https://www.institutionalinvestor.com/Premium', 'https://www.institutionalinvestor.com/about-us', 'https://www.institutionalinvestor.com/thought-leadership', 'https://www.institutionalinvestor.com/PrivacyPolicy', 'https://www.institutionalinvestor.com/sponsored', 'https://www.institutionalinvestor.com/Video', 'https://www.institutionalinvestor.com/How-to-Pitch-Institutional-Investor', 'https://www.institutionalinvestor.com/FAQs', 'https://www.institutionalinvestor.com/Research-FAQs', 'https://www.institutionalinvestor.com/Reprints', 'https://www.institutionalinvestor.com/TermsConditions', 'https://www.linkedin.com/company/164389', 'https://www.facebook.com/iimag', 'https://www.institutionalinvestor.com/Customer-Service', 'https://www.institutionalinvestor.com/Culture', 'https://www.institutionalinvestor.com/awards', 
'https://www.institutionalinvestor.com/Research-Insight', 'http://www.sovereignwealthcenter.com/'}

But I cannot find the links to the articles themselves. When I inspect the source code, this is what I see:

    <div class="search-results" role="listbox">
        <article class="search-result" ng-repeat="article in serverData.hits.results">
            <div class="search-result-text-ghost"></div>
            <h2 ng-class="article|publicationClass"><a ng-href="{{article|articleHref}}">{{article|snippet:'title'|removeHtmlTags}}</a>
            </h2>

As someone relatively new to HTML, that h2 section towards the end leads me to believe that the site is dynamic, which is where I am stuck. Any help would be appreciated. My ideal output for this question is a dataframe containing the title of the article, the source (in this case “Institutional Investor”), a preview of the article (the first couple of lines or so), and the URL for the article. The dataframe would be sent to me each morning to save the time I would otherwise spend manually pulling news. I have put together the rest of the project, apart from the news pull for sites such as Institutional Investor that are not covered by an API I am using.

I am open to any and all new methods, if necessary or recommended. Thank you in advance!
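A quick way to check the suspicion about the h2 markup is to look for unrendered template placeholders in the raw HTML. The sketch below assumes the page uses AngularJS-style `{{ ... }}` interpolation, as the snippet above suggests; the function name `find_unrendered_placeholders` is made up for illustration. If the HTML returned by a plain HTTP request still contains such placeholders, the article links are filled in later by JavaScript and will not appear in `r.html.absolute_links`.

```python
import re
from typing import List

# Matches AngularJS-style interpolation such as {{article|articleHref}}.
PLACEHOLDER = re.compile(r"\{\{.*?\}\}")


def find_unrendered_placeholders(html: str) -> List[str]:
    """Return every {{ ... }} template expression left in the raw HTML."""
    return PLACEHOLDER.findall(html)


# The h2 element from the page source shown in the question.
raw_html = (
    '<h2 ng-class="article|publicationClass">'
    '<a ng-href="{{article|articleHref}}">'
    "{{article|snippet:'title'|removeHtmlTags}}</a></h2>"
)

placeholders = find_unrendered_placeholders(raw_html)
print(placeholders)
```

A non-empty result means the server sent a template rather than finished content, so the data has to come from a rendered page or from whatever endpoint the page itself queries.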

Answer

Yeah, it is dynamic. You could use Selenium to let the page render first, then pull out the HTML like you'd normally do with a static site. Or, it's all there in their API (I think even the full article is in there too, but I just pulled out what you asked for):

    import json

    import pandas as pd
    import requests

    # The site's search API
    api = 'https://search.euromoneyapi.com/api/Search'

    headers = {'content-type': 'application/json',
               'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}

    # Ask for the 10 most recent results, newest first
    payload = {"site": "amg_ii",
               "suggester": True,
               "from": 0,
               "size": 10,
               "sort": "dates",
               "sort_order": "desc"}

    jsonData = requests.post(api, headers=headers, data=json.dumps(payload)).json()

    rows = []
    articles = jsonData['hits']['results']
    for article in articles:
        title = article['snippet']['title'][0]
        source = 'https://www.institutionalinvestor.com/'
        try:
            preview = article['snippet']['description'][0]
        except (KeyError, IndexError, TypeError):
            # Some results come back without a description
            preview = ''
        # Article URL = base + last segment of the id + URL-friendly title
        url = 'https://www.institutionalinvestor.com/article/' + article['id'].split('/')[-1] + '/' + article['fields']['url_title'][0]

        row = {'title': title,
               'source': source,
               'preview': preview,
               'url': url}
        rows.append(row)

    df = pd.DataFrame(rows)
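To make the per-article logic easier to test without hitting the network, the loop body can be pulled into a small function. This is only a sketch: `article_to_row` is a made-up name, and the `sample` dict is a hypothetical result that mimics the keys used above (`id`, `snippet`, `fields`), not real API output.

```python
BASE_URL = 'https://www.institutionalinvestor.com/'


def article_to_row(article):
    """Turn one search result into a row with title, source, preview and url."""
    snippet = article['snippet']
    # A missing or empty description falls back to an empty preview.
    preview = (snippet.get('description') or [''])[0]
    article_id = article['id'].split('/')[-1]
    slug = article['fields']['url_title'][0]
    return {
        'title': snippet['title'][0],
        'source': BASE_URL,
        'preview': preview,
        'url': BASE_URL + 'article/' + article_id + '/' + slug,
    }


# Hypothetical result shaped like the entries in jsonData['hits']['results'].
sample = {
    'id': 'some/path/b1example',
    'snippet': {'title': ['Example Title']},
    'fields': {'url_title': ['Example-Title']},
}

row = article_to_row(sample)
print(row)
```

With a helper like this, the rows could be built with `rows = [article_to_row(a) for a in jsonData['hits']['results']]` before passing them to `pd.DataFrame`.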

Output:

print (df.to_string())
                                                                       title                                  source                                                                                                                                                                        preview                                                                                                                                     url
0                                                            Who’s on Third?  https://www.institutionalinvestor.com/                                                                  Third-party claims filing service providers require due diligence for shareholder litigation outside the U.S                                                              https://www.institutionalinvestor.com/article/b1pqxvgpm3dwjb/Who-s-on-Third
1                      First the Cyberattack Hits. Then the Insider Trading.  https://www.institutionalinvestor.com/                                                                                         Researchers share their striking evidence of pre-disclosure spikes in options trading.                        https://www.institutionalinvestor.com/article/b1pzfhkhcv70m1/First-the-Cyberattack-Hits-Then-the-Insider-Trading
2                         Hedge Funds Featured Prominently in 2020 SPAC Boom  https://www.institutionalinvestor.com/  Nearly 13 percent of the blank check companies that filed plans to go public in 2020 were sponsored by hedge fund firms or individuals formerly associated with the industry.                         https://www.institutionalinvestor.com/article/b1pzg04d0bbvxz/Hedge-Funds-Featured-Prominently-in-2020-SPAC-Boom
3                            The Stocks That Drove Glenview’s Major Comeback  https://www.institutionalinvestor.com/                                                             Larry Robbins’ hedge fund finished 2020 solidly positive thanks to huge gains in the final two months of the year.                            https://www.institutionalinvestor.com/article/b1pzf7qb428t3x/The-Stocks-That-Drove-Glenview-s-Major-Comeback
4                                         Bill Ackman’s Billion-Dollar Year  https://www.institutionalinvestor.com/                                                                                                     A big short and a big SPAC fueled hefty gains for Pershing Square in 2020.                                          https://www.institutionalinvestor.com/article/b1pzgx69sxhstk/Bill-Ackman-s-Billion-Dollar-Year
5                              Ex-Verger Interns Make NFL, ‘Bachelor’ Debuts  https://www.institutionalinvestor.com/                                                                  Verger Capital Management CIO Jim Dunn shared the inside story on former interns John Wolford and Matt James.                                 https://www.institutionalinvestor.com/article/b1pzg3qjq9xt5x/Ex-Verger-Interns-Make-NFL-Bachelor-Debuts
6  David Einhorn’s Greenlight Capital Pulls Off a Coup in the Fourth Quarter  https://www.institutionalinvestor.com/                                                                                          The manager turned in a strong fourth quarter by sticking with his biggest positions.  https://www.institutionalinvestor.com/article/b1pyl5mtkmpt80/David-Einhorn-s-Greenlight-Capital-Pulls-Off-a-Coup-in-the-Fourth-Quarter
7                                             Gold's 2020 Ride Explained  https://www.institutionalinvestor.com/                                                                                                                                                                                                                               https://www.institutionalinvestor.com/article/b1psmn58mppsyj/gold39s-2020-ride-explained
8                                     The ARK Invest Takeover Battle Is Over  https://www.institutionalinvestor.com/                                                                                A new deal has “extinguished” Resolute’s option to acquire an additional stake in the ETF firm.                                     https://www.institutionalinvestor.com/article/b1pw88ldyr905m/The-ARK-Invest-Takeover-Battle-Is-Over
9                           Investors Quickly Saw Big Gains From These SPACs  https://www.institutionalinvestor.com/                                                                                                      At least two blank-check companies surged on recent merger announcements.                           https://www.institutionalinvestor.com/article/b1pt6fl7c9dsqc/Investors-Quickly-Saw-Big-Gains-From-These-SPACs
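Row 7 shows that some titles come back with HTML entities (`&#39;` instead of an apostrophe). If that matters for the morning email, the standard library's `html.unescape` can decode them. A minimal sketch, where `clean_title` is a made-up helper name:

```python
import html


def clean_title(raw_title):
    """Decode HTML entities such as &#39; and trim surrounding whitespace."""
    return html.unescape(raw_title).strip()


cleaned = clean_title('Gold&#39;s 2020 Ride Explained')
print(cleaned)  # Gold's 2020 Ride Explained
```

Applied to the dataframe above, this would be something like `df['title'] = df['title'].map(clean_title)`.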


Source: stackoverflow