How to recover http link from a tag

I am trying to recover web links from an RSS page. I am using Python 3, requests, and BeautifulSoup4 on a Windows 10 system. My code is as follows:

import requests
from bs4 import BeautifulSoup

rSS = "http://www.example.com/xml/rss/all.xml"
mYHeaders = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'}
# timeout=(5, 10): 5 s to connect, 10 s to read
SourcePage = requests.get(rSS, headers=mYHeaders, timeout=(5, 10))
SourceText = SourcePage.text
soup = BeautifulSoup(SourceText, 'html.parser')
Articles = soup.findAll('item')
for i in Articles:
    Title = i.title
    Link = i.link
    Pub = i.pubdate
    print('Title: ', Title)
    print('Link: ', Link)
    print('Pub: ', Pub)

This prints out as follows:

Title:  <title>There is some text here</title>
Link:  <link/>
Pub:  <pubdate>Sat, 06 Feb 2021 10:22:41 +0000</pubdate>

Individual items in Articles are of the following form:

<item>
<link/>https://www.example.com/news/2021/2/6/blahblah
                <title>Some title text here</title>
<description><![CDATA[Some text here' and here.]]></description>
<pubdate>Sat, 06 Feb 2021 11:58:23 +0000</pubdate>
<category>News</category>
<guid ispermalink="false">https://www.example.com/?t=1234567</guid>
</item>

The problem is with

<link/>

as it is not captured in the expected form, i.e.

<link>...</link>

When I open the same link (rSS above) in my browser (Firefox), the link tags are being shown correctly:

<item>
<link>
https://www.example.com/blah/blah
</link>
<title>
Some title text here.
</title>
<description>
Some description here.
</description>
<pubDate>Sun, 07 Feb 2021 08:03:48 +0000</pubDate>
<category>News</category>
<guid isPermaLink="false">https://www.example.com/?t=123456</guid>
</item>

I am guessing the problem lies in using html.parser on an XML page. If I need an XML parser instead, which one should I use with Python 3? The code will run on a Raspberry Pi, but I am developing it on Windows 10.

Thanks in advance for a solution!

Answer

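html.parser treats <link> as a void (self-closing) HTML element, so <link>...</link> is parsed as an empty <link/> tag and the URL ends up as the text node that follows it. Use .next_sibling on the link tag to get the URL. The code will look something like this: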

...
for i in Articles:
    Title = i.title
    Link = i.link.next_sibling
    Pub = i.pubdate
    print('Title: ', Title)
    print('Link: ', Link)
    print('Pub: ', Pub)

Also, to get just the title and publication date without the surrounding tags, use .text (for example, i.title.text and i.pubdate.text).
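
On the question of which XML parser to use: one option that needs no extra installation (handy on a Raspberry Pi) is xml.etree.ElementTree from the Python standard library. BeautifulSoup can also parse XML if you pass 'xml' instead of 'html.parser', but that requires the lxml package. Below is a minimal sketch of the ElementTree approach. The SAMPLE_RSS string and the parse_items helper are hypothetical names made up for illustration, modelled on the item shown in the question. It assumes a plain RSS 2.0 layout without XML namespaces on the item fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, modelled on the <item> shown in the question.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <link>https://www.example.com/news/2021/2/6/blahblah</link>
      <title>Some title text here</title>
      <description><![CDATA[Some text here' and here.]]></description>
      <pubDate>Sat, 06 Feb 2021 11:58:23 +0000</pubDate>
      <category>News</category>
      <guid isPermaLink="false">https://www.example.com/?t=1234567</guid>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return a list of dicts with the title, link and pubDate of each <item>."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter('item'):
        items.append({
            # XML tag names are case-sensitive, so it is 'pubDate', not 'pubdate'.
            'title': item.findtext('title', default='').strip(),
            'link': item.findtext('link', default='').strip(),
            'pub': item.findtext('pubDate', default='').strip(),
        })
    return items


articles = parse_items(SAMPLE_RSS)
for article in articles:
    print('Title:', article['title'])
    print('Link:', article['link'])
    print('Pub:', article['pub'])
```

With a real feed, you would probably pass SourcePage.content (bytes) rather than SourcePage.text to ET.fromstring, so that the parser can honour the encoding declared in the XML itself.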
