I am trying to recover web links from an RSS page. I am using Python 3, requests, and BeautifulSoup4 on a Windows 10 system. My code is as follows:
import requests
from bs4 import BeautifulSoup

rSS = "http://www.example.com/xml/rss/all.xml"
mYHeaders = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'}
# timeout=(5, 10): 5 seconds to connect, 10 seconds to read
SourcePage = requests.get(rSS, headers=mYHeaders, timeout=(5, 10))
SourceText = SourcePage.text
soup = BeautifulSoup(SourceText, 'html.parser')
Articles = soup.find_all('item')
for i in Articles:
    Title = i.title
    Link = i.link
    Pub = i.pubdate
    print('Title: ', Title)
    print('Link: ', Link)
    print('Pub: ', Pub)
This prints out as follows:
Title: <title>There is some text here</title>
Link: <link/>
Pub: <pubdate>Sat, 06 Feb 2021 10:22:41 +0000</pubdate>
Individual items in Articles are of the following form:
<item>
<link/>https://www.example.com/news/2021/2/6/blahblah
<title>Some title text here</title>
<description><![CDATA[Some text here' and here.]]></description>
<pubdate>Sat, 06 Feb 2021 11:58:23 +0000</pubdate>
<category>News</category>
<guid ispermalink="false">https://www.example.com/?t=1234567</guid>
</item>
The problem is with <link/>, as it is not captured in the appropriate form, i.e. <link></link>.
When I open the same link (rSS above) in my browser (Firefox), the link tags are being shown correctly:
<item>
<link>
https://www.example.com/blah/blah
</link>
<title>
Some title text here.
</title>
<description>
Some description here.
</description>
<pubDate>Sun, 07 Feb 2021 08:03:48 +0000</pubDate>
<category>News</category>
<guid isPermaLink="false">https://www.example.com/?t=123456</guid>
</item>
I am guessing the problem lies in using html.parser for an XML page. If I need an XML parser instead, could you advise which one to use with Python 3? The code will run on a Raspberry Pi, but I am developing it on Windows 10.
Thanks in advance for a solution!
Answer
Since html.parser treats <link> as an empty HTML tag, <link></link> is converted into <link/> and the URL ends up as text right after the tag. You need to use .next_sibling to get the link. The code will look something like this:
for i in Articles:
    Title = i.title
    Link = i.link.next_sibling
    Pub = i.pubdate
    print('Title: ', Title)
    print('Link: ', Link)
    print('Pub: ', Pub)
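As a minimal, self-contained sketch of the same idea, the snippet below parses an inline sample item instead of fetching the feed. The sample text and the helper name extract_items are made up for illustration, and the .strip() calls are an added assumption to drop any whitespace around the values.

```python
from bs4 import BeautifulSoup

# Hypothetical sample, modelled on the feed item shown in the question.
SAMPLE_RSS = """<rss><channel>
<item>
<link>https://www.example.com/news/2021/2/6/blahblah</link>
<title>Some title text here</title>
<pubDate>Sat, 06 Feb 2021 11:58:23 +0000</pubDate>
</item>
</channel></rss>"""


def extract_items(xml_text):
    """Return a list of (title, link, pub) tuples using html.parser."""
    soup = BeautifulSoup(xml_text, 'html.parser')
    results = []
    for item in soup.find_all('item'):
        # html.parser treats <link> as an empty HTML tag, so the URL is
        # the text node that follows the tag rather than its content.
        link = str(item.link.next_sibling).strip()
        title = item.title.text.strip()
        # html.parser lowercases tag names, hence 'pubdate'.
        pub = item.pubdate.text.strip()
        results.append((title, link, pub))
    return results


if __name__ == '__main__':
    for title, link, pub in extract_items(SAMPLE_RSS):
        print('Title:', title)
        print('Link:', link)
        print('Pub:', pub)
```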
Also, if you want to get just the Title and Pub without tags, use .text (for example, i.title.text).
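On the question of which XML parser to use: one option that sidesteps the <link/> issue is the standard-library xml.etree.ElementTree module, which needs no extra install on a Raspberry Pi. The following is only a sketch under assumptions: the sample feed and the helper name parse_feed are invented here, and with a real feed you would pass in the body of the requests response instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, modelled on the item as Firefox displays it.
SAMPLE_RSS = """<rss version="2.0"><channel>
<item>
<link>https://www.example.com/blah/blah</link>
<title>Some title text here.</title>
<pubDate>Sun, 07 Feb 2021 08:03:48 +0000</pubDate>
</item>
</channel></rss>"""


def parse_feed(xml_text):
    """Return one dict per <item> with its title, link, and pubDate text."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter('item'):
        # XML tag names are case-sensitive, so 'pubDate' must match exactly.
        # findtext() returns the default when the child element is missing.
        items.append({
            'title': item.findtext('title', default='').strip(),
            'link': item.findtext('link', default='').strip(),
            'pub': item.findtext('pubDate', default='').strip(),
        })
    return items


if __name__ == '__main__':
    for entry in parse_feed(SAMPLE_RSS):
        print('Title:', entry['title'])
        print('Link:', entry['link'])
        print('Pub:', entry['pub'])
```

If you would rather stay with BeautifulSoup, passing 'xml' instead of 'html.parser' is another route, but that parser requires the lxml package to be installed.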