I am trying to extract web links from an RSS feed. I am using Python 3, requests, and BeautifulSoup4 on a Windows 10 system. My code is as follows:
import requests
from bs4 import BeautifulSoup

rSS = "http://www.example.com/xml/rss/all.xml"
mYHeaders = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'}
SourcePage = requests.get(rSS, headers=mYHeaders, timeout=(5, 10))
SourceText = SourcePage.text
soup = BeautifulSoup(SourceText, 'html.parser')
Articles = soup.findAll('item')
for i in Articles:
    Title = i.title
    Link = i.link
    Pub = i.pubdate
    print('Title: ', Title)
    print('Link: ', Link)
    print('Pub: ', Pub)
This prints out as follows:
Title: <title>There is some text here</title>
Link: <link/>
Pub: <pubdate>Sat, 06 Feb 2021 10:22:41 +0000</pubdate>
Individual items in Articles are of the following form:
<item>
<link/>https://www.example.com/news/2021/2/6/blahblah
<title>Some title text here</title>
<description><![CDATA[Some text here' and here.]]></description>
<pubdate>Sat, 06 Feb 2021 11:58:23 +0000</pubdate>
<category>News</category>
<guid ispermalink="false">https://www.example.com/?t=1234567</guid>
</item>
The problem is with <link/>, as it is not captured in the appropriate form, i.e. <link>...</link>.
When I open the same link (rSS above) in my browser (Firefox), the link tags are being shown correctly:
<item>
  <link>https://www.example.com/blah/blah</link>
  <title>Some title text here.</title>
  <description>Some description here.</description>
  <pubDate>Sun, 07 Feb 2021 08:03:48 +0000</pubDate>
  <category>News</category>
  <guid isPermaLink="false">https://www.example.com/?t=123456</guid>
</item>
I am guessing the problem lies in using html.parser for an XML page. If I need an XML parser instead, could you tell me which one to use with Python 3? The code will run on a Raspberry Pi, but I am developing it on Windows 10.
Thanks in advance for a solution!
Answer
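html.parser treats <link> as an empty (void) HTML element, so the <link></link> tag is converted into <link/> and the URL ends up as the text node that follows it. You therefore need to use .next_sibling to get the link. The code will look something like this: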
...
for i in Articles:
    Title = i.title
    Link = i.link.next_sibling
    Pub = i.pubdate
    print('Title: ', Title)
    print('Link: ', Link)
    print('Pub: ', Pub)
Also, if you want to get just the Title and Pub without tags, use .text.
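On the question of which XML parser to use: one option that needs no extra install on either Windows or a Raspberry Pi is the standard library's xml.etree.ElementTree. The following is only a minimal sketch, not a drop-in replacement for the code above. The SAMPLE_RSS string and the extract_items helper are hypothetical names made up for illustration, and the sample feed is modeled on the item shown in the question rather than fetched from the real URL.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, modeled on the <item> shown in the question.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>https://www.example.com/</link>
    <item>
      <link>https://www.example.com/news/2021/2/6/blahblah</link>
      <title>Some title text here</title>
      <description><![CDATA[Some text here' and here.]]></description>
      <pubDate>Sat, 06 Feb 2021 11:58:23 +0000</pubDate>
      <category>News</category>
      <guid isPermaLink="false">https://www.example.com/?t=1234567</guid>
    </item>
  </channel>
</rss>
"""


def extract_items(rss_text):
    """Return one dict per <item> with its title, link and publication date."""
    root = ET.fromstring(rss_text)
    items = []
    # iter("item") walks the whole tree, so the channel-level <link> is skipped.
    for item in root.iter("item"):
        items.append({
            # XML tag names are case-sensitive: it is pubDate, not pubdate.
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            "pub": item.findtext("pubDate", default="").strip(),
        })
    return items


for entry in extract_items(SAMPLE_RSS):
    print("Title:", entry["title"])
    print("Link:", entry["link"])
    print("Pub:", entry["pub"])
```

With requests, the same helper could be called as extract_items(SourcePage.content), passing the raw bytes so the parser can honor the encoding declared in the feed. If you would rather stay with BeautifulSoup, passing 'xml' instead of 'html.parser' also keeps <link> intact, but that parser requires the lxml package to be installed.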