Web scraping a text() node in Python

I am having trouble with a web scraping function. The XPaths for the two things I am trying to get are

/html/body/div/table[2]/tbody/tr[5]/td[1]/div[1]/ul/li[1]/text()
/html/body/div/table[2]/tbody/tr[5]/td[1]/div[1]/ul/li[1]/a

The HTML is

<li><a href="http://www.acu.edu/" target="_blank" class="institution">Abilene Christian University</a> (TX)</li>

I am trying to write a function that loops through each li in tr[5]. The problem I am having is getting the text(). I have tried a number of different variations of this function:

from lxml.html import parse
from urllib2 import urlopen
def _clean(lst):
    for elm in lst:
        lnk=elm.findall('.//a')
        for this in lnk:
            lnk_txt.append(this.text_content())
        state_txt.append(elm.findall('.//text()'))

This specific function raises a KeyError on the '()'. If I remove the (), it returns a list of empty elements. The lnk_txt part works.

What I am trying to get are two lists. One holds the name of each university. The other holds the location of each university. The ultimate goal is to make tuples (name, state).

Answer

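findall() only understands the limited ElementPath syntax, which has no text() node test, so use xpath() instead. You need to find the text sibling that follows the a element (here lnk stands for a single a element, such as each `this` in your inner loop):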

lnk.xpath("following-sibling::text()")

Demo:

>>> import lxml.html
>>> data = '<li><a href="http://www.acu.edu/" target="_blank" class="institution">Abilene Christian University</a> (TX)</li>'
>>> li = lxml.html.fromstring(data)
>>> li.xpath("//a[@class='institution']/following-sibling::text()")[0].strip()
'(TX)'
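
To get the (name, state) tuples the question asks for, the same XPath can be applied to every link in a loop. The following is only a minimal sketch: the SAMPLE markup, the second list entry, and the extract_pairs helper are made up for illustration, and it assumes each state appears in parentheses right after the link, as in the HTML from the question.

```python
import lxml.html

# Hypothetical sample markup modelled on the <li> from the question.
# The second entry is invented for illustration only.
SAMPLE = """
<ul>
  <li><a href="http://www.acu.edu/" target="_blank" class="institution">Abilene Christian University</a> (TX)</li>
  <li><a href="http://example.edu/" target="_blank" class="institution">Example College</a> (NY)</li>
</ul>
"""


def extract_pairs(root):
    """Return a list of (name, state) tuples, one per a.institution under root."""
    pairs = []
    for link in root.xpath(".//a[@class='institution']"):
        name = link.text_content().strip()
        # Text nodes that follow the <a> inside the same parent <li>.
        trailing = link.xpath("following-sibling::text()")
        # Assumption: the state is written as "(XX)"; drop the parentheses.
        state = trailing[0].strip().strip("()") if trailing else ""
        pairs.append((name, state))
    return pairs


root = lxml.html.fromstring(SAMPLE)
pairs = extract_pairs(root)
print(pairs)
```

For the real page you would pass the parsed document (or the tr[5] element) to the helper instead of the sample string. lxml also stores the text that directly follows an element in its .tail attribute, so link.tail is an alternative to the following-sibling XPath here.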
User contributions licensed under: CC BY-SA