Web scraping text() in Python

I am having trouble with a web scraping function. The XPaths for the two things I am trying to get are:

/html/body/div/table[2]/tbody/tr[5]/td[1]/div[1]/ul/li[1]/text()
/html/body/div/table[2]/tbody/tr[5]/td[1]/div[1]/ul/li[1]/a

The HTML is:

<li><a href="http://www.acu.edu/" target="_blank" class="institution">Abilene Christian University</a> (TX)</li>

I am trying to write a function that loops through each li in tr[5]. The problem is getting the text(). I have tried a number of variations of this function:

from lxml.html import parse
from urllib2 import urlopen
def _clean(lst):
    for elm in lst:
        lnk=elm.findall('.//a')
        for this in lnk:
            lnk_txt.append(this.text_content())
        state_txt.append(elm.findall('.//text()'))

This specific function raises a KeyError on the '()'. If I remove the (), it returns a list of empty elements. The lnk_txt part works.

What I am trying to get are two lists. One holds the name of each university, the other its location. The ultimate goal is to make tuples (name, state).

Answer

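findall() only accepts a limited path syntax, which has no text() step; use xpath() instead. You need the text node that follows the a element as a sibling. Call this on each a element (the loop variable this in your code, not the lnk list):

this.xpath("following-sibling::text()")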

Demo:

>>> import lxml.html
>>> data = '<li><a href="http://www.acu.edu/" target="_blank" class="institution">Abilene Christian University</a> (TX)</li>'
>>> li = lxml.html.fromstring(data)
>>> li.xpath("//a[@class='institution']/following-sibling::text()")[0].strip()
'(TX)'
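
To get from this to the (name, state) tuples the question asks for, something like the following sketch might work. It is not the asker's final code: the helper name extract_pairs, the SAMPLE markup, and the second example.edu entry are made up for illustration, and stripping the parentheses from the state is an assumption about the desired output.

```python
import lxml.html

# Hypothetical sample markup modeled on the <li> from the question;
# the second entry is invented purely for illustration.
SAMPLE = """
<ul>
  <li><a href="http://www.acu.edu/" target="_blank" class="institution">Abilene Christian University</a> (TX)</li>
  <li><a href="http://www.example.edu/" target="_blank" class="institution">Example State University</a> (CA)</li>
</ul>
""".strip()


def extract_pairs(root):
    """Return a list of (name, state) tuples, one per institution link."""
    pairs = []
    for link in root.xpath(".//a[@class='institution']"):
        # The link text is the university name.
        name = link.text_content().strip()
        # The text node right after </a> holds the state, e.g. " (TX)".
        trailing = link.xpath("following-sibling::text()")
        # Assumption: the parentheses are not wanted in the result.
        state = trailing[0].strip().strip("()") if trailing else ""
        pairs.append((name, state))
    return pairs


root = lxml.html.fromstring(SAMPLE)
pairs = extract_pairs(root)
print(pairs)
```

In lxml the text that follows an element's closing tag is also available as the element's tail attribute, so link.tail would be another way to read the same " (TX)" string here.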
