I have an lxml etree HTMLParser object that I'm using to build XPaths, so that I can assert on the XPath, the attributes of the matched tag, and the text of that tag. I ran into a problem when the text of the tag contains single quotes (') or double quotes ("), and I've exhausted all my options.
Here's a sample object I created:
from io import StringIO
from lxml import etree

parser = etree.HTMLParser()
tree = etree.parse(StringIO('''<html><body><p align="center">Here is my 'test' "string"</p></body></html>'''), parser)
Here is the snippet of code and then different variations of the variable being read in
def getXpath(self):
    xpath += 'starts-with(., \'' + self.text + '\') and '
    xpath += ('count(@*)=' + str(attrsCount) if self.exactMatch else "1=1") + ']'
self.text is basically the expected text of the tag, in this case: Here is my 'test' "string"
This fails when I try to use the xpath method of the parsed tree:
tree.xpath(self.getXpath())
The reason is that the XPath it gets is this: '/html/body/p[starts-with(.,'Here is my 'test' "string"') and 1=1]'
How can I properly escape the single and double quotes in the self.text variable? I've tried triple quoting, wrapping self.text in repr(), and using re.sub or string.replace to escape ' and " with \' and \".
Answer
According to Wikipedia and W3Schools, you should not have ' and " in node content, even though only < and & are said to be strictly illegal. They should be replaced by the corresponding "predefined entity references", which are &apos; and &quot;.
By the way, the Python parsers I use will take care of this transparently: when writing, they are replaced; when reading, they are converted.
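To illustrate what "transparently" means here, below is a small sketch using the standard library's xml.etree.ElementTree rather than lxml (that substitution is my assumption, chosen only so the example runs without extra packages). The element name, attribute, and text values are made up for the demonstration.

```python
import xml.etree.ElementTree as ET

# Build an element whose text and attribute contain special characters.
p = ET.Element("p", {"title": 'say "hi"'})
p.text = "a < b & c"

# Writing: the serializer replaces the special characters with entity references.
serialized = ET.tostring(p, encoding="unicode")
print(serialized)

# Reading: the parser converts the entity references back to plain characters.
parsed = ET.fromstring(serialized)
print(parsed.text)
print(parsed.get("title"))
```

The round trip returns the original strings, so the application code never has to handle the entity references itself.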
After a second reading of your question, I tested some things with ' and " in the Python interpreter, and it escapes everything for you!
>>> 'text {0}'.format('blabla "some" bla')
'text blabla "some" bla'
>>> 'ntsnts {0}'.format("ontsi'tns")
"ntsnts ontsi'tns"
>>> 'ntsnts {0}'.format("ontsi'tn' \"ntsis")
'ntsnts ontsi\'tn\' "ntsis'
So we can see that Python escapes things correctly. Could you then copy-paste the error message you get (if any)?
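If the failure turns out to come from the XPath expression itself rather than from Python's string handling, note that XPath 1.0 string literals have no escape character: a literal delimited by ' cannot contain ', and one delimited by " cannot contain ". One possible workaround, sketched below, is to build the literal with concat() when the text contains both kinds of quote. The helper name xpath_literal is hypothetical and is not part of lxml.

```python
def xpath_literal(text):
    """Return an XPath 1.0 expression that evaluates to `text` (hypothetical helper)."""
    if "'" not in text:
        # No single quotes: a single-quoted literal is safe.
        return "'" + text + "'"
    if '"' not in text:
        # No double quotes: a double-quoted literal is safe.
        return '"' + text + '"'
    # Both kinds of quote: split on ' and rejoin the pieces with concat(),
    # passing each single quote as its own double-quoted literal.
    parts = text.split("'")
    return "concat(" + ", \"'\", ".join("'" + part + "'" for part in parts) + ")"


text = "Here is my 'test' \"string\""
expr = "/html/body/p[starts-with(., " + xpath_literal(text) + ") and 1=1]"
print(expr)
```

The resulting expr can then be passed to tree.xpath(expr). lxml also accepts XPath variables as keyword arguments, for example tree.xpath('/html/body/p[starts-with(., $text)]', text=self.text), which avoids building the literal by hand.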