Substring any kind of HTML String

Question

I need to divide any kind of HTML code (string) into a list of tokens. For example: or or What I tried to do: My output: So I tried to split at "/>", which works for the first case. Then I tried several things. I tried to identify the "name", so the first identifier of the html str…

Accepted Answer

You will need a stack. Iterate over the string and push the position of each opening tag onto the stack. When you encounter a closing tag, assume one of the following:

1) its name matches the name of the tag beginning at the position on top of the stack
2) it is a self-closing tag

We also maintain a result list to save the parsed substrings.

For 1), pop the position on top of the stack and save the substring from that position to the end of the closing tag in the result list.

For 2), leave the stack unchanged and save only the self-closing tag substring to the result list.

After encountering any tag (opening, closing, self-closing), move the iterator (i.e. the current position pointer) forward by the length of that tag (from < to the corresponding >).

If the HTML string sliced from the iterator onward does not start with a tag, move the iterator forward by one (we crawl until we can match a tag again).

Here is my attempt:

    import re

    def split(html):
        if html == "":
            return []

        # Assumption: the opening and closing patterns are damaged in this copy,
        # so the two forms below are reconstructions. The lookbehind (?<!/) keeps
        # "<br/>" from matching as an opening tag.
        # [a-zA-Z]+ does not cover names with digits (e.g. h1); widen it if needed.
        openingTagPattern = r"<([a-zA-Z]+)(?:\s[^>]*)*(?<!/)>"
        closingTagPattern = r"</([a-zA-Z]+).*?>"
        selfClosingTagPattern = r"<([a-zA-Z]+).*?/>"

        result = []
        stack = []
        i = 0
        while i < len(html):
            match = re.match(openingTagPattern, html[i:])
            if match:  # opening tag
                stack.append(i)  # push position of start of opening tag onto stack
                i += len(match[0])
                continue
            match = re.match(closingTagPattern, html[i:])
            if match:  # closing tag
                i += len(match[0])
                if stack:  # skip a closing tag that has no opening tag
                    # pop position of start of corresponding opening tag from stack
                    result.append(html[stack.pop():i])
                continue
            match = re.match(selfClosingTagPattern, html[i:])
            if match:  # self-closing tag
                start = i
                i += len(match[0])
                result.append(html[start:i])
                continue
            i += 1  # otherwise crawl until we can match a tag
        return result  # reached the end of the string

Usage:

    delimitedList = split(""" test123 """)
    for item in delimitedList:
        print(item)

Output:

    test123

References:

The openingTagPattern is inspired by @Kobi's answer here: https://stackoverflow.com/a/1732395/12109043
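The same stack idea can also be sketched with Python's standard `html.parser` module, which finds the tags itself so no hand-written regular expressions are needed. This is only a minimal sketch, not part of the answer above: the names `ElementSplitter` and `split_elements` are made up for illustration, and it assumes the whole string is fed to the parser in one call. Unlike the regex version, it checks the tag name when closing and silently drops unclosed tags such as a bare `<br>`.

```python
from html.parser import HTMLParser


class ElementSplitter(HTMLParser):
    """Hypothetical helper: collects the source text of each closed element."""

    def __init__(self, html):
        super().__init__()
        self.html = html
        # getpos() reports (line, column); remember where each line starts
        # so that it can be turned into an absolute index into the string.
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(html) if ch == "\n"]
        self.stack = []   # (tag name, start index) of tags that are still open
        self.result = []  # substrings of finished elements, innermost first

    def _offset(self):
        lineno, col = self.getpos()
        return self.line_starts[lineno - 1] + col

    def handle_starttag(self, tag, attrs):
        # Opening tag: push its name and start position.
        self.stack.append((tag, self._offset()))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tag: save it directly, the stack is not touched.
        start = self._offset()
        self.result.append(self.html[start:start + len(self.get_starttag_text())])

    def handle_endtag(self, tag):
        if not any(name == tag for name, _ in self.stack):
            return  # stray closing tag with no matching opening tag
        # Pop until the matching opening tag; unclosed tags in between are dropped.
        while True:
            name, start = self.stack.pop()
            if name == tag:
                break
        end = self.html.index(">", self._offset()) + 1
        self.result.append(self.html[start:end])


def split_elements(html):
    parser = ElementSplitter(html)
    parser.feed(html)
    parser.close()
    return parser.result


if __name__ == "__main__":
    for item in split_elements("<div><p>a</p><br/></div>"):
        print(item)
    # prints:
    # <p>a</p>
    # <br/>
    # <div><p>a</p><br/></div>
```

Inner elements are reported before the elements that contain them, which is the same order the regex-based loop produces.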

Answer