Python

I have the following regex to detect opening and closing script tags in an HTML file:

<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>

In short, it will match: <script "anything but </s" > "anything but </s" </script>

It works, but it takes a really long time to match <script>, even minutes or hours for long strings.

The lite version works perfectly even for long strings:

<script[^<]*>[^<]*</script>

However, I also use the extended pattern for other tags such as <a>, where < and > may also appear inside attribute values.
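To illustrate the limitation just mentioned, here is a minimal sketch. The sample strings and the name LITE are made up for illustration. It shows the lite pattern staying fast on long input but finding nothing once the script body contains a `<`:

```python
import re

# The "lite" pattern from the question; no nested quantifiers, so it stays fast.
LITE = re.compile(r'<script[^<]*>[^<]*</script>', re.I)

long_html = '<script type="text/javascript">' + ('hard example' * 50) + '</script>'
with_lt = '<script>if (a < b) {}</script>'  # made-up body containing '<'

print(LITE.search(long_html) is not None)  # True
print(LITE.search(with_lt))                # None
```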

Python test:

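import re
# Raw string for the pattern; flags are combined with | rather than +.
pattern = re.compile(r'<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>', re.I | re.DOTALL)
# Short input: returns quickly.
re.search(pattern, '11<script type="text/javascript"> easy>example</script>22').group()
# Long input: this is the slow case and may effectively never return.
re.search(pattern, '<script type="text/javascript">' + ('hard example' * 50) + '</script>').group()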

How can I fix it? The inner part of the regex (after <script>) should be changed and simplified.
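One possible direction, offered as a sketch rather than a drop-in replacement: the slowdown most likely comes from the `[^<]+` nested inside a `(...)*` group, which lets the engine split the same run of text in exponentially many ways while backtracking. Consuming one character per iteration and using a negative lookahead for `</s` avoids that. The name SCRIPT_RE is made up here, and the pattern is not exactly equivalent to the original in edge cases (for example a `<` at the very end of the input).

```python
import re

# Sketch: every loop iteration consumes exactly one character, and the two
# alternatives cannot match at the same position, so backtracking stays cheap.
SCRIPT_RE = re.compile(
    r'<script(?:[^<]|<(?!/s))*>(?:[^<]|<(?!/s))*</script>',
    re.I | re.DOTALL,
)

easy = '11<script type="text/javascript"> easy>example</script>22'
hard = '<script type="text/javascript">' + ('hard example' * 50) + '</script>'

m = SCRIPT_RE.search(easy)
print(m.group())           # <script type="text/javascript"> easy>example</script>
print(m.start(), m.end())  # 2 55
print(SCRIPT_RE.search(hard).group() == hard)  # True
```

Like the original, the first group still runs up to the last `>` before `</script>`, so it keeps the original's tolerance for `>` inside attribute values.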

PS :) I anticipate answers saying this is the wrong approach, i.e. using regex for HTML parsing. I know many HTML/XML parsers very well, and I know what to expect from often-broken HTML code; regex is really useful here.

Comment: well, I need to handle:
each <a < document like this.border="5px;">
My approach is to use parsers and regex together. BeautifulSoup is only 2k lines; it does not handle every HTML document and just extends the regexes from sgmllib.

The main reason is that I must know the exact position where every tag starts and stops, and every broken HTML document must be handled.
BS is not perfect; sometimes this happens:
BeautifulSoup('< scriPt\n\n>a<aa>s< /script>').findAll('script') == []

@Cylian: atomic grouping, as you know, is not available in Python's re.
So non-greedy everything .*? until <\s*/\s*tag\s*> is the winner at this time.

I know it is not perfect in this case: re.search('<\s*script.*?<\s*/\s*script\s*>', '< script </script> shit </script>').group(), but I can handle the rejected tail in the next parsing pass.
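The case above can be sketched as follows. This is only an illustration: the names LAZY_SCRIPT and first_script_span and the sample string are made up, and the pattern is the whitespace-tolerant non-greedy one as I read it from the comment. It returns the exact start and end offsets of the first match, so the unmatched tail can be fed into the next pass.

```python
import re

# Non-greedy body, optional whitespace around the tag name and the slash.
LAZY_SCRIPT = re.compile(r'<\s*script.*?<\s*/\s*script\s*>', re.I | re.DOTALL)

def first_script_span(html):
    """Return (start, end) of the first script element, or None."""
    m = LAZY_SCRIPT.search(html)
    return m.span() if m else None

html = '< script </script> tail </script>'
start, end = first_script_span(html)
print((start, end))      # (0, 18)
print(repr(html[end:]))  # ' tail </script>'
```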

It's pretty obvious that HTML parsing with regex is not a fight won in a single battle.

Answer

I don't know Python, but I know regular expressions:

If you use the greedy/non-greedy operators, you get a much simpler regex:

<script.*?>.*?</script>

This is assuming there are no nested scripts.
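In Python this could look like the sketch below. Two details are assumptions on my side, not part of the regex above: `re.DOTALL`, so that `.` also crosses newlines inside a script body, and `re.IGNORECASE` for upper-case tags. The name SCRIPT and the sample HTML are made up. `finditer` also gives the exact start and end position of every match.

```python
import re

SCRIPT = re.compile(r'<script.*?>.*?</script>', re.DOTALL | re.IGNORECASE)

html = 'a<script type="x">1</script>b<SCRIPT>\n2\n</SCRIPT>c'

# Each match object carries the exact offsets of the matched element.
spans = [m.span() for m in SCRIPT.finditer(html)]
print(spans)  # [(1, 28), (29, 49)]
```

Without `re.DOTALL` the second element would be missed, because its body contains newlines.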

heavy regex – really time consuming
