
Generate a Python regex at runtime to match numbers from ‘n’ to infinity

I am using Scrapy to crawl a website and extract data from it. Scrapy uses regex-based rules to decide whether a page has to be parsed or a link has to be followed.

I am implementing a resume feature for my spider so that it can continue crawling from the last visited page. For this, I get the last followed link from a database when the spider is launched.

My site URLs look like http://foobar.com/page1.html, so the rule’s regex to follow every link like this would usually be something like /page\d+\.html.

But how can I write a regex so it would match, for example, page 15 and more? Also, as I don’t know the starting point in advance, how could I generate this regex at runtime?


Answer

Try this:

def digit_match_greater(n):
    digits = str(n)
    variations = []
    # Anything with more than len(digits) digits is a match:
    variations.append(r"\d{%d,}" % (len(digits)+1))
    # Now match numbers with len(digits) digits.
    # (Generate, e.g., for 15, "1[6-9]", "[2-9]\d")
    # 9s can be skipped -- e.g. for >19 we only need [2-9]\d.
    for i, d in enumerate(digits):
        if d != "9":
            pattern = list(digits)
            pattern[i] = "[%d-9]" % (int(d) + 1)
            for j in range(i+1, len(digits)):
                pattern[j] = r"\d"
            variations.append("".join(pattern))
    return "(?:%s)" % "|".join("(?:%s)" % v for v in variations)

It turned out easier to make it match numbers greater than the parameter, so if you give it 15, it’ll return a string for matching numbers 16 and greater, specifically…

(?:(?:\d{3,})|(?:[2-9]\d)|(?:1[6-9]))

You can then substitute this into your expression instead of d+, like so:

import re
exp = re.compile(r"page%s\.html" % digit_match_greater(last_page_visited))
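Because the generated pattern is strictly "greater than", resuming from page n inclusive means passing n - 1. The sketch below restates the function with the `\d` escapes written out and checks it by brute force. `page_pattern_from` and `mismatches` are hypothetical helper names, not part of Scrapy, and the check assumes page numbers start at 1 and are written without leading zeros.

```python
import re


def digit_match_greater(n):
    """Return a regex fragment matching decimal integers strictly greater than n."""
    digits = str(n)
    # Any number with more digits than n is larger.
    variations = [r"\d{%d,}" % (len(digits) + 1)]
    # Same length: keep a prefix, raise one digit, leave the remaining digits free.
    for i, d in enumerate(digits):
        if d != "9":
            pattern = list(digits)
            pattern[i] = "[%d-9]" % (int(d) + 1)
            pattern[i + 1:] = [r"\d"] * (len(digits) - i - 1)
            variations.append("".join(pattern))
    return "(?:%s)" % "|".join("(?:%s)" % v for v in variations)


def page_pattern_from(first_page):
    """Hypothetical helper: match pageN.html for N >= first_page (first_page >= 1)."""
    if first_page < 1:
        raise ValueError("first_page must be at least 1")
    return re.compile(r"page%s\.html" % digit_match_greater(first_page - 1))


def mismatches(first_page, upto=2000):
    """Brute-force check: list page numbers that the pattern classifies wrongly."""
    exp = page_pattern_from(first_page)
    return [
        k for k in range(upto)
        if bool(exp.search("http://foobar.com/page%d.html" % k)) != (k >= first_page)
    ]
```

Calling `mismatches(15)` should return an empty list if the pattern accepts exactly pages 15 and up. Zero-padded numbers such as page015.html are outside what this sketch handles.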
User contributions licensed under: CC BY-SA