I am using Scrapy to crawl a website and extract data from it. Scrapy uses regex-based rules to decide whether a page has to be parsed or a link has to be followed.
I am implementing a resume feature for my spider, so that it can continue crawling from the last visited page. For this, I get the last followed link from a database when the spider is launched.
My site's URLs look like http://foobar.com/page1.html, so the rule’s regex to follow every link like this would usually be something like /page\d+\.html.
But how can I write a regex so it would match, for example, page 15 and more? Also, as I don’t know the starting point in advance, how could I generate this regex at runtime?
Answer
Try this:
def digit_match_greater(n):
    digits = str(n)
    variations = []
    # Anything with more than len(digits) digits is a match:
    variations.append(r"\d{%d,}" % (len(digits) + 1))
    # Now match numbers with len(digits) digits.
    # (Generate, e.g., for 15, "1[6-9]", "[2-9]\d")
    # 9s can be skipped -- e.g. for >19 we only need [2-9]\d.
    for i, d in enumerate(digits):
        if d != "9":
            pattern = list(digits)
            pattern[i] = "[%d-9]" % (int(d) + 1)
            for j in range(i + 1, len(digits)):
                pattern[j] = r"\d"
            variations.append("".join(pattern))
    return "(?:%s)" % "|".join("(?:%s)" % v for v in variations)
It turned out easier to make it match numbers greater than the parameter, so if you give it 15, it’ll return a string for matching numbers 16 and greater, specifically…
(?:(?:\d{3,})|(?:[2-9]\d)|(?:1[6-9]))
You can then substitute this into your expression instead of \d+, like so:
exp = re.compile(r"page%s\.html" % digit_match_greater(last_page_visited))
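To sanity-check the generated pattern, here is a minimal sketch that compares it against a plain integer comparison over a range of page numbers. The helper name check, the upper bound of 2000, the leading "/" and the trailing "$" anchor are assumptions made for this illustration; digit_match_greater is the function from above, repeated in slightly condensed form only so that the snippet runs on its own.

```python
import re


def digit_match_greater(n):
    """Build a regex fragment matching integers strictly greater than n."""
    digits = str(n)
    # Numbers with more digits than n are always greater.
    variations = [r"\d{%d,}" % (len(digits) + 1)]
    for i, d in enumerate(digits):
        if d != "9":
            pattern = list(digits)
            # Raise the digit at position i, then allow any digits after it.
            pattern[i] = "[%d-9]" % (int(d) + 1)
            pattern[i + 1:] = [r"\d"] * (len(digits) - i - 1)
            variations.append("".join(pattern))
    return "(?:%s)" % "|".join("(?:%s)" % v for v in variations)


def check(last_page, upper=2000):
    """Hypothetical helper: True if the pattern agrees with `page > last_page`
    for every page number in range(upper)."""
    exp = re.compile(r"/page%s\.html$" % digit_match_greater(last_page))
    for page in range(upper):
        url = "http://foobar.com/page%d.html" % page
        matched = exp.search(url) is not None
        if matched != (page > last_page):
            return False
    return True
```

Calling check(15) exercises every page from 0 to 1999 against the pattern generated for 15. Note that zero-padded page numbers are not handled: for 15, a URL such as page007.html would still match through the \d{3,} branch.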