Generate a Python regex at runtime to match numbers from ‘n’ to infinity

I am using Scrapy to crawl a website and extract data from it. Scrapy uses regex-based rules to decide whether a page should be parsed or a link should be followed.

I am implementing a resume feature for my spider so that it can continue crawling from the last visited page. For this, I get the last followed link from a database when the spider is launched.

My site URLs look like http://foobar.com/page1.html, so the rule’s regex to follow every link like this would usually be something like /page\d+\.html.

But how can I write a regex so it would match, for example, page 15 and more? Also, as I don’t know the starting point in advance, how could I generate this regex at runtime?

Answer

Try this:
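
The answer's original code is not preserved here, so the following is only a minimal sketch of one way to build such a pattern, not the answerer's implementation. The function name `regex_greater_than` is hypothetical, and the sketch assumes page numbers are non-negative decimal integers written without leading zeros.

```python
import re


def regex_greater_than(n: int) -> str:
    """Return a regex fragment matching decimal integers strictly greater than n.

    Assumption: numbers are written without leading zeros.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    digits = str(n)
    k = len(digits)
    # Any number with more digits than n is greater than n.
    alternatives = [r"[1-9]\d{%d,}" % k]
    # Same digit count: keep a prefix of n, raise one digit, free the rest.
    for i, ch in enumerate(digits):
        d = int(ch)
        if d == 9:
            continue  # no larger digit exists at this position
        higher = "9" if d == 8 else "[%d-9]" % (d + 1)
        rest = k - i - 1
        tail = r"\d{%d}" % rest if rest else ""
        alternatives.append(digits[:i] + higher + tail)
    return "(?:" + "|".join(alternatives) + ")"


fragment = regex_greater_than(15)
print(fragment)
print(bool(re.fullmatch(fragment, "16")), bool(re.fullmatch(fragment, "15")))
```

The fragment is wrapped in a non-capturing group so that its alternation stays contained when it is embedded in a longer expression.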

It turned out easier to make it match numbers greater than the parameter, so if you give it 15, it’ll return a string for matching numbers 16 and greater, specifically…

You can then substitute this into your expression instead of \d+, like so:
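
As a rough illustration of that substitution (again a sketch, since the original snippet is missing), the block below wraps a number fragment in the /page…\.html pattern from the question. The helper names `page_link_pattern` and `should_follow` are hypothetical, and the fragment for "greater than 15" is written out by hand here rather than generated.

```python
import re


def page_link_pattern(number_fragment: str) -> str:
    """Build the link pattern around a page-number regex fragment."""
    # The literal "/page" and "\.html" pin down both ends of the number,
    # so the fragment cannot match only part of a longer page number.
    return r"/page" + number_fragment + r"\.html$"


# Hand-written fragment for integers greater than 15 (illustrative only).
AFTER_15 = r"(?:[1-9]\d{2,}|[2-9]\d|1[6-9])"

pattern = re.compile(page_link_pattern(AFTER_15))


def should_follow(url: str) -> bool:
    """Return True if the URL points at a page after page 15."""
    return pattern.search(url) is not None


for url in ("http://foobar.com/page15.html", "http://foobar.com/page16.html"):
    print(url, should_follow(url))
```

In Scrapy, the resulting pattern string would presumably be passed as the `allow` argument of the rule's `LinkExtractor`, with the last visited page number read from the database at spider start-up.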

User contributions licensed under: CC BY-SA