I am using Scrapy to crawl a website and extract data from it. Scrapy uses regex-based rules to decide whether a page should be parsed or a link followed. I am implementing a resume feature for my spider so that it can continue crawling from the last visited page. For this, I get the last followed link
Tag: regex
Regular expression to return text between parentheses
All I need is the contents inside the parentheses. Answer If your problem is really just this simple, you don’t need regex:
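The answer's own code is not included in this excerpt. As a rough sketch of what a regex-free approach could look like, the hypothetical helper `between_parens` below uses plain string searching, and `between_parens_re` shows a regex equivalent for comparison. Both assume a single, non-nested pair of parentheses.

```python
import re

def between_parens(text):
    """Return the text inside the first (...) pair, or None if there is none."""
    start = text.find("(")
    end = text.find(")", start + 1)
    if start == -1 or end == -1:
        return None
    return text[start + 1:end]

def between_parens_re(text):
    """Regex variant: capture everything between '(' and the next ')'."""
    match = re.search(r"\(([^()]*)\)", text)
    return match.group(1) if match else None

print(between_parens("foo(bar)baz"))     # bar
print(between_parens_re("foo(bar)baz"))  # bar
```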
Using a regular expression to replace upper case repeated letters in python with a single lowercase letter
I am trying to replace any instances of uppercase letters that repeat themselves twice in a string with a single instance of that letter in lower case. I am using the following regular expression and it is able to match the repeated upper case letters, but I am unsure how to make the letter that is being replaced
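The asker's regular expression is not shown in this excerpt. One possible way to approach the problem, sketched here with a hypothetical helper name, is to capture one uppercase letter, require the same letter again with a backreference, and pass a function to `re.sub` that lowercases the captured letter.

```python
import re

def collapse_doubled_caps(text):
    """Replace each doubled uppercase letter (e.g. 'AA') with one lowercase letter ('a')."""
    # ([A-Z]) captures one uppercase letter; \1 requires the same letter again.
    return re.sub(r"([A-Z])\1", lambda m: m.group(1).lower(), text)

print(collapse_doubled_caps("AAbcDDe"))  # abcde
```

Note that a run of three identical capitals is only partly collapsed under this pattern: the first two letters form a pair and the third is left unchanged.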
Do regular expressions from the re module support word boundaries (\b)?
While trying to learn a little more about regular expressions, a tutorial suggested that you can use \b to match a word boundary. However, the following snippet in the Python interpreter does not work as expected: It should have been a match object if anything was matched, but it is None. Is the \b expression not supported in Python
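The snippet from the question is missing here, so the following is only a guess at the likely cause: Python's `re` module does support `\b`, but in an ordinary (non-raw) string literal `"\b"` is the backspace character, so the regex engine never sees a word-boundary token. The sample text is made up for illustration.

```python
import re

text = "one two three"

# In a normal string literal, "\b" is the backspace character (0x08),
# so this pattern looks for backspace + "two" + backspace and finds nothing.
plain = re.search("\btwo\b", text)

# In a raw string the backslash reaches the regex engine intact,
# where \b means "word boundary".
raw = re.search(r"\btwo\b", text)

print(plain)        # None
print(raw.group())  # two
```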
How to check that a regular expression has matched a string completely, i.e. – the string did not contain any extra character?
I have two questions: 1) I have a regular expression ([A-Z][a-z]{0,2})(\d*) and I am using Python’s re.finditer() to match appropriate strings. My problem is that I want to match only strings that contain no extra characters; otherwise I want to raise an exception. I want to catch the following pattern: – capital letter, followed by 0, 1 or 2 small
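One way to reject strings with extra characters is `re.fullmatch`, which succeeds only if the whole string matches the pattern. The sketch below applies it to a single token. The helper name `parse_token` and the choice of `ValueError` are assumptions for illustration, and the truncated question may involve longer inputs than one token.

```python
import re

# Capital letter, up to two lowercase letters, then optional digits.
TOKEN = re.compile(r"([A-Z][a-z]{0,2})(\d*)")

def parse_token(text):
    """Return (letters, digits) if text matches completely, else raise ValueError."""
    match = TOKEN.fullmatch(text)
    if match is None:
        raise ValueError(f"unexpected characters in {text!r}")
    return match.group(1), match.group(2)

print(parse_token("Abc12"))  # ('Abc', '12')
```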
Python Regular Expression Match All 5 Digit Numbers but None Larger
I’m attempting to string match 5-digit coupon codes spread throughout an HTML web page. For example, 53232, 21032, 40021 etc… I can handle the simpler case of any string of 5 digits with [0-9]{5}, though this also matches 6, 7, 8… n digit numbers. Can someone please suggest how I would modify this regular expression to match only 5 digit
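A common way to match exactly five digits is to surround the pattern with negative lookarounds, so that the run of digits cannot be preceded or followed by another digit. This is a minimal sketch, and the sample text is invented for illustration.

```python
import re

# (?<!\d) : not preceded by a digit; (?!\d) : not followed by a digit.
FIVE_DIGITS = re.compile(r"(?<!\d)\d{5}(?!\d)")

page = "codes 53232, 21032 and 1234567 plus 40021"
print(FIVE_DIGITS.findall(page))  # ['53232', '21032', '40021']
```

Lookarounds are used here instead of `\b` because `\b` would not treat a letter next to the digits as a boundary, so a code glued to a word (such as `code12345`) would be missed.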
Python: Regex to extract part of URL found between parentheses
I have this weirdly formatted URL. I have to extract the contents in ‘()’. Sample URL : http://sampleurl.com/(K(ThinkCode))/profile/view.aspx If I can extract ThinkCode out of it, I will be a happy man! I am having a tough time with regexing special chars like ‘(‘ and ‘/’. Answer Explanation In regex-world, a lookbehind is a way of saying “I want to
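The answer's pattern is cut off in this excerpt. As an illustration of the lookbehind idea it mentions, the sketch below assumes the value always sits inside a literal `(K(` ... `))` wrapper, as in the sample URL. The exact pattern is a guess, not the answer's own.

```python
import re

url = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"

# (?<=\(K\() : the match must be preceded by the literal text "(K("
# [^()]+     : one or more characters that are not parentheses
# (?=\)\))   : the match must be followed by "))"
match = re.search(r"(?<=\(K\()[^()]+(?=\)\))", url)

print(match.group())  # ThinkCode
```

Parentheses are escaped with a backslash because they are otherwise grouping operators. The forward slash has no special meaning in Python's `re` and needs no escaping.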
match an alternative url – regular expression django urls
I want a Django URL with just 2 alternatives /module/in/ or /module/out/ I’m using But it matches other patterns like /module/i/, /module/n/ and /module/ou/. Any hint is appreciated :) Answer Try r'^(?P<status>in|out)/$' You need to remove \w+, which matches one or more alphanumeric characters or underscores. The regular expression suggested in bstpierre’s answer, '^(?P<status>\w+(in|out))/$' will match helloin, good_byeout and so
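The difference between the two patterns quoted in the answer can be checked with the `re` module alone, without a Django project. The sketch assumes the `module/` prefix has already been consumed by an including URLconf, so only the remaining path segment is tested. The names `strict` and `loose` are made up for illustration.

```python
import re

# Pattern recommended in the answer: exactly "in" or "out".
strict = re.compile(r"^(?P<status>in|out)/$")
# Pattern the answer warns about: any word characters followed by "in" or "out".
loose = re.compile(r"^(?P<status>\w+(in|out))/$")

for path in ["in/", "out/", "i/", "ou/", "helloin/"]:
    print(path, bool(strict.match(path)), bool(loose.match(path)))
```

With the strict pattern only `in/` and `out/` match, and the captured `status` group holds the chosen alternative.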
Extract part of a regex match
I want a regular expression to extract the title from an HTML page. Currently I have this: Is there a regular expression to extract just the contents of <title> so I don’t have to remove the tags? Answer Use ( ) in regexp and group(1) in python to retrieve the captured string (re.search will return None if it doesn’t find
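The capture-group approach described in the answer might look like the sketch below. The helper name `extract_title` and the exact pattern are assumptions, since the original code is not part of this excerpt. For anything beyond simple pages an HTML parser is usually more robust than a regex.

```python
import re

def extract_title(html):
    """Return the text inside <title>...</title>, or None if no title is found."""
    # (.*?) is the capture group; group(1) returns only what it matched.
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    # re.search returns None when nothing matches, so check before calling group().
    return match.group(1).strip() if match else None

print(extract_title("<html><head><title>My Page</title></head></html>"))  # My Page
```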
In Python, how to check if a string only contains certain characters?
In Python, how to check if a string only contains certain characters? I need to check a string containing only a..z, 0..9, and . (period) and no other character. I could iterate over each character and check the character is a..z or 0..9, or . but that would be slow. I am not clear now how to do it with
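One possible approach, sketched under the assumption that only lowercase a..z, the digits 0..9, and the period are allowed, and that an empty string should be rejected: compile a character class once and use `fullmatch` so that every character must belong to it. The helper name `only_allowed` is hypothetical.

```python
import re

# Character class with the allowed characters; inside [...] the period is literal.
ALLOWED = re.compile(r"[a-z0-9.]+")

def only_allowed(text):
    """Return True if text is non-empty and contains only a-z, 0-9 and '.'."""
    return ALLOWED.fullmatch(text) is not None

print(only_allowed("abc.123"))  # True
print(only_allowed("abc_123"))  # False
```

Replacing `+` with `*` in the pattern would make the empty string count as valid, if that is the desired behavior.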