I am working on a content-based search project. For it, I am using the ‘pdftotext’ command-line utility on Ubuntu, which writes all the text from a PDF to a text file. It also writes bullets, so when I read the file to index each word, some escape sequences (like ‘\x01’) get indexed too. I know it’s because of the bullets (•). I
Tag: regex
Do Python regular expressions have an equivalent to Ruby’s atomic grouping?
Ruby’s regular expressions have a feature called atomic grouping, (?>regexp), described here. Is there any equivalent in Python’s re module? Answer Python does not directly support this feature, but you can emulate it by using a zero-width lookahead assertion ((?=RE)), which matches from the current point with the same semantics you want, putting a named group ((?P<name>RE)) inside the lookahead,
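A minimal sketch of the emulation described in the answer, using a lookahead that captures into a named group followed by a backreference to that group. The group name `atom` and the test string are made up for illustration. Note that Python 3.11 and later also accept `(?>...)` directly in the re module, so the emulation mainly matters on older versions.

```python
import re

# Plain greedy group: the engine may backtrack into a+ and give one "a"
# back so that the trailing "ab" can match.
plain = re.compile(r"a+ab")

# Emulated atomic group: the lookahead captures the longest run of "a" into
# the (hypothetical) group "atom", and the backreference consumes exactly
# that text. A lookahead is not re-entered on backtracking, so the run is
# never shortened.
atomic = re.compile(r"(?=(?P<atom>a+))(?P=atom)ab")

plain_match = plain.match("aaab")
atomic_match = atomic.match("aaab")

print(plain_match is not None)  # the backtracking version matches
print(atomic_match is None)     # the atomic version refuses to give back an "a"
```

The second pattern can never match here because the atomic run swallows every "a"; that is exactly the behaviour atomic grouping is meant to give.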
regular expression match starting clause with end
I want to be able to capture the value of an HTML attribute with a Python regexp. Currently I use My problem is that I want the regular expression to “remember” whether the attribute started with a single or a double quote. I found the bug in my current approach with the following attribute my regex catches Answer You can
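One way to make a pattern "remember" the opening quote is a backreference: capture the quote character in a group and require the same group again at the end of the value. The sketch below is an illustration under that assumption, not necessarily the answer's own code; the pattern, the sample tag, and the names `ATTR_RE` and `attrs` are invented here.

```python
import re

# Group "quote" records the opening quote character; (?P=quote) then requires
# the same character to close the value, so the other kind of quote may
# appear inside it.
ATTR_RE = re.compile(r"""([\w-]+)\s*=\s*(?P<quote>["'])(.*?)(?P=quote)""")

# Hypothetical sample tag mixing both quoting styles.
tag = """<a href="/page" title='say "hi"' data-note="it's fine">"""

# Group 1 is the attribute name, group 3 is the value between the quotes.
attrs = {m.group(1): m.group(3) for m in ATTR_RE.finditer(tag)}
print(attrs)
```

For anything beyond a quick extraction, an HTML parser is usually a more robust choice than a regex.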
Regular expression in Python won’t match end of a string
I’m just learning Python, and I can’t seem to figure out regular expressions. I want this code to print ‘yes’, but it obstinately prints ‘no’. I’ve also tried each of the following: Plus countless other variations. I’ve been searching for quite a while, but can’t find/understand anything that solves my problem. Can someone help out a newbie? Answer You’ve tried
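The asker's code is not shown in this excerpt, so the following only illustrates one common cause of this symptom: `re.match` tries the pattern at the start of the string only, while `re.search` scans forward, and `$` belongs at the end of the pattern. The sample text and variable names are assumptions.

```python
import re

text = "the quick brown fox"

# re.match only attempts a match at position 0, so a pattern describing
# the last word of the string fails.
at_start = re.match(r"fox$", text)

# re.search tries every position, and $ still pins the match to the end.
anywhere = re.search(r"fox$", text)

print("yes" if anywhere else "no")
```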
Using a RegEx to match IP addresses
I’m trying to make a test for checking whether a sys.argv input matches the RegEx for an IP address… As a simple test, I have the following… However when I pass random values into it, it returns “Acceptable IP address” in most cases, except when I have an “address” that is basically equivalent to \d+. Answer You have to modify
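The original pattern and the fix are cut off in this excerpt. As a hedged sketch, two frequent reasons for a pattern accepting too much are unescaped dots and a missing anchor at the end; the version below escapes the dots, limits each octet to 0-255, and uses `re.fullmatch` so the whole string must match. `OCTET`, `IPV4_RE`, and `is_ipv4` are names invented for this example.

```python
import re

# One octet: 250-255, 200-249, 100-199, or 0-99 without a leading zero.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"

# Four octets separated by literal dots (note the escaped "\.").
IPV4_RE = re.compile(rf"{OCTET}(?:\.{OCTET}){{3}}")


def is_ipv4(text):
    """Return True if text is a dotted-quad IPv4 address (hypothetical helper)."""
    # fullmatch requires the pattern to cover the entire string.
    return IPV4_RE.fullmatch(text) is not None


for candidate in ("192.168.0.1", "256.1.1.1", "1234", "1.2.3"):
    print(candidate, is_ipv4(candidate))
```

If a regex is not a requirement, the standard library's `ipaddress.ip_address()` performs the same validation and also handles IPv6.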
heavy regex – really time consuming
I have the following regex to detect start and end script tags in the html file: meaning in short it will catch: <script "NOT THIS</s" > "NOT THIS</s" </script> It works, but it needs a really long time to detect <script>, even minutes or hours for long strings. The lite version works perfectly even for long strings: however, the extended pattern I
Check for camel case in Python
I would like to check whether a string is camel case or not (boolean). I am inclined to use a regex, but any other elegant solution would work. I wrote a simple regex Would this be correct? Or am I missing something? Edit I would like to capture names in a collection of text documents of the format Edit2
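The asker's regex is missing from this excerpt, so here is a minimal sketch under one assumed definition of camel case: the string starts with a lowercase ASCII letter, contains only ASCII letters and digits, and has at least one uppercase "hump". The function name `is_camel_case` and that definition are assumptions; adjust the pattern if names like `CamelCase` should also count.

```python
import re

# A lowercase start, then one or more humps: an uppercase letter followed by
# any number of lowercase letters or digits.
CAMEL_RE = re.compile(r"[a-z][a-z0-9]*(?:[A-Z][a-z0-9]*)+")


def is_camel_case(text):
    """Return True if text is lower camel case under the assumed definition."""
    return CAMEL_RE.fullmatch(text) is not None


for word in ("camelCase", "parseHTTPResponse", "CamelCase", "camelcase", "camel_case"):
    print(word, is_camel_case(word))
```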
Python Regex to find a string in double quotes within a string
I’m looking for Python code using a regex that can perform something like this Input: Regex should return "String 1" or "String 2" or "String3" Output: String 1,String2,String3 I tried r'"*"' Answer Here’s all you need to do: result: As pointed out by Li-aung Yip: To elaborate, .+? is the “non-greedy” version of .+. It makes the regular expression
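The answer's code did not survive in this excerpt. A short sketch of the non-greedy approach it describes could look like the following, with the input sentence taken from the question and the variable names chosen here.

```python
import re

text = 'Regex should return "String 1" or "String 2" or "String3"'

# .+? stops at the first closing quote instead of running on to the last
# one, so each quoted string is captured separately.
quoted = re.findall(r'"(.+?)"', text)

print(",".join(quoted))  # String 1,String 2,String3
```

With the greedy `.+` instead, the single match would run from the first quote to the last one in the line.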
Get the string within brackets in Python
I have a sample string <alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card] …>, created=1324336085, description='Customer for My Test App', livemode=False> I only want the value cus_Y4o9qMEZAugtnW and NOT card (which is inside another []) How could I do it in the easiest possible way in Python? Maybe by using RegEx (which I am not good at)? Answer How about: For me this prints: Note that the
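The answer's code is cut off here. One possible sketch relies on `re.search` returning only the first match, which in this string is the customer id rather than `card`. The sample below is an abbreviated version of the string in the question, and the variable names are invented.

```python
import re

# Abbreviated form of the sample string from the question.
sample = (
    "<alpha.Customer[cus_Y4o9qMEZAugtnW] "
    "active_card=<alpha.AlphaObject[card] ...>, created=1324336085>"
)

# \[ and \] match literal brackets; [^\]]* captures everything up to the
# first closing bracket. re.search stops at the first match in the string.
match = re.search(r"\[([^\]]*)\]", sample)
customer_id = match.group(1) if match else None

print(customer_id)  # cus_Y4o9qMEZAugtnW
```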
Regular expression to find any number in a string
What’s the notation for any number in re? Like if I’m searching a string for any number, positive or negative. I’ve been using \d+ but that can’t find 0 or -1 Answer Searching for positive, negative, and/or decimals, you could use [+-]?\d+(?:\.\d+)? This isn’t very smart about leading/trailing zeros, of course: Edit: Corrected the above regex to find single-digit numbers.
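A quick way to see the pattern from the answer at work is `re.findall`, which returns the matched numbers as strings. The sample sentence and names below are made up for illustration.

```python
import re

# Optional sign, one or more digits, then an optional fractional part.
NUMBER_RE = re.compile(r"[+-]?\d+(?:\.\d+)?")

text = "temps: 0, -1, +3.14 and 42"
numbers = NUMBER_RE.findall(text)

print(numbers)  # ['0', '-1', '+3.14', '42']
```

The results are strings; pass them to `int()` or `float()` if numeric values are needed.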