
Regex Pattern in Python for special characters

I asked a similar question on here a few days ago and it was a great help! As a new challenge, I want to develop the regex pattern further so that it looks for specific formats. I thought I had solved it by building and testing a pattern on regex101, but when I applied it in Python I received a ‘pattern contains no capture groups’ error. Below is a test df, an image of what the results should look like, and the code provided via StackOverflow that worked with digits only.

       
import pandas as pd

df = pd.DataFrame([["{1} | | Had a Greeter welcome clients {1.0}     | | Take measures to ensure a safe and organized distribution {1.000}         | | Protected confidentiality of clients (on social media, pictures, in conversation, own congregation members receiving assistance, etc.)",
                    "{1.00}  | | Chairs for clients to sit in while waiting {1.0000}     | | Take measures to ensure a safe and organized distribution"],
                   ["{1  } | Financial literacy/budgeting {1   } | | Monetary/Bill Support {1}    | | Mental Health Services/Counseling",
                    "{1}| | Clothing Assistance {1       }  | | Healthcare {1}    | | Mental Health Services/Counseling {1}     | | Spiritual Support {1}      | | Job Skills Training"]
                    ] , columns = ['CF1', 'CF2'])

Result

Here is the code from the previous iteration, which worked with digits only. I swapped in my new regex pattern and it did not work.

Original code: (df.stack().str.extractall(r'(\d+)')[0].groupby(level=[0,1]).sum().unstack())

New code (failed to recognize the pattern): (df.stack().str.extractall(r'(?<={)[\d+. ]+(?=})')[0].astype(int).groupby(level=[0,1]).sum().unstack())

**In the test df you will see that I only want to capture the numbers between “{}”, and that the number I want to capture and sum is followed by a mixture of decimals and spaces. The new pattern did not work in application, so any help would be great!**


Answer

Your (?<={)[\d+. ]+(?=}) regex contains no capturing groups, while Series.str.extractall requires at least one capturing group to output a value.
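A minimal sketch of that difference, assuming a small made-up Series (the sample strings are illustrative and not taken from the question's DataFrame). The second pattern only wraps the character class in parentheses to show that a capturing group is what extractall needs; it is not the pattern recommended in this answer.

```python
import pandas as pd

# Hypothetical sample data, shaped like the strings in the question
s = pd.Series(["{1} | | A {1.0 } | | B", "{2  } | C"])

# Lookaround-only pattern: it is a valid regex, but it defines no
# capturing group, so extractall refuses it with a ValueError.
try:
    s.str.extractall(r'(?<={)[\d+. ]+(?=})')
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Wrapping the part to keep in parentheses gives extractall a group to return.
# The captured text still carries trailing spaces, hence the strip().
captured = s.str.extractall(r'(?<={)([\d. ]+)(?=})')[0].str.strip().tolist()

print(error_message)
print(captured)
```

Because the character class also allows spaces, the captured strings need cleaning before conversion, which is one reason to match the whitespace outside the group instead.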

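You need to use a pattern with a capturing group, and astype(float) instead of astype(int), since the captured values include decimals such as 1.0: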

(df.stack().str.extractall(r'{\s*(\d+(?:\.\d+)?)\s*}')[0].astype(float).groupby(level=[0,1]).sum().unstack())

Output:

   CF1  CF2
0  3.0  2.0
1  3.0  5.0
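For readers less familiar with the chained calls, the one-liner can be broken into named steps. This is only a sketch on a shortened, made-up DataFrame (the cell texts and the intermediate variable names are assumptions for illustration), so its totals differ from the output above.

```python
import pandas as pd

# Hypothetical, shortened stand-in for the question's test df
df = pd.DataFrame(
    [["{1} | | A {1.0}  | | B", "{1.00} | | C"],
     ["{1  } | D {1   } | | E", "{1}| | F {1 } | | G {1} | | H"]],
    columns=["CF1", "CF2"],
)

# 1. One long Series indexed by (row label, column name)
stacked = df.stack()

# 2. One output row per regex match; the index gains a third 'match' level
matches = stacked.str.extractall(r'{\s*(\d+(?:\.\d+)?)\s*}')

# 3. Column 0 holds Group 1 as strings; convert before summing
numbers = matches[0].astype(float)

# 4. Sum per original cell, then pivot the column names back out
totals = numbers.groupby(level=[0, 1]).sum().unstack()

print(totals)
```

Grouping on levels 0 and 1 collapses the 'match' level, so each original cell ends up with one summed value.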

The {\s*(\d+(?:\.\d+)?)\s*} regex matches

  • { – a { char
  • \s* – zero or more whitespaces
  • (\d+(?:\.\d+)?) – Group 1 (this group's captured value is what extractall outputs, since the method requires at least one capturing group): one or more digits, then an optional occurrence of a . and one or more digits
  • \s* – zero or more whitespaces
  • } – a } char.
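The breakdown above can be checked outside pandas with the standard re module. The sample string below is made up for illustration; with a single capturing group, re.findall returns only the Group 1 text for each match.

```python
import re

pattern = re.compile(r'{\s*(\d+(?:\.\d+)?)\s*}')

# Hypothetical sample covering plain, decimal, padded, and non-numeric braces
sample = "{1} | | A {1.000}   | | B {1   } | | C {  2.5 } | D {x}"

# Whitespace inside the braces is matched by \s* but stays outside Group 1,
# so the captured strings convert to float without further cleaning.
groups = pattern.findall(sample)
values = [float(g) for g in groups]

print(groups)
print(sum(values))
```

The {x} part is skipped because Group 1 requires at least one digit.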

