regex substitute every appearance of a capture group with another capture group

I am reformatting a large set of sales data.

Each sale shows the name of the item, number of items being sold, and the price rounded to the nearest whole number.

1 bag of 20 Apples sold for $3: Apple/,20,3,

If more than one sale occurs, the sales data replaces the item name for every result after the first one.

4 bags of 20 Apples sold for $3: Apple/,20,3%20,3%20,3%20,3,

I need to display the item name for every sale instead of the % sign

Desired Result: Apple/,20,3,Apple/,20,3,Apple/,20,3,Apple/,20,3,

So Far: I have been banging my head against this for six hours, and tried several approaches.

I had thought that running a regex substitution with Python's re module, using the expression ([A-Za-z]+/)?(%)?(\d+,\d+,) and substituting \1\3 for the full match, would produce the needed result. However, this only applies the first capture group to the beginning of all consecutive matches of the third capture group.

Apple/,20,3,20,3,20,3,20,3,

I suspect this has to do with the difference between capture groups and capture objects, but am stuck trying to find a way to append the first capture group to each appearance of a capture object for a given capture group (eg, append capture group 1 to the beginning of every match for capture group 3.)

To get around this, I tried a modified version of the answer for: https://stackoverflow.com/questions/32670413/replace-all-matches-using-re-findall

import re

regex = re.compile('([A-Za-z]+/)?(%)?(\d+,\d+,)', re.S)
itemsales = 'Apple/20,3,%20,3,%20,3,%20,3,'
sales_fixed = regex.sub(lambda m: m.group().replace('%',"\1",1), myfile)
print(sales_fixed)

and this returns the exact same result of

Apple/,20,3,20,3,20,3,20,3,

which I suspect may be a result of incorrectly referencing my capture group in the substitution

How can I replace the percent signs with the product name?


Answer

The pattern you tried matches only the last part of the string. Because the first two groups are optional, it can match just the % and the 20,3, that follows it, without ever matching Apple/, so group 1 is empty in that match.
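To see this for yourself, here is a small sketch that lists every match of the question's pattern against the string from the description. The pattern is written with the \d escapes, which I am assuming is what was intended; the variable names are only illustrative.

```python
import re

# The pattern from the question (assumed to use \d for the digits).
question_pattern = re.compile(r"([A-Za-z]+/)?(%)?(\d+,\d+,)")
itemsales = "Apple/,20,3%20,3%20,3%20,3,"

# Collect the span and the captured groups of every match.
matches = [(m.span(), m.groups()) for m in question_pattern.finditer(itemsales)]
print(matches)
```

There is a single match, covering only the trailing %20,3, and its first group is None. The earlier 20,3 parts are followed by a % rather than a comma, so the third group cannot match there.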

To match the format described in the question, you could first match Apple/ in group 1, and then repeat the part that matches digits, a comma and digits followed by a %.

Then, in the replacement, replace every % in the match with capture group 1, x.group(1), placed between two commas.

A few notes about the code and the pattern:

  • You don’t have to use re.S as there are no dots in the pattern that have to match a newline.
  • You don’t have to escape the , / and %
  • There are 2 different strings used in the description of the question, and in the example code.

The pattern could look like:

\b([A-Za-z]+/),(?:\d+,\d+%)+
  • \b A word boundary to prevent a partial match
  • ( Capture group 1
    • [A-Za-z]+/ Match 1+ chars in the ranges A-Z or a-z, followed by /
  • ) Close group 1
  • ,(?:\d+,\d+%)+ Match a comma, then repeat 1+ times: 1+ digits, a comma, 1+ digits and a %
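As a quick check of the breakdown above, this sketch shows which part of the example string the pattern consumes and what ends up in group 1. The names full_match, item_name and unmatched_tail are only illustrative.

```python
import re

pattern = r"\b([A-Za-z]+/),(?:\d+,\d+%)+"
itemsales = "Apple/,20,3%20,3%20,3%20,3,"

m = re.search(pattern, itemsales)

full_match = m.group()                # everything up to and including the last %
item_name = m.group(1)                # the item name captured in group 1
unmatched_tail = itemsales[m.end():]  # what is left after the match

print(full_match)
print(item_name)
print(unmatched_tail)
```

Note that the final 20,3, is not part of the match, because it ends with a comma instead of a %. That is fine here: only the % characters inside the match need replacing, and the tail is left as it is.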

For example

import re

pattern = r"\b([A-Za-z]+/),(?:\d+,\d+%)+"
itemsales = "Apple/,20,3%20,3%20,3%20,3,"

sales_fixed = re.sub(
    pattern,
    lambda x: x.group().replace('%', ",{0},".format(x.group(1))),
    itemsales
)

print(sales_fixed)

Output

Apple/,20,3,Apple/,20,3,Apple/,20,3,Apple/,20,3,
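If the data contains more than one item, the same substitution could be wrapped in a helper. This is a minimal sketch, not something taken from the question: the function name expand_item_names and the Pear/ sample string are made up for illustration, and it assumes every item follows the same Name/,count,price layout.

```python
import re

# Same pattern as above, compiled once for reuse.
PATTERN = re.compile(r"\b([A-Za-z]+/),(?:\d+,\d+%)+")


def expand_item_names(record: str) -> str:
    """Replace each % inside a match with ',<item name>,'."""
    return PATTERN.sub(
        lambda m: m.group().replace("%", "," + m.group(1) + ","),
        record,
    )


# Hypothetical input with two different items.
mixed = "Apple/,20,3%20,3,Pear/,10,2%10,2,"
print(expand_item_names(mixed))

# A single sale has no % and is left unchanged.
print(expand_item_names("Apple/,20,3,"))
```

Each match starts at its own item name, so group 1 always refers to the item that the following % signs belong to.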

