I want to replace multiple strings in a list of dataframes wherever a cell matches one of them. I cannot get the matches replaced in place; instead, my attempt produces additional rows.
Here’s the example data:
import pandas as pd
import re
from scipy import linalg
nm=['sr', 'pop15', 'pop75', 'dpi', 'ddpi']
df_tbl=pd.DataFrame(linalg.circulant(nm))
ls_comb = [df_tbl.loc[0:i] for i in range(0, len(df_tbl))]
extract_text=['dpi', 'pop15'] 
clean_text=['np.log(dpi)', 'np.log(pop15)']
cl_text=[re.search(r'(?<=\()[^\^\)]+', i).group(0) for i in clean_text]
int_text=list(set(extract_text).intersection(cl_text))
I know that int_text is the same as extract_text here, but in some cases clean_text may contain only one np.log term, so I kept the intersection and use int_text to filter.
Here is what I have tried:
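To make the single-term case concrete, here is a minimal sketch. It assumes a hypothetical clean_text with only one np.log entry, and the names inner_to_clean and mapping are illustrative rather than part of the code above. Pairing each inner name with its wrapped form in a dictionary avoids relying on the order of a set intersection, which is not guaranteed.

```python
import re

extract_text = ['dpi', 'pop15']
clean_text = ['np.log(dpi)']  # hypothetical: only one wrapped term

# Map each inner name to its wrapped form, e.g. 'dpi' -> 'np.log(dpi)'.
inner_to_clean = {
    re.search(r'(?<=\()[^\^\)]+', c).group(0): c
    for c in clean_text
}

# Keep only the names that also appear in extract_text.
mapping = {
    name: inner_to_clean[name]
    for name in extract_text
    if name in inner_to_clean
}

print(mapping)  # {'dpi': 'np.log(dpi)'}
```

A dictionary built this way can be passed directly to DataFrame.replace, which matches whole cell values by default.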
[
    i.apply(
        lambda x: [
            re.sub(rf"\b{ext_t}\b", cln_t, val)
            for val in x
            for ext_t, cln_t in zip(int_text, clean_text)
        ]
    )
    for i in ls_comb
]
It produces the following:
[    0     1            2      3              4
 0  sr  ddpi  np.log(dpi)  pop75          pop15
 1  sr  ddpi          dpi  pop75  np.log(pop15),
                0     1            2            3              4
 0             sr  ddpi  np.log(dpi)        pop75          pop15
 1             sr  ddpi          dpi        pop75  np.log(pop15)
 2          pop15    sr         ddpi  np.log(dpi)          pop75
 3  np.log(pop15)    sr         ddpi          dpi          pop75,
                0              1            2            3              4
 0             sr           ddpi  np.log(dpi)        pop75          pop15
 1             sr           ddpi          dpi        pop75  np.log(pop15)
 2          pop15             sr         ddpi  np.log(dpi)          pop75
 3  np.log(pop15)             sr         ddpi          dpi          pop75
 4          pop75          pop15           sr         ddpi    np.log(dpi)
 5          pop75  np.log(pop15)           sr         ddpi            dpi,
.
.
.
However, this produces additional rows. I expect a clean result like this:
[       0            1            2            3            4
 0     sr          ddpi       np.log(dpi)    pop75      np.log(pop15),
        0            1            2            3            4
 0     sr          ddpi       np.log(dpi)     pop75     np.log(pop15)
 1  np.log(pop15)   sr          ddpi       np.log(dpi)     pop75,
.
.
.
Answer
import pandas as pd
from scipy import linalg

nm=['sr', 'pop15', 'pop75', 'dpi', 'ddpi']
df_tbl=pd.DataFrame(linalg.circulant(nm))
extract_text=['dpi', 'pop15']
clean_text=['np.log(dpi)', 'np.log(pop15)']
df_tbl.replace(extract_text, clean_text, inplace=True)
print(df_tbl)
Output:
               0              1              2              3              4
0             sr           ddpi    np.log(dpi)          pop75  np.log(pop15)
1  np.log(pop15)             sr           ddpi    np.log(dpi)          pop75
2          pop75  np.log(pop15)             sr           ddpi    np.log(dpi)
3    np.log(dpi)          pop75  np.log(pop15)             sr           ddpi
4           ddpi    np.log(dpi)          pop75  np.log(pop15)             sr
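The extra rows in the question come from the nested comprehension: it loops over every value and every (pattern, replacement) pair, so each cell yields one result per pattern and a column of k values comes back with 2k entries. If you need the replacement on the list of sliced frames rather than on df_tbl itself, the same replace call can be applied to each element. The following is a minimal sketch under some assumptions: the circulant layout is built by hand instead of with scipy, and the names mapping and ls_clean are illustrative.

```python
import pandas as pd

nm = ['sr', 'pop15', 'pop75', 'dpi', 'ddpi']
n = len(nm)

# Same layout as scipy.linalg.circulant(nm), built by hand here
# (an assumption made only to keep the sketch free of scipy).
df_tbl = pd.DataFrame([[nm[(i - j) % n] for j in range(n)] for i in range(n)])
ls_comb = [df_tbl.loc[0:i] for i in range(len(df_tbl))]

# Whole-cell replacements: 'ddpi' is left alone because it is not
# equal to 'dpi', so no word-boundary regex is needed.
mapping = {'dpi': 'np.log(dpi)', 'pop15': 'np.log(pop15)'}

# replace() returns a new frame of the same shape for each element.
ls_clean = [df.replace(mapping) for df in ls_comb]

print([df.shape for df in ls_clean])  # [(1, 5), (2, 5), (3, 5), (4, 5), (5, 5)]
print(ls_clean[1])
```

Because replace matches whole cell values here rather than substrings, each frame keeps its original number of rows and columns.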