I would love to know how to optimize this code without for-loops, if that is possible. I am trying to categorize all the values in the series df['Состояние'] by checking them against the keywords in the lists list_rep and list_dem, one by one. Thank you!
conditions = ['a','b']
list_rep = ['a1','a2']
list_dem = ['b1','b2']

# values containing a keyword from list_rep become conditions[0]
for i in list_rep:
    df['Состояние'] = [conditions[0] if i in str(x).lower() else x for x in df['Состояние']]
# values containing a keyword from list_dem become conditions[1]
for i in list_dem:
    df['Состояние'] = [conditions[1] if i in str(x).lower() else x for x in df['Состояние']]
# values containing a bare condition name become that condition
for i in conditions:
    df['Состояние'] = [i if i in str(x).lower() else x for x in df['Состояние']]
# everything else becomes '-'
df['Состояние'] = [x if x in conditions else '-' for x in df['Состояние']]
Answer
Use Series.str.lower first, then Series.str.contains with each keyword list joined by | (regex OR) to build boolean masks, and set the new values with numpy.select. Then use Series.str.extract to pull out the category and replace missing values with '-':
import numpy as np
import pandas as pd

df = pd.DataFrame({'Состояние':['abc','def','opa1','ujb2','a1sb1d','B21op']})
print (df)
  Состояние
0       abc
1       def
2      opa1
3      ujb2
4    a1sb1d
5     B21op
conditions = ['a','b']
list_rep = ['a1','a2']
list_dem = ['b1','b2']

s = df['Состояние'].str.lower()
m1 = s.str.contains('|'.join(list_rep))
m2 = s.str.contains('|'.join(list_dem))

df['Состояние'] = np.select([m1, m2], [conditions[0], conditions[1]], s)
df['Состояние'] = df['Состояние'].str.extract(f'({"|".join(conditions)})').fillna('-')
print (df)
  Состояние
0         a
1         -
2         a
3         b
4         a
5         b
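Two details of the mask step can matter on real data: the joined keywords are treated as a regular expression, and missing values make Series.str.contains return non-boolean results that numpy.select may reject. The following is only a sketch under assumptions: the sample values, the keyword 'a.1', and the names m1_unescaped and result are hypothetical and were chosen to show the effect of re.escape and na=False.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample: one missing value and one keyword containing '.'
df = pd.DataFrame({'Состояние': ['abc', None, 'op a.1', 'xa91', 'ujb2']})

conditions = ['a', 'b']
list_rep = ['a.1', 'a2']  # hypothetical keyword with a regex metacharacter
list_dem = ['b1', 'b2']

s = df['Состояние'].str.lower()

# re.escape makes each keyword match literally;
# na=False turns missing values into False in the mask
m1 = s.str.contains('|'.join(map(re.escape, list_rep)), na=False)
m2 = s.str.contains('|'.join(map(re.escape, list_dem)), na=False)

# For comparison only: unescaped, 'a.1' also matches 'a91'
m1_unescaped = s.str.contains('|'.join(list_rep), na=False)

# Same select/extract steps as in the answer, kept in a separate Series
labelled = pd.Series(np.select([m1, m2], conditions, default=s), index=df.index)
pat = f'({"|".join(map(re.escape, conditions))})'
result = labelled.str.extract(pat, expand=False).fillna('-')

print(m1.tolist())
print(m1_unescaped.tolist())
print(result.tolist())
```

If the keywords are known to be plain alphanumeric strings and the column has no missing values, the shorter version above behaves the same.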
Another idea is to create a dictionary for mapping: first use Series.str.lower and Series.str.extract, then Series.map, and finally replace the missing values:
conditions = ['a','b']
list_rep = ['a1','a2']
list_dem = ['b1','b2']

d = {**dict.fromkeys(list_rep,conditions[0]),
     **dict.fromkeys(list_dem,conditions[1]),
     **dict(zip(conditions,conditions))}
print (d)
{'a1': 'a', 'a2': 'a', 'b1': 'b', 'b2': 'b', 'a': 'a', 'b': 'b'}

pat = rf'({"|".join(d.keys())})'
df['Состояние'] = (df['Состояние'].str.lower()
                                  .str.extract(pat, expand=False)
                                  .map(d)
                                  .fillna('-'))
print (df)
  Состояние
0         a
1         -
2         a
3         b
4         a
5         b
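As a sanity check, the dictionary approach can be compared with the loop from the question on the sample data. This is a minimal sketch: the function names categorize_loop and categorize_map are hypothetical, and the loop is assumed to use list_dem in its second pass.

```python
import pandas as pd

conditions = ['a', 'b']
list_rep = ['a1', 'a2']
list_dem = ['b1', 'b2']


def categorize_loop(values):
    """Loop-based logic from the question (hypothetical wrapper)."""
    out = list(values)
    for keywords, label in ((list_rep, conditions[0]), (list_dem, conditions[1])):
        for kw in keywords:
            out = [label if kw in str(x).lower() else x for x in out]
    for label in conditions:
        out = [label if label in str(x).lower() else x for x in out]
    return [x if x in conditions else '-' for x in out]


def categorize_map(series):
    """Dictionary-mapping approach from the answer (hypothetical wrapper)."""
    d = {**dict.fromkeys(list_rep, conditions[0]),
         **dict.fromkeys(list_dem, conditions[1]),
         **dict(zip(conditions, conditions))}
    pat = rf'({"|".join(d)})'
    return series.str.lower().str.extract(pat, expand=False).map(d).fillna('-')


sample = pd.Series(['abc', 'def', 'opa1', 'ujb2', 'a1sb1d', 'B21op'])
expected = categorize_loop(sample)
actual = categorize_map(sample).tolist()
print(expected == actual)  # True on this sample
```

Agreement on this sample does not carry over to every input. The regex takes the leftmost match in each string, while the loop applies the keywords in list order, so a value such as 'bca' becomes 'b' with the mapping and 'a' with the loop.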