I am working with Python and pandas. I have a dataset of lab analyses with multiple parameters and detection limits (dl). Many of the samples are reported as below the dl (e.g. <dl, <4).
For example:
import pandas as pd

df = pd.DataFrame([['<4', '88.72', '<0.09'],
                   ['<1', '5', '<0.09'],
                   ['2', '17.6', '<0.09']],
                  columns=['var_1', 'var_2', 'var_3'])
df
My goal is to replace all <dl with dl/2 as a float value.
I can do this for one column pretty easily.
df['var_3'] = df.var_3.str.replace('<', '').astype(float)
df['var_3'] = df['var_3'].apply(lambda x: x / 2 if x == 0.09 else x)
df
but this requires me to look up the detection limit and input it manually.
I would like to streamline this so it applies across all variables, with one or more detection limits per variable, since I have many variables and the detection limit will not always be constant from data frame to data frame.
I found something similar in R but I'm not sure how to apply it in Python. Any solutions would be appreciated.
Update
So the
df = df.replace(r'<(.*)', r'\1/2', regex=True).apply(pd.eval)
works well on a dataframe whose columns contain only numbers; I assume that is a limitation of the eval function. For some reason the code works on smaller dataframes, but after I concatenate them it fails on the larger dataframe with this error message:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/9_/w2qcdj_x2x5852py8xl6b0sh0000gn/T/ipykernel_9403/3946462310.py in <module>
----> 1 MS=MS.replace(r'<(.*)', r'\1/2', regex=True).apply(pd.eval)

~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwargs)
   8738             kwargs=kwargs,
   8739         )
-> 8740         return op.apply()
   8741
   8742     def applymap(

~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/apply.py in apply(self)
    686             return self.apply_raw()
    687
--> 688         return self.apply_standard()
    689
    690     def agg(self):

~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/apply.py in apply_standard(self)
    810
    811     def apply_standard(self):
--> 812         results, res_index = self.apply_series_generator()
    813
    814         # wrap results

~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/apply.py in apply_series_generator(self)
    826         for i, v in enumerate(series_gen):
    827             # ignore SettingWithCopy here in case the user mutates
--> 828             results[i] = self.f(v)
    829             if isinstance(results[i], ABCSeries):
    830                 # If we have a view on v, we need to make a copy because

~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/computation/eval.py in eval(expr, parser, engine, truediv, local_dict, global_dict, resolvers, level, target, inplace)
    351     eng = ENGINES[engine]
    352     eng_inst = eng(parsed_expr)
--> 353     ret = eng_inst.evaluate()
    354
    355     if parsed_expr.assigner is None:

~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/computation/engines.py in evaluate(self)
     78
     79         # make sure no names in resolvers and locals/globals clash
---> 80         res = self._evaluate()
     81         return reconstruct_object(
     82             self.result_type, res, self.aligned_axes, self.expr.terms.return_type

~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/computation/engines.py in _evaluate(self)
    119         scope = env.full_scope
    120         _check_ne_builtin_clash(self.expr)
--> 121         return ne.evaluate(s, local_dict=scope)
    122
    123

~/opt/anaconda3/lib/python3.9/site-packages/numexpr/necompiler.py in evaluate(ex, local_dict, global_dict, out, order, casting, **kwargs)
    821
    822     # Create a signature
--> 823     signature = [(name, getType(arg)) for (name, arg) in
    824                  zip(names, arguments)]
    825

~/opt/anaconda3/lib/python3.9/site-packages/numexpr/necompiler.py in <listcomp>(.0)
    821
    822     # Create a signature
--> 823     signature = [(name, getType(arg)) for (name, arg) in
    824                  zip(names, arguments)]
    825

~/opt/anaconda3/lib/python3.9/site-packages/numexpr/necompiler.py in getType(a)
    703     if kind == 'U':
    704         raise ValueError('NumExpr 2 does not support Unicode as a dtype.')
--> 705     raise ValueError("unknown type %s" % a.dtype.name)
    706
    707

ValueError: unknown type object
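For what it's worth, this "unknown type object" error usually means at least one column handed to eval still contains values the numexpr engine cannot handle, such as NaN introduced by the concatenation or stray non-numeric text. A NaN-tolerant sketch that sidesteps eval entirely (my own workaround, not part of the answer below; the function name is made up):

```python
import pandas as pd
import numpy as np

def halve_below_dl(s: pd.Series) -> pd.Series:
    """Turn '<dl' strings into dl/2, pass plain numbers through, keep NaN as NaN."""
    s = s.astype(str)                                   # NaN becomes the string 'nan'
    below = s.str.startswith('<')                       # flag below-detection-limit cells
    vals = pd.to_numeric(s.str.lstrip('<'), errors='coerce')  # 'nan'/junk -> NaN
    return vals.where(~below, vals / 2)                 # halve only the flagged cells

df = pd.DataFrame([['<4', '88.72', '<0.09'],
                   ['<1', np.nan, '<0.09'],
                   ['2', '17.6', '<0.09']],
                  columns=['var_1', 'var_2', 'var_3'])
res = df.apply(halve_below_dl)
```

Because everything goes through pd.to_numeric with errors='coerce', a frame produced by pd.concat with missing cells no longer crashes; unparseable cells simply come out as NaN.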
Answer
Use replace instead of str.replace, then eval all expressions:
>>> df.replace(r'<(.*)', r'\1/2', regex=True).apply(pd.eval)
   var_1  var_2  var_3
0    2.0  88.72  0.045
1    0.5   5.00  0.045
2    2.0  17.60  0.045
\1 in the replacement will be substituted with the first capture group (.*), so '<4' becomes the expression '4/2'.
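To see what eval actually receives, it can help to inspect the intermediate frame after the replace step (same df as in the question):

```python
import pandas as pd

df = pd.DataFrame([['<4', '88.72', '<0.09'],
                   ['<1', '5', '<0.09'],
                   ['2', '17.6', '<0.09']],
                  columns=['var_1', 'var_2', 'var_3'])

# After the regex replace, every '<x' cell holds the string expression 'x/2';
# cells without '<' are untouched. pd.eval then evaluates each column of strings.
step1 = df.replace(r'<(.*)', r'\1/2', regex=True)
```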
Update
Alternative:
out = df.melt(ignore_index=False)
m = out['value'].str.startswith('<')
out.loc[m, 'value'] = out.loc[m, 'value'].str.strip('<').astype(float) / 2
out = (out.reset_index()
          .pivot('index', 'variable', 'value')
          .rename_axis(index=None, columns=None))
Output:
>>> out
  var_1  var_2  var_3
0   2.0  88.72  0.045
1   0.5      5  0.045
2     2   17.6  0.045
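Note that in this variant the cells without '<' stay as strings, which is why the output above prints values like 5 and 17.6 unaligned: the columns are object dtype. A runnable sketch of the same idea, using keyword arguments for pivot (required in newer pandas versions) and a final astype(float) to get numeric columns:

```python
import pandas as pd

df = pd.DataFrame([['<4', '88.72', '<0.09'],
                   ['<1', '5', '<0.09'],
                   ['2', '17.6', '<0.09']],
                  columns=['var_1', 'var_2', 'var_3'])

out = df.melt(ignore_index=False)                 # long format, original index kept
m = out['value'].str.startswith('<')              # below-detection-limit rows
out.loc[m, 'value'] = out.loc[m, 'value'].str.strip('<').astype(float) / 2
out = (out.reset_index()
          .pivot(index='index', columns='variable', values='value')
          .rename_axis(index=None, columns=None)
          .astype(float))                         # cells without '<' were still strings
```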
Update
Alternative using melt to flatten your dataframe and pivot_table to reshape it back to your original dataframe:
df1 = df.melt(ignore_index=False)
m = df1['value'].str.startswith('<')
df1['value'] = (df1['value'].mask(~m).str[1:].astype(float).div(2)
                            .fillna(df1['value']).astype(float))
df1 = (df1.reset_index()
          .pivot_table('value', 'index', 'variable')
          .rename_axis(index=None, columns=None))
Output:
>>> df1
   var_1  var_2  var_3
0    2.0  88.72  0.045
1    0.5   5.00  0.045
2    2.0  17.60  0.045
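One caveat worth knowing (my observation, not from the answer): pivot_table aggregates duplicate (index, variable) pairs, taking the mean by default, so the reshape is only lossless because melt(ignore_index=False) on a frame with a unique index produces exactly one row per pair. A self-contained sketch of this variant with keyword arguments spelled out:

```python
import pandas as pd

df = pd.DataFrame([['<4', '88.72', '<0.09'],
                   ['<1', '5', '<0.09'],
                   ['2', '17.6', '<0.09']],
                  columns=['var_1', 'var_2', 'var_3'])

df1 = df.melt(ignore_index=False)
m = df1['value'].str.startswith('<')
# mask(~m) blanks the plain-number rows, halve the '<dl' rows,
# then fillna restores the plain numbers before the final float cast
df1['value'] = (df1['value'].mask(~m).str[1:].astype(float).div(2)
                            .fillna(df1['value']).astype(float))
df1 = (df1.reset_index()
          .pivot_table(values='value', index='index', columns='variable')
          .rename_axis(index=None, columns=None))
```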