I am working with python and pandas. I have a dataset of lab analysis where I am dealing with multiple parameters and detection limits(dl). Many of the samples are reported as below the dl (e.g.<dl,<4)
For example:
import pandas as pd
df=pd.DataFrame([['<4','88.72','<0.09'],['<1','5','<0.09'],['2','17.6','<0.09']], columns=['var_1','var_2','var_3'])
df
My goal is to replace all <dl with dl/2 as a float value.
I can do this for one column pretty easily.
df['var_3'] = df.var_3.str.replace('<' ,'').astype(float)
df['var_3'] = df['var_3'].apply(lambda x: x/2 if x == 0.09 else x)
df
but this requires me looking at the dl and inputting it.
I would like to streamline it to apply it across all variables with one or more detection limits per variable as I have many variables and the detection limit will not always be constant from data frame to data frame this is applied to.
I found something similar in R but not sure how to apply it in python. Any solutions would be appreciated.
Update
So the
df=df.replace(r'<(.*)', r'1/2', regex=True).apply(pd.eval)
works well with dataframe with just columns with numbers. I assume that is a limitation of the eval function. For some reason I can get the code to work on smaller dataframes but after I concatenate them the code will not work on the larger dataframe and I get this error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/var/folders/9_/w2qcdj_x2x5852py8xl6b0sh0000gn/T/ipykernel_9403/3946462310.py in <module>
----> 1 MS=MS.replace(r'<(.*)', r'1/2', regex=True).apply(pd.eval)
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwargs)
8738 kwargs=kwargs,
8739 )
-> 8740 return op.apply()
8741
8742 def applymap(
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/apply.py in apply(self)
686 return self.apply_raw()
687
--> 688 return self.apply_standard()
689
690 def agg(self):
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/apply.py in apply_standard(self)
810
811 def apply_standard(self):
--> 812 results, res_index = self.apply_series_generator()
813
814 # wrap results
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/apply.py in apply_series_generator(self)
826 for i, v in enumerate(series_gen):
827 # ignore SettingWithCopy here in case the user mutates
--> 828 results[i] = self.f(v)
829 if isinstance(results[i], ABCSeries):
830 # If we have a view on v, we need to make a copy because
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/computation/eval.py in eval(expr, parser, engine, truediv, local_dict, global_dict, resolvers, level, target, inplace)
351 eng = ENGINES[engine]
352 eng_inst = eng(parsed_expr)
--> 353 ret = eng_inst.evaluate()
354
355 if parsed_expr.assigner is None:
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/computation/engines.py in evaluate(self)
78
79 # make sure no names in resolvers and locals/globals clash
---> 80 res = self._evaluate()
81 return reconstruct_object(
82 self.result_type, res, self.aligned_axes, self.expr.terms.return_type
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/computation/engines.py in _evaluate(self)
119 scope = env.full_scope
120 _check_ne_builtin_clash(self.expr)
--> 121 return ne.evaluate(s, local_dict=scope)
122
123
~/opt/anaconda3/lib/python3.9/site-packages/numexpr/necompiler.py in evaluate(ex, local_dict, global_dict, out, order, casting, **kwargs)
821
822 # Create a signature
--> 823 signature = [(name, getType(arg)) for (name, arg) in
824 zip(names, arguments)]
825
~/opt/anaconda3/lib/python3.9/site-packages/numexpr/necompiler.py in <listcomp>(.0)
821
822 # Create a signature
--> 823 signature = [(name, getType(arg)) for (name, arg) in
824 zip(names, arguments)]
825
~/opt/anaconda3/lib/python3.9/site-packages/numexpr/necompiler.py in getType(a)
703 if kind == 'U':
704 raise ValueError('NumExpr 2 does not support Unicode as a dtype.')
--> 705 raise ValueError("unknown type %s" % a.dtype.name)
706
707
ValueError: unknown type object
Advertisement
Answer
Use replace
instead str.replace
than eval
all expressions:
>>> df.replace(r'<(.*)', r'1/2', regex=True).apply(pd.eval)
var_1 var_2 var_3
0 2.0 88.72 0.045
1 0.5 5.00 0.045
2 2.0 17.60 0.045
1
will be replace by the first capture group .*
Update
Alternative:
out = df.melt(ignore_index=False)
m = out['value'].str.startswith('<')
out.loc[m, 'value'] = out.loc[m, 'value'].str.strip('<').astype(float) / 2
out = out.reset_index().pivot('index', 'variable', 'value')
.rename_axis(index=None, columns=None)
Output:
>>> out
var_1 var_2 var_3
0 2.0 88.72 0.045
1 0.5 5 0.045
2 2 17.6 0.045
Update
Alternative using melt
to flatten your dataframe and pivot
to reshape to your original dataframe:
df1 = df.melt(ignore_index=False)
m = df1['value'].str.startswith('<')
df1['value'] = df1['value'].mask(~m).str[1:].astype(float).div(2)
.fillna(df1['value']).astype(float)
df1 = df1.reset_index().pivot_table('value', 'index', 'variable')
.rename_axis(index=None, columns=None)
Output:
>>> df1
var_1 var_2 var_3
0 2.0 88.72 0.045
1 0.5 5.00 0.045
2 2.0 17.60 0.045