I have 2 dataframes, the first is a small dataframe (df1) with information to use to fill a field (named Flag) of the second dataframe (df2). I need to write a function that uses each row of df1 as parameters to fill each row of df2 with a certain value (Y or N).
df1 =
type | q25 | q75 |
---|---|---|
A | 13 | 98 |
B | 381 | 500 |
C | 34 | 103 |
df2 =
field1 | field2 | … | TYPE | delta | Flag |
---|---|---|---|---|---|
field1 | field2 | … | A | 379 | Y |
field1 | field2 | … | C | 90 | N |
field1 | field2 | … | A | 50 | N |
field1 | field2 | … | B | 2000 | Y |
I tried this code, but unfortunately it doesn’t work because it overwrites the lines in df2 related to the cases of df1 with type equal to A or B, and it works for the last record of df2 (type equal to C)
def filling(x, row): IQR = row['q75'] - row['q25'] if (x['Flag'] != 'Y'): if row['type'] == x['TYPE'] and (x['delta'] < row['q25'] - 1.5*IQR or x['delta'] > row['q75'] + 1.5*IQR): return 'Y' else: return 'N' for index, row in df2.iterrows(): df1['Flag'] = df1.apply(lambda x : filling(x, row), axis=1)
How could I fix it?
Advertisement
Answer
From what I understand you would like to have a flag column which tell you whether the particular row is an outlier or not. Here is a vectorized and concise way to achieve that:
# Merge the dataframes on type column s = df2.merge(df1, left_on='TYPE', right_on='type', how='left') # calculate IQR and condition to check for outlier s['IQR'] = s['q75'] - s['q25'] is_outlier = ~s['delta'].between(s['q25'] - 1.5 * s['IQR'], s['q25'] + 1.5 * s['IQR']) # Use np.where to select Y/N based on the outlier condition s['Flag'] = np.where(s['Flag'].ne('Y') & is_outlier, 'Y', s['Flag']) # drop the columns from df1 s = s.drop(columns=df1.columns)
Result
print(s) field1 field2 ... TYPE delta Flag IQR 0 field1 field2 ... A 379 Y 85 1 field1 field2 ... C 90 N 69 2 field1 field2 ... A 50 N 85 3 field1 field2 ... B 2000 Y 119