My data looks like this:
import pandas as pd codes = ['a', 'b', 'c'] to_flatten = [[c] * 6 for c in codes] nums = [1, 2, 3, 4, 5, 6] data_df = pd.DataFrame({'code': [item for subl in to_flatten for item in subl], 'num': nums * 3})
Because I used a pandas.groupby() process to generate my metadata, it looks like this:
meta_df = pd.DataFrame({'code': codes, 'start_at': [3, 2, 2], 'end_at': [5, 3, 2]})
Now, if my metadata looked like:
meta_dict = {'a': (3, 5), 'b': (1, 3), 'c': (2, 2)}
I could easily write:
def f(row): start_at, end_at = meta_dict[row['code']] if start_at <= row['num'] <= end_at: return 1 return 0 data_df['valid'] = data_df.apply(f, axis=1) print(data_df)
I feel that there should be a different, pandas-oriented, way to directly use the metadata in the meta_df dataframe format that I have, and that it’ll probably be more efficient than looping through the whole of the data_df. But what?
Is there a better approach, or should I just wrangle the values for each code out of meta_df and iterate with my function f() upon the rows of data_df?
Advertisement
Answer
dfn = pd.merge(data_df, meta_df, on='code', how='left') dfn['valid'] = np.where(dfn['num'].between(dfn.start_at, dfn.end_at), 1, 0) print(dfn)
code num start_at end_at valid 0 a 1 3 5 0 1 a 2 3 5 0 2 a 3 3 5 1 3 a 4 3 5 1 4 a 5 3 5 1 5 a 6 3 5 0 6 b 1 2 3 0 7 b 2 2 3 1 8 b 3 2 3 1 9 b 4 2 3 0 10 b 5 2 3 0 11 b 6 2 3 0 12 c 1 2 2 0 13 c 2 2 2 1 14 c 3 2 2 0 15 c 4 2 2 0 16 c 5 2 2 0 17 c 6 2 2 0