I have a dataframe as follows:
import pandas as pd d = {'col1': ['1_a','2_a','3_a','4_a'], 'col2': ['Joe waited for the train.', 'The train was late.','Mary and Samantha took the bus.', 'I looked for Mary and Samantha at the bus station.'], 'col3': ['the weather is nice','the weather is cold', 'i like going out', 'Joe waited for the train.'], 'col4': ['the house looks amazing','his profession is unknown','it is a beautiful day','we just moved to this house']} df = pd.DataFrame(data=d)
I want to add the first column value to a sentence if that sentence is repeated somewhere else in the next three columns. so my desired output would be
col1 | col2 | col3 | col4 |
---|---|---|---|
1_a | 1_aJoe waited for the train. | the weather is nice | the house looks amazing |
2_a | The train was late. | the weather is cold | his profession is unknown |
3_a | Mary and Samantha took the bus. | i like going out | it is a beautiful day |
4_a | I looked for Mary and Samantha at the bus station | 4_aJoe waited for the train. | we just moved to this house |
and this is what I did so far,
from collections import Counter dict= df.set_index('col1').T.to_dict('list') return_dict = {k: list(v) for k, v in dict.items()} counts = Counter([a for v in return_dict.values() for a in v]) for k, v in return_dict.items(): for a in v: if counts[a] > 1: v.append(k+a) v.remove(a)
the problem is that the append would add the sentences to the end of the list, which means it would mess up the original dataframes order (as wanted in the desired output).
Advertisement
Answer
You can actually do some fancy numpy broadcasting here.
search_cols = ['col2', 'col3', 'col4'] index_col = 'col1' x = (df[search_cols].to_numpy() == df[search_cols].to_numpy()[:, :, None, None]) # Generate a list of grids of all the duplicate values anywhere x = (x.sum(axis=3).sum(axis=2) - 1).astype(bool) # Combine the grids, filter for only those with more than one item (so 1 or more duplicates), and convert that to a boolean mask selecting all duplicated cells mask = pd.DataFrame(mask, index=df.index, columns=search_cols) # Convert the boolean mask into a dataframe matching the row labels and col labels of the original dataframe mask = mask.apply(lambda col: df.loc[col, index_col]).reindex(mask.index).fillna('') # Replace all True values in the mask to values from the key col, `col`, and replace all False values with an empty string new_df = mask + df[search_cols]
Output:
>>> new_df col2 col3 col4 0 1_aJoe waited for the train. the weather is nice the house looks amazing 1 The train was late. the weather is cold his profession is unknown 2 Mary and Samantha took the bus. i like going out it is a beautiful day 3 I looked for Mary and Samantha at the bus stat... 4_aJoe waited for the train. we just moved to this house