Skip to content
Advertisement

pandas: manage duplicated sentences on different columns

I have a dataframe as follows:

import pandas as pd

d = {'col1': ['1_a','2_a','3_a','4_a'], 'col2': ['Joe waited for the train.', 'The train was late.','Mary and Samantha took the bus.', 'I looked for Mary and Samantha at the bus station.'],
     'col3': ['the weather is nice','the weather is cold', 'i like going out', 'Joe waited for the train.'], 'col4': ['the house looks amazing','his profession is unknown','it is a beautiful day','we just moved to this house']}

df = pd.DataFrame(data=d)

I want to add the first column value to a sentence if that sentence is repeated somewhere else in the next three columns. so my desired output would be

col1 col2 col3 col4
1_a 1_aJoe waited for the train. the weather is nice the house looks amazing
2_a The train was late. the weather is cold his profession is unknown
3_a Mary and Samantha took the bus. i like going out it is a beautiful day
4_a I looked for Mary and Samantha at the bus station 4_aJoe waited for the train. we just moved to this house

and this is what I did so far,

from collections import Counter

dict= df.set_index('col1').T.to_dict('list')
return_dict = {k: list(v) for k, v in dict.items()}
counts = Counter([a for v in return_dict.values() for a in v])

for k, v in return_dict.items():
    for a in v:
        if counts[a] > 1:
            v.append(k+a)
            v.remove(a)

the problem is that the append would add the sentences to the end of the list, which means it would mess up the original dataframes order (as wanted in the desired output).

Advertisement

Answer

You can actually do some fancy numpy broadcasting here.

search_cols = ['col2', 'col3', 'col4']
index_col = 'col1'

x = (df[search_cols].to_numpy() == df[search_cols].to_numpy()[:, :, None, None]) # Generate a list of grids of all the duplicate values anywhere
x = (x.sum(axis=3).sum(axis=2) - 1).astype(bool) # Combine the grids, filter for only those with more than one item (so 1 or more duplicates), and convert that to a boolean mask selecting all duplicated cells

mask = pd.DataFrame(mask, index=df.index, columns=search_cols) # Convert the boolean mask into a dataframe matching the row labels and col labels of the original dataframe
mask = mask.apply(lambda col: df.loc[col, index_col]).reindex(mask.index).fillna('') # Replace all True values in the mask to values from the key col, `col`, and replace all False values with an empty string

new_df = mask + df[search_cols]

Output:

>>> new_df
                                                col2                          col3                         col4
0                       1_aJoe waited for the train.           the weather is nice      the house looks amazing
1                                The train was late.           the weather is cold    his profession is unknown
2                    Mary and Samantha took the bus.              i like going out        it is a beautiful day
3  I looked for Mary and Samantha at the bus stat...  4_aJoe waited for the train.  we just moved to this house
User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement