print strings of one dataframe contained in another dataframe

Question

I have two dataframes: one consists of two columns ('good' and 'bad'), and the other contains text data. Now I would like to retrieve exact string matches of words that are in the dictionary and are contained in col1 of df_text and assign the string match to the second column …

Accepted Answer

I think the data structure is not ideal, specifically because your text values are conceptually several values in one (i.e., lists of tokens/words), but pandas works best with one value per cell. Here's how I'd approach it:

1. Explode the strings such that you get one word per cell.

    df_text = (
        df_text.col1.str.split()    # split into single words
        .explode()                  # explode them to one word per cell
        .rename_axis("sent_index")  # rename the index for later
        .reset_index()              # set the sent_index as its own column
    )

Intermediary result:

       sent_index  col1
    0           0     i
    1           0  love
    2           0  cats
    3           1     i
    4           1  hate
    5           1  dogs

2. Now you can merge col1 with df_dictionary, once for each of the two labels good and bad:

    for label in ["good", "bad"]:
        df_text = df_text.merge(df_dictionary[label],
                                left_on="col1",
                                right_on=label,
                                how="left")

Now df_text looks like this:

       sent_index  col1  good   bad
    0           0     i   NaN   NaN
    1           0  love  love   NaN
    2           0  cats   NaN   NaN
    3           1     i   NaN   NaN
    4           1  hate   NaN  hate
    5           1  dogs   NaN   NaN

AFAICT, this should already contain all the information you need.

3. Re-combine the words into sentences, using the sent_index we set earlier.

    import pandas as pd

    # Note: applymap is deprecated since pandas 2.1; DataFrame.map is its replacement.
    df_final = (df_text.groupby("sent_index")
                .agg(list)
                .applymap(lambda s: ' '.join(w for w in s if not pd.isna(w)))
               )

The final result then is:

                       col1  good   bad
    sent_index
    0           i love cats  love
    1           i hate dogs        hate

Note that in case of multiple matches, you'd get the labels as joined strings, too. E.g., "I dislike dogs but don't hate them" would occur as 'dislike hate' in the bad column. Whether or not that's alright depends on your next steps.
Note that this is no problem for the data structure in step 2.
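
For reference, the three steps above can be put together into one self-contained sketch. The sample frames, the helper names `label_matches` and `join_words`, and the extra words in the dictionary are assumptions made for illustration; they are not taken from the question. Two small deviations from the answer's code are also assumptions: each dictionary column is de-duplicated and stripped of NaN before merging (so a repeated dictionary entry cannot duplicate rows), and the joining is done directly inside `agg` rather than via `applymap`.

```python
import pandas as pd

# Hypothetical sample data, modelled on the example in the answer.
df_text = pd.DataFrame({"col1": ["i love cats", "i hate dogs"]})
df_dictionary = pd.DataFrame({"good": ["love", "like"],
                              "bad": ["hate", "dislike"]})


def join_words(s):
    """Join the non-missing words of one group into a single string."""
    return " ".join(s.dropna())


def label_matches(df_text, df_dictionary, labels=("good", "bad")):
    # Step 1: one word per row, remembering which sentence it came from.
    words = (
        df_text["col1"].str.split()
        .explode()
        .rename_axis("sent_index")
        .reset_index()
    )
    # Step 2: left-merge against each label column; non-matches become NaN.
    for label in labels:
        # Assumption: drop NaN and duplicates so the merge cannot add rows.
        vocab = df_dictionary[[label]].dropna().drop_duplicates()
        words = words.merge(vocab, left_on="col1", right_on=label, how="left")
    # Step 3: re-combine the words (and the matched labels) per sentence.
    return words.groupby("sent_index").agg(join_words)


df_final = label_matches(df_text, df_dictionary)
print(df_final)

# A sentence with several matches yields a joined string in that column.
df_multi = pd.DataFrame({"col1": ["i dislike dogs but don't hate them"]})
multi = label_matches(df_multi, df_dictionary)
print(multi.loc[0, "bad"])  # dislike hate
```

If the joined strings are inconvenient for later processing, returning the merged `words` frame from step 2 instead keeps one match per row.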

Answer