
Print strings of one dataframe that are contained in another dataframe

I have two dataframes: a dictionary dataframe with two columns ('good' and 'bad'), and a second dataframe that contains text data.

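As a minimal sketch, the two dataframes could be set up like this. The names df_dictionary and df_text and the column col1 are the ones used in this question; the sample values are assumed from the expected output further down, and like and dislike are made-up extra entries.

```python
import pandas as pd

# Dictionary of labelled words (values are assumed for illustration)
df_dictionary = pd.DataFrame({
    "good": ["love", "like"],
    "bad": ["hate", "dislike"],
})

# Text data: one sentence per row in col1
df_text = pd.DataFrame({"col1": ["i love cats", "i hate dogs"]})
```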

Now I would like to find the dictionary words that occur as exact matches in col1 of df_text, and write each matched word to the corresponding string_match column of df_text.

I tried .isin(), but it only reports a match if the whole phrase equals a dictionary word, not if the word is contained in the sentence.

df_text should then look as follows:

col1          string_match_good   string_match_bad
i love cats   love
i hate dogs                       hate

I do not want partial string matches, e.g. if col1 says 'i loved cats', then I do not want a string match.

I found the following: matches = df_text[df_text['col1'].str.contains(fr"\b(?:{'|'.join(df_dictionary)})\b")], however this does not print the matched words (i.e. good or bad) in the string_match columns.
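To make the limitation concrete, here is a small sketch of what that str.contains call does once the pattern uses \b word boundaries. Iterating over a DataFrame yields its column names, so this sketch joins the column values instead. The sample data, including the extra row 'i loved cats', is assumed.

```python
import re
import pandas as pd

# Assumed sample data
df_dictionary = pd.DataFrame({"good": ["love", "like"], "bad": ["hate", "dislike"]})
df_text = pd.DataFrame({"col1": ["i love cats", "i hate dogs", "i loved cats"]})

# Join the dictionary words themselves (not the column names) into one pattern
words = pd.concat([df_dictionary["good"], df_dictionary["bad"]]).dropna()
pattern = r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"

# str.contains only returns a boolean mask, so it can filter rows
# but cannot tell which word matched
mask = df_text["col1"].str.contains(pattern)
matches = df_text[mask]
```

The mask keeps the first two rows and drops 'i loved cats', but matches still has only the col1 column: nothing records which word matched.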

Does anyone have a solution to it?


Answer

I think the data structure is not ideal, specifically because your text values are conceptually several values in one (i.e., lists of tokens/words) but pandas works best with one value per cell. Here’s how I’d approach it:

  1. Explode the strings such that you get one word per cell.
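A minimal sketch of this step, assuming the sample df_text from the question and plain whitespace tokenisation via str.split(). The column name sent_index is the one referred to in the last step; how it was originally created is an assumption.

```python
import pandas as pd

# Assumed sample data from the question
df_text = pd.DataFrame({"col1": ["i love cats", "i hate dogs"]})

# Keep the original row number so the words can be re-assembled later
df_text = df_text.rename_axis("sent_index").reset_index()

# Split each sentence into a list of words, then give every word its own row
df_text["col1"] = df_text["col1"].str.split()
df_text = df_text.explode("col1", ignore_index=True)
```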

Intermediary result:

  2. Now you can merge col1 with df_dictionary, once for each of the two labels good and bad:
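A sketch of the merge, starting from an exploded frame written out by hand so the block runs on its own. The dictionary contents are assumed, and a left merge with left_on/right_on is one way to do it, not necessarily the exact call used originally.

```python
import pandas as pd

# Assumed dictionary
df_dictionary = pd.DataFrame({"good": ["love", "like"], "bad": ["hate", "dislike"]})

# One word per row, with the sentence each word came from
df_text = pd.DataFrame({
    "sent_index": [0, 0, 0, 1, 1, 1],
    "col1": ["i", "love", "cats", "i", "hate", "dogs"],
})

# Left-merge once per label: a word found in that dictionary column is
# copied into a new column of the same name, all other rows get NaN
for label in ["good", "bad"]:
    vocab = df_dictionary[[label]].dropna().drop_duplicates()
    df_text = df_text.merge(vocab, how="left", left_on="col1", right_on=label)
```

Because the merge compares whole words, 'loved' would not match 'love', which is the exact-match behaviour asked for.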

Now df_text looks like this:


AFAICT, this should already contain all the information you need.

  3. Re-combine the words into sentences, using the sent_index we set earlier.
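A sketch of the re-combination, again starting from a hand-written version of the merged frame so it runs on its own. The helper join_nonempty is a hypothetical name, and the renaming to string_match_good / string_match_bad follows the column names in the question.

```python
import pandas as pd

# Merged frame: one word per row, with the label columns from the merge
df_text = pd.DataFrame({
    "sent_index": [0, 0, 0, 1, 1, 1],
    "col1": ["i", "love", "cats", "i", "hate", "dogs"],
    "good": [None, "love", None, None, None, None],
    "bad": [None, None, None, None, "hate", None],
})

def join_nonempty(values):
    """Join the non-empty strings of one group with single spaces."""
    return " ".join(v for v in values if v)

# Replace missing labels by empty strings, then join per sentence
df_text = (
    df_text.fillna("")
    .groupby("sent_index")[["col1", "good", "bad"]]
    .agg(join_nonempty)
    .rename(columns={"good": "string_match_good", "bad": "string_match_bad"})
    .reset_index(drop=True)
)
```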

The final result then is:


Note that in case of multiple matches, you’d get the labels as joined strings, too. E.g., 'I dislike dogs but don't hate them' would show up as 'dislike hate' in the bad column. Whether or not that’s alright depends on your next steps. This is not a problem for the data structure in step 2, which keeps one word per row.
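To see the multi-match behaviour, the three steps can be wrapped into one helper. This is only a sketch: label_matches is a hypothetical name, the dictionary contents are assumed, and tokenisation is plain whitespace splitting.

```python
import pandas as pd

def label_matches(df_text, df_dictionary, labels=("good", "bad")):
    """Explode col1 into words, merge per label, re-join per sentence."""
    words = df_text.rename_axis("sent_index").reset_index()
    words["col1"] = words["col1"].str.split()
    words = words.explode("col1", ignore_index=True)
    for label in labels:
        vocab = df_dictionary[[label]].dropna().drop_duplicates()
        words = words.merge(vocab, how="left", left_on="col1", right_on=label)
    return (
        words.fillna("")
        .groupby("sent_index")[["col1", *labels]]
        .agg(lambda s: " ".join(w for w in s if w))
    )

# Assumed dictionary and the example sentence from the note above
df_dictionary = pd.DataFrame({"good": ["love", "like"], "bad": ["hate", "dislike"]})
df_text = pd.DataFrame({"col1": ["I dislike dogs but don't hate them"]})

result = label_matches(df_text, df_dictionary)
```

Here result.loc[0, "bad"] is the joined string 'dislike hate', while the good column is empty for that sentence.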

User contributions licensed under: CC BY-SA