I have two dataframes: one dataframe consists of two columns (‘good’ and bad’) and another one that contains text data.
df_dictionary = pd.DataFrame({'good': ['love', 'like'], 'bad': ['dislike', 'hate']}) df_text = pd.DataFrame({'col1': ['i love cats', 'i hate dogs']})
Now I would like to retrieve exact string matches of words that are in the dictionary and are contained in col1 of df_text and assign the string match to the second column of df_text.
I tried .isin()
, however this code only shows exact string matches if the whole phrase matches and not if the word is contained in the sentence.
df_text should then look as follows:
col1 | string_match_good | string_match_bad |
---|---|---|
i love cats | love | |
i hate dogs | hate |
I do not want partial string matches, e.g. if col1 says 'i loved cats'
, then I do not want a string match.
I found the following:
matches = df_text[df_text['col1'].str.contains(fr"b(?:{'|'.join(df_dictionary)})b")]
, however this one does not print the matched words (i.e. good or bad) in the string_match columns.
Does anyone have a solution to it?
Advertisement
Answer
I think the data structure is not ideal, specifically because your text values are conceptually several values in one (i.e., lists of tokens/words) but pandas works best with one value per cell. Here’s how I’d approach it:
- Explode the strings such that you get one word per cell.
df_text = ( df_text.col1.str.split() # split into single words .explode() # explode them to one word per cell .rename_axis("sent_index") # rename the index for later .reset_index() # set the sent_index as its own column )
Intermediary result:
sent_index col1 0 0 i 1 0 love 2 0 cats 3 1 i 4 1 hate 5 1 dogs
- Now you can merge
col1
withdf_dictionary
, once for each of the two labelsgood
andbad
:
for label in ["good", "bad"]: df_text = df_text.merge(df_dictionary[label], left_on="col1", right_on=label, how="left")
Now df_text
looks like this:
sent_index col1 good bad 0 0 i NaN NaN 1 0 love love NaN 2 0 cats NaN NaN 3 1 i NaN NaN 4 1 hate NaN hate 5 1 dogs NaN NaN
AFAICT, this should already contain all the information you need.
- Re-combine the words into sentences, using the
sent_index
we set earlier.
df_final = (df_text.groupby("sent_index") .agg(list) .applymap(lambda s: ' '.join(w for w in s if not pd.isna(w))) )
The final result then is:
col1 good bad sent_index 0 i love cats love 1 i hate dogs hate
Note that in case of multiple matches, you’d get the labels as joined strings, too. E.g., I dislike dogs but don't hate them
would occur as 'dislike hate'
in the bad
column. Whether or not that’s alright depends on your next steps. Note that this is no problem for the data structure in step 2.