Rows That Are Included/Contained In a String


I have a pandas DataFrame.
There are plenty of Q&As that explain how to select rows that contain a given substring.
But I'm curious about the reverse: how to select rows whose value is a substring of a given string.


My real data is huge, but suppose we have a column whose entries are single words.
For a given sentence, we should return the rows whose words appear in that sentence. A simple example:

import pandas as pd

df = pd.DataFrame({'Words': ['I', 'have', 'a', 'Pandas', 'Data', 'Frame']})

And the given sentence is:

s = 'You have one Pandas array Frame'

Now I need something like this:

df_s = df[df['Words'] in s]

The expected result is:

df_s = pd.DataFrame({'Words': ['have', 'Pandas', 'Frame']})

Answer

apply can be used to apply a function to every row (or column) of a DataFrame, or to every element of a Series. It should be used with caution, because as soon as you apply a Python function you lose vectorization and performance drops. Yet it is an appropriate tool here.

df['Words'] in s should be written: df['Words'].apply(lambda x: x in s), and you end up with:

print(df[df['Words'].apply(lambda x: x in s)])
    Words
1    have
2       a
3  Pandas
5   Frame

Here we have kept the 'a', because it is indeed a substring of s (it occurs inside 'have' and 'array'). If you want to match only whole words, you should split the sentence and compare full words:

s = 'You have one Pandas array Frame'.split()

The same expression now gives the expected result:

    Words
1    have
3  Pandas
5   Frame
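
Since the question mentions that the real data is huge, here is a minimal sketch of the whole-word match as a complete script, together with a vectorized alternative. It swaps apply for Series.isin, which is not part of the answer above, and it assumes the sentence can be split on whitespace. The names sentence, words, df_apply and df_isin are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Words': ['I', 'have', 'a', 'Pandas', 'Data', 'Frame']})
sentence = 'You have one Pandas array Frame'

# Split on whitespace and keep the distinct words in a set,
# so that each membership test is a hash lookup.
words = set(sentence.split())

# Whole-word match with apply, as described in the answer.
df_apply = df[df['Words'].apply(lambda x: x in words)]

# Same selection with Series.isin, which avoids calling a
# Python function once per row.
df_isin = df[df['Words'].isin(words)]

print(df_isin)
```

Both selections keep the rows at index 1, 3 and 5. Note that this comparison is case-sensitive and does not strip punctuation, so a sentence such as 'I have a Frame.' would not match 'Frame' without extra cleaning.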


Source: stackoverflow