I have a “Pandas Data Frame”:
There are plenty of Q&As that explain how to select rows that contain a given substring.
But I'm curious about the reverse: how to select rows whose values are substrings of a given string.
My data is huge, but suppose we have a column whose entries are single words.
For a given sentence, we should return the rows whose words appear in that sentence. A simple example:
df = pd.DataFrame({'Words': ['I', 'have', 'a', 'Pandas', 'Data', 'Frame']})
And the given sentence is:
s = 'You have one Pandas array Frame'
Now I need something like this:
df_s = df[df['Words'] in s]
That is, the result should be:
df_s = pd.DataFrame({'Words': ['have', 'Pandas', 'Frame']})
Answer
apply can be used to apply a function to every row (or column) of a dataframe. It should be used with caution, because as soon as you apply a Python function you lose vectorization and performance drops. Yet it is an appropriate tool here.
df['Words'] in s should be written df['Words'].apply(lambda x: x in s), and you end up with:
print(df[df['Words'].apply(lambda x: x in s)])

    Words
1    have
2       a
3  Pandas
5   Frame
Here we have kept the 'a', because it is indeed a substring of s. If you want to keep only whole words, you should use split and compare full words:
s = 'You have one Pandas array Frame'.split()
It now gives the expected result:

    Words
1    have
3  Pandas
5   Frame
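
Since the data is described as huge and apply gives up vectorization, a possible alternative for the whole-word case is Series.isin, which tests membership without a Python-level lambda. The following is a minimal sketch, assuming that splitting the sentence on whitespace is an acceptable definition of a "word"; the names words and df_s are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Words': ['I', 'have', 'a', 'Pandas', 'Data', 'Frame']})
s = 'You have one Pandas array Frame'

# Split the sentence once into whole words (assumption: whitespace-separated).
words = set(s.split())

# isin builds a boolean mask: True where the cell equals one of the words.
df_s = df[df['Words'].isin(words)]
print(df_s)
```

Note that this only covers whole-word matching. For the substring behaviour (the variant that also keeps 'a'), the apply version above is still needed.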