I have a CSV file in the following format:
Columns one | Column two |
---|---|
Key1 | Value1,Value2,value3 |
Key2 | value5 |
I can easily use a list and .isin to filter the DataFrame as follows:
list_keep = ['value5']
dataframe[dataframe.isin(list_keep).any(axis=1)]
This gives me the second row. But if a cell holds multiple values (like Value1,Value2,value3 in the first row of the example table above), the isin filter no longer works for a single value such as Value1. This makes sense, since the "" quoting in the CSV turns them into a single string, which I missed because spreadsheets hide the "" quotes.
For example, when I do this:
list_keep = ['Value1']
dataframe[dataframe.isin(list_keep).any(axis=1)]
nothing is returned, because the first row holds Value1,Value2,value3 as one single string. The desired outcome is for the first row to be returned.
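A minimal reproduction of this behaviour, assuming the file is loaded with pandas.read_csv. The CSV text below is a hypothetical stand-in for the real file, built from the example table:

```python
import io

import pandas as pd

# Hypothetical CSV text mirroring the example table. The double quotes keep
# the comma-separated values together in a single field.
csv_text = 'Columns one,Column two\nKey1,"Value1,Value2,value3"\nKey2,value5\n'
dataframe = pd.read_csv(io.StringIO(csv_text))

# The quoted field arrives as one string, not as three separate values.
first_cell = dataframe.loc[0, "Column two"]
print(repr(first_cell))

# isin compares whole cell contents, so 'Value1' on its own matches nothing.
list_keep = ["Value1"]
result = dataframe[dataframe.isin(list_keep).any(axis=1)]
print(len(result))
```

The filter comes back empty because no cell is exactly equal to Value1.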
IMPORTANT NOTE: I want to query all columns, not just one.
So, how can I set this code up so that I can query individual elements within cells?
Is there a way to do this in pandas?
Answer
You can stack the dataframe to reshape it, then split and explode the strings and use isin to test for occurrence of the strings in list_keep, then groupby on level=0 and reduce with any to create a boolean mask:
mask = df.stack().str.split(',').explode().isin(list_keep).groupby(level=0).any()
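To see what each step contributes, here is the same chain broken into named intermediate steps. This is a sketch that assumes a small df built from the example table in the question and assumes every cell is a string; the intermediate variable names are only for illustration:

```python
import pandas as pd

# Assumed sample data, taken from the table in the question.
df = pd.DataFrame({
    "Columns one": ["Key1", "Key2"],
    "Column two": ["Value1,Value2,value3", "value5"],
})
list_keep = ["Value1"]

# 1. stack: one long Series indexed by (row label, column name).
stacked = df.stack()

# 2. split + explode: one entry per comma-separated value; the
#    (row label, column name) index is repeated for each piece.
pieces = stacked.str.split(",").explode()

# 3. isin: True where an individual piece is in list_keep.
hits = pieces.isin(list_keep)

# 4. groupby on the row label (level 0) and reduce with any:
#    a row is kept if at least one of its pieces matched.
mask = hits.groupby(level=0).any()

filtered = df[mask]
print(filtered)
```

Because the mask is indexed by the original row labels, it can be passed straight to df[...].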
Alternative approach with applymap and set operations:
mask = df.applymap(lambda s: not set(s.split(',')).isdisjoint(list_keep)).any(axis=1)
>>> df[mask]
  Columns one            Column two
0        Key1  Value1,Value2,value3
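A variant of the set-based approach, sketched under two assumptions: some cells may not be strings (for example NaN from empty CSV fields), and applymap may emit a deprecation warning on recent pandas, where it was deprecated in favour of DataFrame.map in version 2.1. The helper name cell_matches is hypothetical, and the per-column Series.map is used here only because it is available across pandas versions:

```python
import pandas as pd


def cell_matches(cell, wanted):
    """Return True if any comma-separated piece of `cell` is in `wanted`."""
    if not isinstance(cell, str):
        # Non-string cells (NaN, numbers) are treated as "no match".
        return False
    return not set(cell.split(",")).isdisjoint(wanted)


# Assumed sample data, taken from the table in the question.
df = pd.DataFrame({
    "Columns one": ["Key1", "Key2"],
    "Column two": ["Value1,Value2,value3", "value5"],
})
list_keep = ["Value2", "value5"]

# Apply the helper to every cell, column by column, then keep rows
# where at least one cell matched.
mask = df.apply(lambda col: col.map(lambda c: cell_matches(c, list_keep))).any(axis=1)
filtered = df[mask]
print(filtered)
```

With this list_keep both rows are kept, since Value2 appears in the first row and value5 in the second. Matching stays case-sensitive, as in the original code.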