
How to query/filter cells against single values when cells have multiple values?

I have a CSV file in the following format:

Columns one    Column two
Key1           Value1,Value2,value3
Key2           Value5

I can easily use a list and .isin to filter the DataFrame as follows:

list_keep = ['Value5']

dataframe[dataframe.isin(list_keep).any(axis=1)]

This gives me the second row. But when a cell holds multiple values (like Value1,Value2,value3 in the first row of the example table above), the isin filter no longer matches a single value such as Value1. This makes sense: the quotes in the CSV turn the whole cell into a single string, which I missed because spreadsheets hide the quotes.

For example, when I do this:

list_keep = ['Value1']

dataframe[dataframe.isin(list_keep).any(axis=1)]

Nothing is returned, because the first row holds Value1,Value2,value3 as one single string. The desired outcome is for the first row to be returned.
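A minimal reproduction of the behavior described above may help. It is only a sketch: the frame is built by hand rather than read from the CSV, and it assumes the second row's cell is spelled Value5 so that it matches list_keep exactly (isin is case-sensitive).

```python
import pandas as pd

# Hand-built stand-in for the CSV contents (assumed, not read from a file).
dataframe = pd.DataFrame({
    "Columns one": ["Key1", "Key2"],
    "Column two": ["Value1,Value2,value3", "Value5"],
})

# A cell that holds exactly one value matches as expected.
single = dataframe[dataframe.isin(["Value5"]).any(axis=1)]

# isin compares each whole cell to the list entries, so "Value1" is never
# equal to the full string "Value1,Value2,value3" and no row is selected.
multi = dataframe[dataframe.isin(["Value1"]).any(axis=1)]

print(len(single), len(multi))
```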

IMPORTANT NOTE: I want to query all columns not just one.

So, how can I set this code up so that I can query individual elements within cells?

Is there a way to do this in pandas?


Answer

You can stack the DataFrame to reshape it into a single Series, split and explode the strings so that each value gets its own entry, use isin to test whether each value occurs in list_keep, then group by level=0 (the original row label) and reduce with any to create a boolean mask:

mask = df.stack().str.split(',').explode().isin(list_keep).groupby(level=0).any()
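The one-liner above can be unpacked step by step. This is a sketch under assumptions: the sample frame is hand-built to mirror the question's table, the intermediate names (stacked, tokens, exploded, hits) are illustrative only, and the final reindex is an added guard rather than part of the original answer.

```python
import pandas as pd

# Hand-built sample data mirroring the question's table (assumed values).
df = pd.DataFrame({
    "Columns one": ["Key1", "Key2"],
    "Column two": ["Value1,Value2,value3", "Value5"],
})
list_keep = ["Value1"]

stacked = df.stack()                # Series indexed by (row label, column name)
tokens = stacked.str.split(",")     # each cell becomes a list of pieces
exploded = tokens.explode()         # one piece per entry, index labels repeated
hits = exploded.isin(list_keep)     # True where a piece is in list_keep
mask = hits.groupby(level=0).any()  # True for rows with at least one hit

# Added guard (assumption): make sure every row of df has a mask entry,
# in case some rows are absent from the stacked result.
mask = mask.reindex(df.index, fill_value=False)

print(df[mask])
```

Grouping on level=0 works because stack keeps the original row label as the outer index level, so every exploded piece can still be traced back to its row.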

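Alternative approach with applymap and set operations (this assumes every cell is a string; in pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map):

mask = df.applymap(lambda s: not set(s.split(',')).isdisjoint(list_keep)).any(axis=1)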

>>> df[mask]

  Columns one            Column two
0        Key1  Value1,Value2,value3
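If the data may contain missing or non-string cells, or spaces after the commas, the set-based idea can be wrapped in a small helper. This is a sketch, not part of the original answer: the function name rows_matching, its sep parameter, and the sample data are hypothetical, and it uses apply with Series.map instead of applymap to sidestep the deprecation.

```python
import pandas as pd

def rows_matching(df, keep, sep=","):
    """Hypothetical helper: mask of rows where any cell holds a value in `keep`."""
    keep = set(keep)

    def cell_matches(cell):
        # Missing or non-string cells cannot contain a match.
        if not isinstance(cell, str):
            return False
        # Strip whitespace so "Value1, Value2" splits into clean pieces.
        pieces = {piece.strip() for piece in cell.split(sep)}
        return not pieces.isdisjoint(keep)

    # Apply the check cell by cell, column by column, then reduce across columns.
    return df.apply(lambda col: col.map(cell_matches)).any(axis=1)

# Assumed sample data with a space after a comma and a missing cell.
df = pd.DataFrame({
    "Columns one": ["Key1", "Key2", "Key3"],
    "Column two": ["Value1, Value2,value3", "Value5", None],
})
mask = rows_matching(df, ["Value2"])
print(df[mask])
```

Matching stays case-sensitive here, so value3 and Value3 are treated as different values.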
User contributions licensed under: CC BY-SA