Search and filter pandas dataframe with regular expressions

Tags: , ,



I’d appreciate your help. I have a pandas dataframe. I want to search 3 columns of the dataframe using a regular expression, then return all rows that meet the search criteria, sorted by one of my columns. I would like to write this as a function so I can implement this logic with other criteria if possible, but am not quite sure how to do this.

For example, I know how pull the results of a search thusly (with col1 being a column name):

idx1 = df.col1.str.contains(r'vhigh|high', flags=re.IGNORECASE, regex=True, na=False)

print df[~idx1]

but I can’t figure out how to take this type of action, and perform it with multiple columns and then sort. Anyone have any tips?

Answer

You can use apply to make the code more concise. For example, given this DataFrame:

df = pd.DataFrame(
    {
        'col1': ['vhigh', 'low', 'vlow'],
        'col2': ['eee', 'low', 'high'],
        'val': [100,200,300]
    }
)
print df

Input:

    col1  col2  val
0  vhigh   eee  100
1    low   low  200
2   vlow  high  300

You can select all the rows that contain the strings vhigh or high in columns col1 or col2 as follow:

mask = df[['col1', 'col2']].apply(
    lambda x: x.str.contains(
        'vhigh|high',
        regex=True
    )
).any(axis=1)
print df[mask]

The apply function applies the contains function on each column (since by default axis=0). The any function returns a Boolean mask, with element True indicating that at least one of the columns met the search criteria. This can then be used to perform selection on the original DataFrame.

Output:

    col1  col2  val
0  vhigh   eee  100
2   vlow  high  300

Then, to sort the result by a column, e.g. the val column, you could simply do:

df[mask].sort('val')


Source: stackoverflow