I’d appreciate your help. I have a pandas dataframe. I want to search 3 columns of the dataframe using a regular expression, then return all rows that meet the search criteria, sorted by one of my columns. I would like to write this as a function so I can implement this logic with other criteria if possible, but am not quite sure how to do this.
For example, I know how pull the results of a search thusly (with col1 being a column name):
idx1 = df.col1.str.contains(r'vhigh|high', flags=re.IGNORECASE, regex=True, na=False) print df[~idx1]
but I can’t figure out how to take this type of action, and perform it with multiple columns and then sort. Anyone have any tips?
Advertisement
Answer
You can use apply
to make the code more concise. For example, given this DataFrame:
df = pd.DataFrame( { 'col1': ['vhigh', 'low', 'vlow'], 'col2': ['eee', 'low', 'high'], 'val': [100,200,300] } ) print df
Input:
col1 col2 val 0 vhigh eee 100 1 low low 200 2 vlow high 300
You can select all the rows that contain the strings vhigh
or high
in columns col1
or col2
as follow:
mask = df[['col1', 'col2']].apply( lambda x: x.str.contains( 'vhigh|high', regex=True ) ).any(axis=1) print df[mask]
The apply
function applies the contains
function on each column (since by default axis=0
). The any
function returns a Boolean mask, with element True indicating that at least one of the columns met the search criteria. This can then be used to perform selection on the original DataFrame.
Output:
col1 col2 val 0 vhigh eee 100 2 vlow high 300
Then, to sort the result by a column, e.g. the val
column, you could simply do:
df[mask].sort('val')