I’m trying to filter a dataframe by the first row, but can’t seem to figure out how to do it.
Here’s a sample version of the data I’m working with:
In [11]: df = pd.DataFrame( ...: [['Open-Ended Response', 'Open-Ended Response', 'Response', 'Response'], [1, 2, 3, 4]], ...: columns=list('ABCD'), ...: ) In [12]: df Out[12]: A B C D 0 Open-Ended Response Open-Ended Response Response Response 1 1 2 3 4
What I want to do is filter for all columns that start with “Response” in the first non-header row. So in this case, just have the last two columns in there own dataframe.
I can easily filter the header with something like this:
respo = [col for col in df if col.startswith('Response')]
But it doesn’t seem to work if it’s the 1t non-header row. Importantly, I need to keep the current header after I filter.
Thank you.
Advertisement
Answer
First step is to select the values of the first row:
df.iloc[0] # selects the values in the first row
Then, use python’s .str
StringAccessor methods for working with data values rather than column names:
df.iloc[0].str.startswith('Response') # Test the result of the above line
This will give you a Series with True/False values indexed by column name. Finally, use this to select the columns from your dataframe based on the matched labels:
df.loc[:, df.iloc[0].str.startswith('Response')] # Select columns based on the test
This should do the trick!
See pandas’s docs on Indexing and Selecting Data and the StringAccessor methods for more help.