I am trying to understand how to identify values in a pandas column that start with a word immediately followed by a colon.
I tried specifying that it should match fewer than 9 characters followed by a colon, but with no luck.
The example is below:
- Michael: this should be picked up pandas
- This should not be picked up by pandas :(
- This should not: be picked up by pandas either.
I have tried multiple ways with str.contains and str.match but can’t seem to find a solution. Any advice will be greatly appreciated!
Thanks.
Answer
str.match accepts a regular expression and only matches at the beginning of each string. It seems like you want to match a sequence of one or more consecutive letters, upper or lower case, followed immediately by a colon and then a space, and you don’t care what comes afterwards. In that case, try the code below.
import pandas as pd

df = pd.DataFrame(
    ['Michael: this should be picked up pandas',
     'This should not be picked up by pandas :(',
     'This should not: be picked up by pandas either.'],
    columns=['TestColumn']
)

df['StartsWithWord'] = df.TestColumn.str.match(r'[A-Za-z]+: .*')
print(df)
This results in the following output.
                                        TestColumn  StartsWithWord
0         Michael: this should be picked up pandas            True
1        This should not be picked up by pandas :(           False
2  This should not: be picked up by pandas either.           False
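If you also want the length limit mentioned in the question, a bounded quantifier is one way to express it. The sketch below assumes that "less than 9 characters" means a leading word of 1 to 8 letters; the names short_word_mask and matching_rows are made up for illustration. It also uses the boolean result to keep only the matching rows.

```python
import pandas as pd

df = pd.DataFrame(
    ['Michael: this should be picked up pandas',
     'This should not be picked up by pandas :(',
     'This should not: be picked up by pandas either.'],
    columns=['TestColumn'],
)

# Assumption: "less than 9 characters" means a leading word of 1 to 8 letters.
# str.match anchors at the start of each string, so no leading ^ is needed.
short_word_mask = df['TestColumn'].str.match(r'[A-Za-z]{1,8}: ')

# Use the boolean mask to keep only rows that start with such a word.
matching_rows = df[short_word_mask]
print(matching_rows)
```

Only the first row survives the filter here. Adjust the {1,8} bounds if your length limit is meant differently.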