I am trying to understand how to identify the rows of a column whose text starts with a word that is immediately followed by a colon.
I tried specifying that it should match fewer than 9 characters followed by a colon, but with no luck.
The example is below:
- Michael: this should be picked up pandas
- This should not be picked up by pandas :(
- This should not: be picked up by pandas either.
I have tried multiple ways with str.contains and str.match but can’t seem to find a solution. Any advice will be greatly appreciated!
Thanks.
Answer
str.match will accept a regular expression, and it only matches at the beginning of each string. It seems like you want to match a sequence of one or more consecutive letters, upper or lower case, followed immediately by a colon and then a space, and you don’t care what comes afterwards. In that case, try the code below.
import pandas as pd

df = pd.DataFrame(
    ['Michael: this should be picked up pandas',
     'This should not be picked up by pandas :(',
     'This should not: be picked up by pandas either.'],
    columns=['TestColumn']
)
df['StartsWithWord'] = df.TestColumn.str.match(r'[A-Za-z]+: .*')

print(df)
This results in the following output.
                                        TestColumn  StartsWithWord
0         Michael: this should be picked up pandas            True
1        This should not be picked up by pandas :(           False
2  This should not: be picked up by pandas either.           False
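The question also mentions limiting the word to fewer than 9 characters and trying str.contains, so here is a rough sketch of both variants. It assumes the limit applies only to the letters before the colon; the fourth row and the names `short_word` and `with_contains` are made up for illustration and are not part of the original example.

```python
import pandas as pd

s = pd.Series([
    'Michael: this should be picked up pandas',
    'This should not be picked up by pandas :(',
    'This should not: be picked up by pandas either.',
    'Christopher: eleven letters before the colon',  # hypothetical extra row
])

# {1,8} restricts the leading word to between 1 and 8 letters,
# i.e. fewer than 9 characters before the colon.
short_word = s.str.match(r'[A-Za-z]{1,8}: ')

# str.contains searches anywhere in the string, so the pattern
# needs an explicit ^ anchor to behave like str.match here.
with_contains = s.str.contains(r'^[A-Za-z]+: ', regex=True)

print(short_word.tolist())
print(with_contains.tolist())
```

Without the ^ anchor, str.contains would also flag any row that has a word followed by a colon and a space somewhere in the middle, which is probably why it did not seem to work in the question.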