Pandas: str.contains first word followed by colon

I am trying to understand how to identify the values in a column that start with a word which is immediately followed by a colon.

I tried specifying that it should match fewer than 9 characters followed by a colon, but with no luck.

The example is below:

  1. Michael: this should be picked up pandas
  2. This should not be picked up by pandas :(
  3. This should not: be picked up by pandas either.

I have tried multiple ways with str.contains and str.match but can’t seem to find a solution. Any advice will be greatly appreciated!

Thanks.

Answer

str.match accepts a regular expression and only matches at the start of each string. It sounds like you want to match one or more consecutive letters, upper or lower case, followed immediately by a colon and then a space, and you don’t care what comes afterwards. In that case, try the code below.

import pandas as pd

df = pd.DataFrame(
    ['Michael: this should be picked up pandas',
     'This should not be picked up by pandas :(',
     'This should not: be picked up by pandas either.'],
    columns=['TestColumn']
    )
df['StartsWithWord'] = df.TestColumn.str.match(r'[A-Za-z]+: .*')

print(df)

This results in the following output.

                                        TestColumn  StartsWithWord
0         Michael: this should be picked up pandas            True
1        This should not be picked up by pandas :(           False
2  This should not: be picked up by pandas either.           False
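
If you would rather use str.contains, or want to keep the "fewer than 9 characters" limit from the question, something along these lines may work. This is only a sketch: the {1,8} bound, the added None row, and the na=False handling are assumptions on my part, not something taken from the original data.

```python
import pandas as pd

# Same three example strings, plus a missing value (assumed here for
# illustration) to show how na=False treats it.
s = pd.Series([
    'Michael: this should be picked up pandas',
    'This should not be picked up by pandas :(',
    'This should not: be picked up by pandas either.',
    None,
])

# str.contains searches anywhere in the string, so the pattern needs ^
# to pin it to the start. {1,8} limits the first word to 1-8 letters.
mask_contains = s.str.contains(r'^[A-Za-z]{1,8}: ', regex=True, na=False)

# str.match already anchors at the start, so ^ is not needed there.
mask_match = s.str.match(r'[A-Za-z]{1,8}: ', na=False)

print(mask_contains.tolist())  # [True, False, False, False]
print(mask_match.tolist())     # [True, False, False, False]
```

Without the ^, str.contains would also flag the third row, because "not: " appears in the middle of that string.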
User contributions licensed under: CC BY-SA