
Python Pandas find rows that match a pattern using column first characters and a set of values to match

I have a DataFrame sorted by company_name:

    company_name
0     A
1     AA
2     abcd
3     abcdu
4     abcdw
5     efg
6     efgu
7     zvttu
8     zvttw
     

I would like to select the rows that share their first 3 letters and whose following rows end with “u” or “w”.

Ideally, the result would look like this (including the “main” name as an extra column):

    company_name  main_name
0     abcd        abcd       
1     abcdu       abcd
2     abcdw       abcd
3     efg         efg
4     efgu        efg

Assume that the start of the company_name has to contain u or w; the end of the name can differ.


Answer

Let’s try:

# extract the main name by stripping a trailing `u` or `w`
s = df.company_name.str.extract(r'(.*)[uw]$', expand=False)

# names without that suffix are already main names
company_names = s.fillna(df.company_name)

# valid names are those that appear both alone and with a trailing `u` or `w`
valid_names = s.isna().groupby(company_names).transform('nunique') == 2

# keep the main name for valid rows, NaN elsewhere
df['main_name'] = company_names.where(valid_names)

Output:

  company_name main_name
0         abcd      abcd
1        abcdu      abcd
2        abcdw      abcd
3          efg       efg
4         efgu       efg
5        zvttu       NaN
6        zvttw       NaN
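
The output above keeps the unmatched rows and marks them with NaN, while the question asks for only the matching rows. A minimal, self-contained sketch of the same approach that also drops the unmatched rows might look like the following. The sample data is copied from the question; the names `stripped`, `base`, `valid` and `result` are only illustrative, and the final `dropna`/`reset_index` step is an addition that is not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "company_name": ["A", "AA", "abcd", "abcdu", "abcdw",
                     "efg", "efgu", "zvttu", "zvttw"]
})

# Strip one trailing "u" or "w"; NaN where the name has no such suffix.
stripped = df["company_name"].str.extract(r"(.*)[uw]$", expand=False)

# Names without the suffix are their own base name.
base = stripped.fillna(df["company_name"])

# A group is valid when it holds both a bare name and a suffixed variant,
# i.e. the "has no suffix" flag takes two distinct values within the group.
valid = stripped.isna().groupby(base).transform("nunique") == 2

# Assumption: the unmatched rows should be dropped and the index renumbered.
result = (
    df.assign(main_name=base.where(valid))
      .dropna(subset=["main_name"])
      .reset_index(drop=True)
)
print(result)
```

With the sample data, `result` keeps the five rows abcd, abcdu, abcdw, efg and efgu. Note that this sketch, like the answer, groups on the name with its suffix removed rather than on the first 3 letters literally.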
User contributions licensed under: CC BY-SA