I have a DataFrame sorted by company_name:
company_name
0 A
1 AA
2 abcd
3 abcdu
4 abcdw
5 efg
6 efgu
7 zvttu
8 zvttw
I would like to select the rows that share at least their first 3 letters, where a base name is followed by rows with the same name plus a trailing “u” or “w”.
Ideally I would like the result to look like this (including the “main” name as an extra column).
company_name main_name
0 abcd abcd
1 abcdu abcd
2 abcdw abcd
3 efg efg
4 efgu efg
Assume that the end of the company_name has to be u or w; the start of the name can differ.
Answer
Let’s try:
# extract the base name by stripping a trailing `u` or `w` (NaN where there is none)
s = df.company_name.str.extract('(.*)[uw]$', expand=False)
# fall back to the full name for rows without a `u`/`w` suffix
company_names = s.fillna(df.company_name)
# a base name is valid if it appears both on its own and with a `u`/`w` suffix
valid_names = s.isna().groupby(company_names).transform('nunique') == 2
df['main_name'] = company_names.where(valid_names)
Output:
company_name main_name
0 abcd abcd
1 abcdu abcd
2 abcdw abcd
3 efg efg
4 efgu efg
5 zvttu NaN
6 zvttw NaN
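If only the matched rows are wanted, as in the question's desired result, the rows with a missing main_name can be dropped afterwards. The following is a minimal, self-contained sketch of that idea: the sample data is copied from the question, and the helper name add_main_name is a hypothetical one introduced here for illustration, not part of pandas.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "company_name": ["A", "AA", "abcd", "abcdu", "abcdw",
                     "efg", "efgu", "zvttu", "zvttw"]
})


def add_main_name(frame):
    """Hypothetical helper wrapping the approach from the answer above."""
    # Base name with a trailing `u`/`w` stripped; NaN where there is no such suffix.
    base = frame["company_name"].str.extract(r"(.*)[uw]$", expand=False)
    # Candidate main name: the stripped base, or the full name if nothing was stripped.
    main = base.fillna(frame["company_name"])
    # Keep groups that contain both a suffixed and an unsuffixed row.
    has_both = base.isna().groupby(main).transform("nunique") == 2
    return frame.assign(main_name=main.where(has_both))


# Drop the unmatched rows and renumber the index.
result = (
    add_main_name(df)
    .dropna(subset=["main_name"])
    .reset_index(drop=True)
)
print(result)
```

Two caveats about this sketch: it does not separately enforce the 3-letter minimum, and a base name that itself ends in u or w would be stripped as well, so such names would need extra handling.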