I have a DataFrame sorted by company_name:
  company_name
0            A
1           AA
2         abcd
3        abcdu
4        abcdw
5          efg
6         efgu
7        zvttu
8        zvttw
I would like to select the rows that have their first 3 letters in common, where the rows following the first one end with “u” or “w”.
Ideally I would like the result to look like this (including the “main” name as an extra column).
  company_name main_name
0         abcd      abcd
1        abcdu      abcd
2        abcdw      abcd
3          efg       efg
4         efgu       efg
Assume that the start of the company_name has to contain u or w, the end of the name can differ.
Answer
Let’s try:
# extract the base company name by removing a trailing `u` or `w`
s = df.company_name.str.extract('(.*)[uw]$', expand=False)
company_names = s.fillna(df.company_name)

# valid names are those that appear both alone and with a trailing `u` or `w`
valid_names = s.isna().groupby(company_names).transform('nunique') == 2
df['main_name'] = company_names.where(valid_names)
Output:
  company_name main_name
0         abcd      abcd
1        abcdu      abcd
2        abcdw      abcd
3          efg       efg
4         efgu       efg
5        zvttu       NaN
6        zvttw       NaN
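To try this end to end, here is a minimal self-contained sketch. It rebuilds the sample frame from the question (including the A and AA rows) and applies the same steps. The final `dropna`/`reset_index` step and the variable name `result` are additions, not part of the answer above. They assume you want to keep only the rows that received a main name, as in the layout the question asks for.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "company_name": ["A", "AA", "abcd", "abcdu", "abcdw",
                     "efg", "efgu", "zvttu", "zvttw"]
})

# Base name with a trailing `u` or `w` removed; NaN where the name has no such suffix
s = df.company_name.str.extract(r"(.*)[uw]$", expand=False)
company_names = s.fillna(df.company_name)

# A group is valid when it holds both a suffixed and an unsuffixed name
valid_names = s.isna().groupby(company_names).transform("nunique") == 2
df["main_name"] = company_names.where(valid_names)

# Assumption: keep only rows that got a main name, and renumber the index
result = df.dropna(subset=["main_name"]).reset_index(drop=True)
print(result)
```

Rows such as A and AA have no suffixed partner, and zvttu/zvttw have no unsuffixed base row, so all four get NaN in main_name and are dropped by the last step.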