I have a sorted DataFrame by company_name:
JavaScript
x
13
13
1
company_name
2
0 A
3
1 AA
4
2 abcd
5
3 abcdu
6
4 abcdw
7
5 efg
8
6 efgu
9
7 zvttu
10
8 zvttw
11
12
13
I would like to select the rows which have the first 3 letters in common and have the following rows ending with “u” or “w”.
Ideally I would like the result to look like this (including the “main” name as an extra column).
JavaScript
1
8
1
company_name, main_name
2
0 abcd abcd
3
1 abcdu abcd
4
2 abcdw abcd
5
3 efg efg
6
4 efgu efg
7
8
Assume that the start of the company_name has to contain u or w, the end of the name can differ. Best
Advertisement
Answer
Let’s try:
JavaScript
1
10
10
1
# extract company name by removing ending `uw`
2
s = df.company_name.str.extract('(.*)[uw]$', expand=False)
3
4
company_names = s.fillna(df.company_name)
5
6
# valid names are those appear alone and with `uw`
7
valid_names = s.isna().groupby(company_names).transform('nunique') == 2
8
9
df['main_name'] = company_names.where(valid_names)
10
Output:
JavaScript
1
9
1
company_name main_name
2
0 abcd abcd
3
1 abcdu abcd
4
2 abcdw abcd
5
3 efg efg
6
4 efgu efg
7
5 zvttu NaN
8
6 zvttw NaN
9