Sort dataframe by substring condition excluding similar strings




I have a dataframe with a string column named ‘tag’. Each value of df['tag'] is one of three categories (data_types):

data_types=['DATA','DATAKIND','DATAKINDSIM']

To count the number of rows for each data_type in the ‘tag’ column, I filter with a substring condition (str.contains) like this:

for data_type in data_types:
    df[df.tag.str.contains(data_type,na=False)].count()

But, obviously, the count for the tag ‘DATA’ includes not only the real ‘DATA’ rows but also the ‘DATAKIND’ and ‘DATAKINDSIM’ rows, because ‘DATA’ is a substring of both; likewise, the count for ‘DATAKIND’ includes the ‘DATAKINDSIM’ rows. How can I exclude the similar strings so that each tag is counted only on an exact match?

This is a reproducible example:

import pandas as pd

d = {'tag': ['DATA', 'DATAKIND','DATA','DATA','DATAKINDSIM','DATAKIND']}

df = pd.DataFrame(data=d)


data_types=['DATA','DATAKIND','DATAKINDSIM']

for data_type in data_types:
    print(data_type)
    counting=df[df.tag.str.contains(data_type,na=False)].count()
    print(counting)

And the output:

DATA
tag    6
dtype: int64
DATAKIND
tag    3
dtype: int64
DATAKINDSIM
tag    1
dtype: int64

This would be the expected output if only exact string matches were counted and the similar strings were excluded:

DATA
tag    3
dtype: int64
DATAKIND
tag    2
dtype: int64
DATAKINDSIM
tag    1
dtype: int64

Answer

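If I understand you correctly, you can use isin to first filter your tag column, then use groupby.size. Unlike str.contains, isin compares whole values rather than substrings, so ‘DATA’ no longer matches ‘DATAKIND’: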

data_types=['DATA','DATAKIND','DATAKINDSIM']
df[df['tag'].isin(data_types)].groupby('tag')['tag'].size()

tag
DATA           3
DATAKIND       2
DATAKINDSIM    1
Name: tag, dtype: int64
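
If you would rather keep the loop from the question, here is a minimal sketch of some exact-match alternatives, assuming each tag has to equal the data_type in full. The names exact_counts, fullmatch_counts and tally are only illustrative, and str.fullmatch needs a reasonably recent pandas (1.1 or later, as far as I know):

```python
import re

import pandas as pd

df = pd.DataFrame({'tag': ['DATA', 'DATAKIND', 'DATA', 'DATA', 'DATAKINDSIM', 'DATAKIND']})
data_types = ['DATA', 'DATAKIND', 'DATAKINDSIM']

# Plain equality: a row counts only when the whole tag equals data_type.
exact_counts = {
    data_type: int((df['tag'] == data_type).sum())
    for data_type in data_types
}

# str.fullmatch requires the pattern to match the entire string;
# re.escape guards against regex metacharacters in the tag names.
fullmatch_counts = {
    data_type: int(df['tag'].str.fullmatch(re.escape(data_type), na=False).sum())
    for data_type in data_types
}

# value_counts tallies every distinct tag in one pass; reindex keeps the
# order of data_types and reports 0 for any category that never appears.
tally = df['tag'].value_counts().reindex(data_types, fill_value=0)

print(exact_counts)
print(fullmatch_counts)
print(tally)
```

Equality is the simplest option when the tags are fixed strings; str.fullmatch is mainly useful if the categories later become real patterns.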


Source: stackoverflow