I have a dataframe with a string column named ‘tag’.
The column holds three categories (data_types):
df['tag']
data_types=['DATA','DATAKIND','DATAKINDSIM']
To count the rows for each data_type in the ‘tag’ column, I apply a string-contains condition like this:
for data_type in data_types:
    df[df.tag.str.contains(data_type, na=False)].count()
But the count for the tag ‘DATA’ includes the real ‘DATA’ rows plus the ‘DATAKIND’ and ‘DATAKINDSIM’ rows, because ‘DATA’ is a substring of both; likewise, the count for ‘DATAKIND’ also includes the ‘DATAKINDSIM’ rows. How can I exclude the similar strings so that only the exact tag is counted?
This is a reproducible example:
import pandas as pd

d = {'tag': ['DATA', 'DATAKIND', 'DATA', 'DATA', 'DATAKINDSIM', 'DATAKIND']}
df = pd.DataFrame(data=d)
data_types = ['DATA', 'DATAKIND', 'DATAKINDSIM']
for data_type in data_types:
    print(data_type)
    # str.contains matches substrings, so 'DATA' also matches 'DATAKIND' and 'DATAKINDSIM'
    counting = df[df.tag.str.contains(data_type, na=False)].count()
    print(counting)
And the output:
DATA
tag 6
dtype: int64
DATAKIND
tag 3
dtype: int64
DATAKINDSIM
tag 1
dtype: int64
This is the expected output when only exact string matches are counted and the similar strings are excluded:
DATA
tag 3
dtype: int64
DATAKIND
tag 2
dtype: int64
DATAKINDSIM
tag 1
dtype: int64
Answer
If I understand you correctly, you can use isin to first filter your tag column, then use groupby.size:
data_types=['DATA','DATAKIND','DATAKINDSIM']
df[df['tag'].isin(data_types)].groupby('tag')['tag'].size()
tag
DATA 3
DATAKIND 2
DATAKINDSIM 1
Name: tag, dtype: int64
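If you would rather keep the per-category loop from the question, a minimal sketch (assuming the tags must match exactly, with no partial matches wanted) is to replace str.contains with an equality comparison. The names exact_counts and counts below are only illustrative. The second variant uses value_counts with reindex, which also reports 0 for any category that does not appear in the column:

```python
import pandas as pd

df = pd.DataFrame({'tag': ['DATA', 'DATAKIND', 'DATA', 'DATA', 'DATAKINDSIM', 'DATAKIND']})
data_types = ['DATA', 'DATAKIND', 'DATAKINDSIM']

# Exact comparison: 'DATA' no longer matches 'DATAKIND' or 'DATAKINDSIM'
exact_counts = {t: int((df['tag'] == t).sum()) for t in data_types}
print(exact_counts)

# Count every distinct tag once, then keep only the categories of interest;
# categories missing from the column get a count of 0
counts = df['tag'].value_counts().reindex(data_types, fill_value=0)
print(counts)
```

Both variants give 3, 2 and 1 for the example data, the same as the isin/groupby approach.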