I have a dataframe with a string column named ‘tag’. The column df['tag'] takes three values (data_types):
data_types=['DATA','DATAKIND','DATAKINDSIM']
To count the number of rows for each data_type in the ‘tag’ column, I apply a substring condition this way:
for data_type in data_types:
    df[df.tag.str.contains(data_type, na=False)].count()
But, obviously, the count for the tag ‘DATA’ includes not only the real ‘DATA’ rows but also the ‘DATAKIND’ and ‘DATAKINDSIM’ rows; likewise, the count for ‘DATAKIND’ also includes ‘DATAKINDSIM’. How can I exclude the similar strings so that each count only covers exact matches?
This is a reproducible example:
import pandas as pd

d = {'tag': ['DATA', 'DATAKIND', 'DATA', 'DATA', 'DATAKINDSIM', 'DATAKIND']}
df = pd.DataFrame(data=d)
data_types = ['DATA', 'DATAKIND', 'DATAKINDSIM']

for data_type in data_types:
    print(data_type)
    # str.contains matches substrings, so 'DATA' also matches 'DATAKIND' and 'DATAKINDSIM'
    counting = df[df.tag.str.contains(data_type, na=False)].count()
    print(counting)
And the output:
DATA
tag    6
dtype: int64
DATAKIND
tag    3
dtype: int64
DATAKINDSIM
tag    1
dtype: int64
The expected output, counting only exact string matches and excluding the similar strings, would be:
DATA
tag    3
dtype: int64
DATAKIND
tag    2
dtype: int64
DATAKINDSIM
tag    1
dtype: int64
Answer
If I understand you correctly, you can use isin to first filter your tag column, then use groupby.size:
data_types = ['DATA', 'DATAKIND', 'DATAKINDSIM']
df[df['tag'].isin(data_types)].groupby('tag')['tag'].size()

tag
DATA           3
DATAKIND       2
DATAKINDSIM    1
Name: tag, dtype: int64
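If you would rather keep the loop from the question, or want a category with zero rows to still show up in the result, the following is a minimal sketch of a few alternatives. It reuses the sample data from the question; the names exact_counts, counts and full_counts are only illustrative, and the str.fullmatch variant assumes pandas 1.1 or newer.

```python
import re

import pandas as pd

df = pd.DataFrame({'tag': ['DATA', 'DATAKIND', 'DATA', 'DATA', 'DATAKINDSIM', 'DATAKIND']})
data_types = ['DATA', 'DATAKIND', 'DATAKINDSIM']

# Option 1: keep the loop, but test for equality instead of substring containment.
# Missing values compare as False, so they are never counted.
exact_counts = {}
for data_type in data_types:
    exact_counts[data_type] = int((df['tag'] == data_type).sum())
print(exact_counts)

# Option 2: count every distinct tag in one pass, then keep only the categories
# of interest; reindex fills in 0 for a category that never occurs.
counts = df['tag'].value_counts().reindex(data_types, fill_value=0)
print(counts)

# Option 3: if a regex-based match is really needed, str.fullmatch requires the
# whole string to match (assumes pandas >= 1.1). re.escape guards against
# special characters in the tag names.
full_counts = {
    data_type: int(df['tag'].str.fullmatch(re.escape(data_type), na=False).sum())
    for data_type in data_types
}
print(full_counts)
```

All three variants compare the whole string rather than a substring, which is why 'DATA' no longer picks up the 'DATAKIND' and 'DATAKINDSIM' rows.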