tablet = ['ipad', 'tablet']
connection_issue = ['load', 'loading','error loading', 'connection issue']
blankQuestion = ['blank question', 'question loading', 'question does not load', "question doesn't load", 'no question']
mocktest = ['mock test','mock tests']
df['bug_types'] = np.where(df['Ticket description'].str.contains(*tablet),'tablet',
np.where(df['Ticket description'].str.contains(*connection_issue),'connectionz',
np.where(df['Ticket description'].isin(blankQuestion),'blankQuestion',
np.where(df['Ticket description'].str.contains(*mocktest),'mock tests', 'others'))))
There is no string in connection_issue :(. And the code works fine for tablet, i.e., if I just change .str.contains(*connection_issue) back to .isin(connection_issue), the rest, including .str.contains(*tablet), runs perfectly fine.
Answer
@matchifang has the right explanation as to why!
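That explanation is not reproduced here. As a hedged sketch of the likely cause: .str.contains takes a single pattern as its first argument, with case, flags and na following it positionally, so unpacking a list with * spreads the keywords over those parameters. With the two-item tablet list, 'tablet' lands in case and is merely truthy; with the four-item connection_issue list, a string ends up in flags, which expects an integer. The Series named tickets below is made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, not taken from the question.
tickets = pd.Series([
    "My iPad freezes",
    "Error loading the page",
    "Everything is fine",
])
connection_issue = ['load', 'loading', 'error loading', 'connection issue']

# Instead of unpacking the list into separate arguments, join the
# keywords into one regular-expression alternation.
pattern = '|'.join(connection_issue)
mask = tickets.str.contains(pattern, case=False)

print(pattern)
print(mask.tolist())
```

Only the second ticket mentions loading, so it is the only one the mask selects.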
If you want to add more tags in the future based on different keywords, it helps to have a more dynamic way of checking for tags. I recommend the following solution:
#!/usr/bin/env python
from collections import OrderedDict
import pandas as pd
tags_keywords = OrderedDict([
    ('tablet', ['ipad', 'tablet']),
    ('connection_issue', ['load', 'connection issue']),  # 'loading' and 'error loading' will be picked up by 'load'
    ('blank_question', ['blank question', 'question loading', 'question does not load', "question doesn't load", 'no question']),
    ('mock_test', ['mock test']),  # 'mock tests' will be found by 'mock test'
    ('app_quit', ['quit']),  # 'quitting' will be picked up by 'quit'
    ('scoring', ['sas', 'decile', 'attainment', 'stars']),  # going to make everything lowercase for easier comparison
    ('alp', ['alp', 'atom learning point']),
    ('learning_journey', ['learning journey', 'world']),
    ('transcript', ['transcript', 'score card']),
    ('practice', ['practice', 'custom', 'suggested'])  # 'practice' covers 'suggested practice', 'custom practice', etc
])
df = pd.DataFrame([{
    'Ticket description': "I'm having trouble loading the world"
}, {
    'Ticket description': "ALPs make no sense at all"
}, {
    'Ticket description': "My attainment score keeps going up and down"
}, {
    'Ticket description': "I can't finish my MOCK TEST"
}])
df['bug_types'] = 'others'
for tag, keywords in tags_keywords.items():
    df.loc[df['Ticket description'].str.contains('|'.join(keywords), case=False), 'bug_types'] = tag
print(df)
In this solution, the “later” tags in the dict take priority: each pass of the loop overwrites any tag assigned by an earlier one. We use an OrderedDict to guarantee that this order is respected, because regular dictionaries only guarantee insertion order from Python 3.7 onward (in CPython 3.6 it is an implementation detail). We build it from a list of tuples because, on Python versions before 3.6, building it from a dict literal would first create an unordered dict, so the order could not be guaranteed either.
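To see the priority rule in action, here is a minimal sketch with a ticket that matches two tags. The helper tag_tickets and the sample tickets are hypothetical, and the plain dict assumes Python 3.7 or later, where insertion order is guaranteed.

```python
import pandas as pd

# Plain dict: relies on insertion order being preserved (Python 3.7+).
tags_keywords = {
    'tablet': ['ipad', 'tablet'],
    'mock_test': ['mock test'],
}

# Hypothetical tickets; the first one matches both tags.
df = pd.DataFrame({
    'Ticket description': ["Mock test crashes on my iPad", "My tablet is slow"]
})


def tag_tickets(frame, mapping):
    """Assign one tag per row; tags later in `mapping` overwrite earlier ones."""
    result = pd.Series('others', index=frame.index)
    for tag, keywords in mapping.items():
        mask = frame['Ticket description'].str.contains('|'.join(keywords), case=False)
        result[mask] = tag
    return result


# Same order as the dict: 'mock_test' is applied last, so it wins.
later_wins = tag_tickets(df, tags_keywords)

# Reversed order: 'tablet' is applied last, so it wins instead.
earlier_wins = tag_tickets(df, dict(reversed(list(tags_keywords.items()))))

print(later_wins.tolist())
print(earlier_wins.tolist())
```

If the first matching tag should win rather than the last, iterating over the mapping in reverse, as in earlier_wins, is one way to get that behaviour without changing the loop.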
Then we iterate over all the tag/keyword combinations and look for any of each tag's keywords, joined with '|' into a single regular-expression alternation (the format @matchifang mentioned). Passing case=False makes the search case-insensitive, so both uppercase and lowercase values match.
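One caveat with joining keywords using '|': the result is treated as a regular expression, so a keyword containing a character such as + or . would be interpreted as regex syntax. The sketch below, which uses a made-up keyword '11+' and made-up tickets, shows one possible safeguard: passing each keyword through re.escape before joining.

```python
import re

import pandas as pd

# Hypothetical data: the keyword contains '+', a regex metacharacter.
tickets = pd.Series(["The 11+ paper will not open", "I scored 110 points"])
keywords = ['11+']

# Unescaped: '11+' means "a 1 followed by one or more 1s", so it also hits '110'.
raw_mask = tickets.str.contains('|'.join(keywords), case=False)

# Escaped: each keyword is matched literally.
escaped_pattern = '|'.join(re.escape(keyword) for keyword in keywords)
escaped_mask = tickets.str.contains(escaped_pattern, case=False)

print(raw_mask.tolist())
print(escaped_mask.tolist())
```

None of the keywords in the answer above contain such characters, so this only matters once the keyword lists grow.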