tablet = ['ipad', 'tablet']
connection_issue = ['load', 'loading', 'error loading', 'connection issue']
blankQuestion = ['blank question', 'question loading', 'question does not load', "question doesn't load", 'no question']
mocktest = ['mock test', 'mock tests']

df['bug_types'] = np.where(df['Ticket description'].str.contains(*tablet), 'tablet',
                  np.where(df['Ticket description'].str.contains(*connection_issue), 'connectionz',
                  np.where(df['Ticket description'].isin(blankQuestion), 'blankQuestion',
                  np.where(df['Ticket description'].str.contains(*mocktest), 'mock tests', 'others'))))
There is no string in connection_issue :(. And the code works fine for tablet, i.e., if I just change .str.contains(*connection_issue) back to .isin(connection_issue), the rest, including .str.contains(*tablet), runs perfectly fine.
Answer
@matchifang has the right explanation as to why!
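That explanation isn't reproduced here, so the following is only a rough sketch of the likely cause. It assumes Series.str.contains takes its parameters in the order pat, case, flags, na, regex, as listed in the pandas documentation. The function contains_args is a hypothetical stand-in (not a pandas function) that just reports how the unpacked list items get bound to those parameters:

```python
def contains_args(pat, case=True, flags=0, na=None, regex=True):
    """Hypothetical stand-in: report how positional arguments are bound.

    The parameter order is assumed to mirror Series.str.contains.
    """
    return {'pat': pat, 'case': case, 'flags': flags, 'na': na, 'regex': regex}


tablet = ['ipad', 'tablet']
connection_issue = ['load', 'loading', 'error loading', 'connection issue']

# *tablet unpacks into two positional arguments: pat and case
two = contains_args(*tablet)
# *connection_issue unpacks into four: pat, case, flags and na
four = contains_args(*connection_issue)

print(two)
print(four)

# Joining the keywords first passes a single pattern instead
joined = contains_args('|'.join(connection_issue))
print(joined)
```

With two items, the second one lands in case, where any non-empty string is truthy, so nothing complains, but only 'ipad' is actually searched for. With four items, a string ends up in flags, which is expected to be an integer, and that is a plausible source of the error.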
If you want to add more tags in the future based on different keywords, it would be good to have a more dynamic way of checking for tags. I recommend the following solution:
#!/usr/bin/env python
from collections import OrderedDict

import pandas as pd

# Keywords are all lowercase; the search below is case-insensitive.
tags_keywords = OrderedDict([
    ('tablet', ['ipad', 'tablet']),
    ('connection_issue', ['load', 'connection issue']),  # 'loading' and 'error loading' will be picked up by 'load'
    ('blank_question', ['blank question', 'question loading', 'question does not load', "question doesn't load", 'no question']),
    ('mock_test', ['mock test']),  # 'mock tests' will be found by 'mock test'
    ('app_quit', ['quit']),  # 'quitting' will be picked up by 'quit'
    ('scoring', ['sas', 'decile', 'attainment', 'stars']),
    ('alp', ['alp', 'atom learning point']),
    ('learning_journey', ['learning journey', 'world']),
    ('transcript', ['transcript', 'score card']),
    ('practice', ['practice', 'custom', 'suggested'])  # 'practice' covers 'suggested practice', 'custom practice', etc
])

df = pd.DataFrame([
    {'Ticket description': "I'm having trouble loading the world"},
    {'Ticket description': "ALPs make no sense at all"},
    {'Ticket description': "My attainment score keeps going up and down"},
    {'Ticket description': "I can't finish my MOCK TEST"},
])

# Default tag for rows that match no keyword
df['bug_types'] = 'others'

# Later tags overwrite earlier ones for rows that match both
for tag, keywords in tags_keywords.items():
    df.loc[df['Ticket description'].str.contains('|'.join(keywords), case=False), 'bug_types'] = tag

print(df)
In this solution, tags that appear later in the dict take priority: each pass of the loop overwrites bug_types for every row it matches, so the last matching tag wins. We’re using an OrderedDict to guarantee that the order is respected, because regular dictionaries only guarantee insertion order from Python 3.7 onward (in CPython 3.6 it is an implementation detail). We’re creating the dict from a list of tuples because, on a Python version <3.6, creating it from another dict would first build an unordered dict, so the order couldn’t be guaranteed either.
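To make the priority rule concrete, here is a minimal pure-Python sketch for a single ticket. The helper names last_match and first_match are made up for illustration, and a plain lowercase substring check stands in for the regex search, which is only equivalent for keywords without regex metacharacters:

```python
from collections import OrderedDict

# Two tags taken from the solution above, in the same relative order
tags_keywords = OrderedDict([
    ('connection_issue', ['load', 'connection issue']),
    ('learning_journey', ['learning journey', 'world']),
])


def last_match(text, tags_keywords):
    """Mirror the loop in the solution: every matching tag overwrites the previous one."""
    result = 'others'
    for tag, keywords in tags_keywords.items():
        if any(keyword in text.lower() for keyword in keywords):
            result = tag
    return result


def first_match(text, tags_keywords):
    """Hypothetical variant: iterate in reverse so the earliest tag is written last."""
    result = 'others'
    for tag, keywords in reversed(tags_keywords.items()):
        if any(keyword in text.lower() for keyword in keywords):
            result = tag
    return result


ticket = "I'm having trouble loading the world"
print(last_match(ticket, tags_keywords))
print(first_match(ticket, tags_keywords))
```

The ticket matches both 'load' and 'world', so the two helpers disagree: last_match returns the later tag, first_match the earlier one. If you would rather have earlier tags win (as in the nested np.where from the question), iterating in reverse is one way to get that.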
Then, we’re iterating over all the tag/keyword combinations and looking for any of each tag’s keywords, which we’re joining with '|' into a single regex alternation (the format @matchifang mentioned). By adding case=False we’re making the search case-insensitive, so both uppercase and lowercase values will match.
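One caveat with joining keywords using '|': the result is treated as a regular expression, so a keyword containing characters such as + or . would not be matched literally. A small sketch of one way to guard against that, using re.escape from the standard library; the keyword 'c++ quiz' and the sample tickets are invented for illustration:

```python
import re

import pandas as pd

# 'c++ quiz' is a made-up keyword containing regex metacharacters
keywords = ['mock test', 'c++ quiz']

# Escape each keyword so it is matched literally, then join into one alternation
pattern = '|'.join(re.escape(keyword) for keyword in keywords)

tickets = pd.Series([
    "I can't finish my MOCK TEST",
    "The C++ quiz crashes",
    "Everything is fine",
])

# case=False makes the regex search case-insensitive
matches = tickets.str.contains(pattern, case=False)
print(matches.tolist())
```

For the keyword lists in the solution this makes no difference, since none of them contain special characters, but it may save a surprise if new tags are added later.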