Tag: tokenize

Substring any kind of HTML String

I need to split any kind of HTML code (a string) into a list of tokens. For example: or or What I tried to do: My output: So I tried to split at "/>", which works for the first case. Then I tried several things. I tried to identify the "name", so the first identifier of the html str…
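
The excerpt above is cut off, so the asker's real input and expected output are not known. As a rough sketch only, and not the accepted answer to the question, one way to avoid splitting on "/>" by hand is to let the standard library's html.parser walk the string and collect each tag and text run as a token. The names TokenCollector and tokenize_html are made up for this illustration.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Hypothetical helper: records tags and text runs in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Keep the opening tag exactly as it was written in the input.
        self.tokens.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img .../> become a single token.
        self.tokens.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # The parser reports the tag name in lowercase; rebuild the closing tag.
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # Skip whitespace-only runs between tags (an assumption of this sketch).
        text = data.strip()
        if text:
            self.tokens.append(text)


def tokenize_html(html):
    """Return the tags and text runs of an HTML string as a list of strings."""
    parser = TokenCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of the input
    return parser.tokens
```

Calling tokenize_html('<div class="a"><img src="a.png"/>Hi</div>') yields one token per opening tag, self-closing tag, text run, and closing tag. Whether this matches the token boundaries the asker wanted cannot be told from the truncated excerpt.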

issue

It might be a basic question, but I am stuck here and not really sure what went wrong. df['text'] contains the text data that I want to work on, and it returns [<nltk.tokenize.casual.TweetTokenizer object at 0x7f80216950a0>, <nltk.tokenize.casual.TweetTokenizer object at 0x7f8022278670>, <…