
Tag: tokenize

How to resolve TypeError: cannot use a string pattern on a bytes-like object – word_tokenize, Counter and spacy

My dataset is the sales transaction history of an online store. I need to create a category based on the text in the Description column. I have done some text pre-processing and clustering. This is what the head of the dataframe cat_df looks like: Description Text Cluster9 0 WHITE HANGING HEART T-LIGHT HOLDER white hanging heart t-light holder 1 1 WHITE METAL
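The excerpt is cut off before the failing code, so the actual cause in the asker's pipeline is not shown. As a minimal sketch, one common trigger for this error is passing a bytes value to a regex-based tokenizer that uses a string pattern. The `raw` value and the token pattern below are assumptions made for illustration, and the standard-library `re` module stands in for word_tokenize and spacy.

```python
import re

# Hypothetical bytes value, e.g. text that was encoded or read in binary mode.
raw = b"WHITE HANGING HEART T-LIGHT HOLDER"

# Applying a str pattern to a bytes object raises the TypeError from the title.
try:
    re.findall(r"\w+", raw)
except TypeError as exc:
    error_message = str(exc)

# Decoding to str first lets the string pattern match.
text = raw.decode("utf-8").lower()
tokens = re.findall(r"[\w-]+", text)  # keep hyphenated words such as "t-light"
```

If the column really does hold bytes, decoding once before tokenizing is usually simpler than switching every pattern to a bytes pattern.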

Substring any kind of HTML String

I need to split any kind of HTML code (string) into a list of tokens. For example: or or What I tried to do: My output: So I tried to split at “/>”, which works for the first case. Then I tried several things. I tried to identify the “name”, that is, the first identifier of the HTML string, like
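The asker's sample inputs and attempts did not survive in the excerpt, so the following is only a sketch of one possible approach: letting the standard-library `html.parser.HTMLParser` produce the tokens instead of splitting on “/>”. The `TokenCollector` class, the `tokenize_html` helper, the token tuple layout, and the `sample` string are all hypothetical and not taken from the question.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Collects (kind, value, attributes) tuples while parsing."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag, {}))

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing tags such as <input ... />.
        self.tokens.append(("selfclosing", tag, dict(attrs)))

    def handle_data(self, data):
        # Skip whitespace-only text between tags.
        if data.strip():
            self.tokens.append(("text", data.strip(), {}))


def tokenize_html(html):
    parser = TokenCollector()
    parser.feed(html)
    parser.close()
    return parser.tokens


# Hypothetical input; the original examples are missing from the excerpt.
sample = '<div class="box"><input name="q"/>Hello</div>'
tokens = tokenize_html(sample)
```

Here the first element of each tuple is the token kind and the second is the tag name, which corresponds to the “name” the asker was trying to identify by hand.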

Getting the number of words from tf.Tokenizer after fitting

I initially tried making an RNN that can predict Shakespeare text, and I did it successfully using character-level encoding. But when I switched to word-level encoding, I ran into a multitude of issues. Specifically, I am having a hard time getting the total number of characters (I was told it was just dataset_size = tokenizer.document_count but this just returns

issue

It might be a basic question, but I am stuck here and not really sure what went wrong. df['text'] contains the text data that I want to work on and it returns [<nltk.tokenize.casual.TweetTokenizer object at 0x7f80216950a0>, <nltk.tokenize.casual.TweetTokenizer object at 0x7f8022278670>, <nltk.tokenize.casual.TweetTokenizer object at 0x7f7fec0bbc70>, <nltk.tokenize.casual.TweetTokenizer object at 0x7f7febf74970>, <nltk.tokenize.casual.TweetTokenizer object at 0x7f7febf747c0>, <nltk.tokenize.casual.TweetTokenizer object at 0x7f7febf74a90>, <nltk.tokenize.casual.TweetTokenizer object at 0x7f7febf748b0>, <nltk.tokenize.casual.TweetTokenizer

What is Keras’ Tokenizer fit_on_sequences used for?

I’m familiar with the ‘fit_on_texts’ method of Keras’ Tokenizer. What does ‘fit_on_sequences’ do and when is it useful? According to the documentation, it “Updates internal vocabulary based on a list of sequences.”, and it takes as input: ‘A list of sequence. A “sequence” is a list of integer word indices.’. When is this useful? For fitting on texts, I
