My dataset is the sales transaction history of an online store. I need to create a category based on the text in the Description column. I have done some text pre-processing and clustering. This is what the head of the dataframe cat_df looks like:

                            Description                                Text  Cluster9
0    WHITE HANGING HEART T-LIGHT HOLDER  white hanging heart t-light holder         1
1    WHITE METAL
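A minimal sketch of how a cluster column like Cluster9 could have been produced, assuming TF-IDF features and k-means; the model choice, the example rows beyond the first, and the number of clusters are illustrative assumptions, not taken from the question:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Illustrative rows; only the first appears in the question
cat_df = pd.DataFrame({"Description": ["WHITE HANGING HEART T-LIGHT HOLDER",
                                       "RED WOOLLY HOTTIE WHITE HEART",
                                       "KNITTED UNION FLAG HOT WATER BOTTLE"]})
# Pre-processed text column (here simply lower-cased)
cat_df["Text"] = cat_df["Description"].str.lower()

# Vectorize the cleaned text and assign each row a cluster label
X = TfidfVectorizer().fit_transform(cat_df["Text"])
cat_df["Cluster9"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```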
Tag: tokenize
How to create a list of tokenized words from dataframe column using spaCy?
I'm trying to apply spaCy's tokenizer to a dataframe column to get a new column containing a list of tokens. Assume we have the following dataframe: The code below aims to tokenize the Text column: The result looks like: Now we have a new column, tokens, which holds a Doc object for each sentence. How could we change the code to get a Python
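A minimal sketch of the usual fix, assuming a Text column and the small English pipeline (en_core_web_sm is an assumption; any loaded pipeline works): iterate over each Doc and keep token.text, so the new column holds plain strings rather than Doc objects.

```python
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

df = pd.DataFrame({"Text": ["white hanging heart t-light holder",
                            "red woolly hottie"]})

# nlp(text) returns a Doc; listing token.text yields plain string tokens
df["tokens"] = df["Text"].apply(lambda text: [tok.text for tok in nlp(text)])
print(df["tokens"])
```

For larger dataframes, passing the whole column through nlp.pipe is typically faster than calling nlp row by row.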
Substring any kind of HTML String
I need to divide any kind of HTML code (string) into a list of tokens. For example: or or What I tried to do: My output: So I tried to split at "/>", which works for the first case. Then I tried several things. I tried to identify the "name", i.e. the first identifier of the HTML string, like
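Splitting on "/>" only handles self-closing tags. A sketch of a more robust approach using the standard library's html.parser, which tokenizes start tags, end tags, and text data uniformly; the tuple format for tokens here is just one possible choice:

```python
from html.parser import HTMLParser

class TagTokenizer(HTMLParser):
    """Collects (kind, ...) tuples for each HTML token encountered."""
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        if data.strip():
            self.tokens.append(("data", data.strip()))

tokenizer = TagTokenizer()
tokenizer.feed('<img src="x.png"/><p>hello</p>')
print(tokenizer.tokens)
# [('start', 'img', [('src', 'x.png')]), ('start', 'p', []), ('data', 'hello'), ('end', 'p')]
```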
Getting the number of words from tf.Tokenizer after fitting
I initially tried making an RNN that can predict Shakespeare text, and I did it successfully using character-level encoding. But when I switched to word-level encoding, I ran into a multitude of issues. Specifically, I am having a hard time getting the total number of words (I was told it was just dataset_size = tokenizer.document_count, but this just returns
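For Keras' Tokenizer (tf.keras.preprocessing.text.Tokenizer), document_count is the number of texts that were fit on, not a word count. A small sketch of the counts that are actually available after fitting:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["to be or not to be", "all the world's a stage"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

# Number of documents fit on, not a word count
print(tokenizer.document_count)           # 2

# Vocabulary size: number of distinct words seen
print(len(tokenizer.word_index))

# Total word occurrences across the whole corpus
print(sum(tokenizer.word_counts.values()))
```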
issue
It might be a basic question, but I am stuck here and not really sure what went wrong. df['text'] contains the text data that I want to work on, and it returns [<nltk.tokenize.casual.TweetTokenizer object at 0x7f80216950a0>, <nltk.tokenize.casual.TweetTokenizer object at 0x7f8022278670>, <nltk.tokenize.casual.TweetTokenizer object at 0x7f7fec0bbc70>, <nltk.tokenize.casual.TweetTokenizer object at 0x7f7febf74970>, <nltk.tokenize.casual.TweetTokenizer object at 0x7f7febf747c0>, <nltk.tokenize.casual.TweetTokenizer object at 0x7f7febf74a90>, <nltk.tokenize.casual.TweetTokenizer object at 0x7f7febf748b0>, <nltk.tokenize.casual.TweetTokenizer
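The list of TweetTokenizer object reprs suggests the class was instantiated once per row, e.g. something like df['text'].apply(lambda t: TweetTokenizer()), instead of being used to tokenize; this reconstruction is an assumption, since the question's code is cut off. A sketch of the usual fix, calling .tokenize on each string:

```python
import pandas as pd
from nltk.tokenize import TweetTokenizer

df = pd.DataFrame({"text": ["good morning!", "loving this :)"]})

# Build one tokenizer and apply its tokenize method to every row
tok = TweetTokenizer()
df["tokens"] = df["text"].apply(tok.tokenize)
print(df["tokens"])
```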
What is Keras’ Tokenizer fit_on_sequences used for?
I'm familiar with the method fit_on_texts from Keras' Tokenizer. What does fit_on_sequences do and when is it useful? According to the documentation, it "Updates internal vocabulary based on a list of sequences," and it takes as input "A list of sequence. A 'sequence' is a list of integer word indices." When is this useful? For fitting on texts, I
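A sketch of one situation where fit_on_sequences helps, based on how the Keras Tokenizer tracks document frequencies: the data already arrives as integer word indices, with no raw text available, and you want sequences_to_matrix modes such as "tfidf" that need per-index document counts.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Documents that are already encoded as integer word indices
sequences = [[1, 2, 3, 2], [2, 4, 5]]

tokenizer = Tokenizer(num_words=10)
# Records document counts per index; required before sequences_to_matrix
# if fit_on_texts was never called
tokenizer.fit_on_sequences(sequences)

matrix = tokenizer.sequences_to_matrix(sequences, mode="tfidf")
print(matrix.shape)  # (2, 10)
```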