Tag: tokenize

How to resolve TypeError: cannot use a string pattern on a bytes-like object – word_tokenize, Counter and spacy

My dataset is a sales transactions history of an online store. I need to create a category based on the texts in the Description column. I have done some text pre-processing and clustering. This is how the dataframe cat_df head looks like: Description Text Cluster9 0 WHITE HANGING HEART T-LIGHT HOLDER white h…

How to create a list of tokenized words from dataframe column using spaCy?

nlp pandas python spacy tokenize

I’m trying to apply spaCys tokenizer on dataframe column to get a new column containing list of tokens. Assume we have the following dataframe: The code below aims to tokenize Text column: The results looks like: Now, we have a new column tokens, which returns doc object for each sentence. How could we …

Substring any kind of HTML String

html parsing python tokenize web-crawler

i need to divide any kind of html code (string) to a list of tokens. For example: or or What i tried to do : My output: So i tried to split at “/>” which is working for the first case. Then i tried several things. Tried to identify the “name”, so the first identifier of the html str…

Getting the number of words from tf.Tokenizer after fitting

python python-3.x tensorflow tokenize

I initially tried making an RNN that can predict Shakespeare text, and I did it successfully using character level-encoding. But when I switched to word level encoding, I ran into a multitude of issues. Specifically, I am having a hard time getting the total number of characters (I was told it was just datase…

issue

nltk python tokenize

it might be a basic question but I am stuck here not really sure what went wrong. df[‘text’] contains the text data that I want to work on and it returns [<nltk.tokenize.casual.TweetTokenizer object at 0x7f80216950a0>, <nltk.tokenize.casual.TweetTokenizer object at 0x7f8022278670>, &lt…

What is Keras’ Tokenizer fit_on_sequences used for?

keras python tensorflow text-processing tokenize

I’m familiar with the method ‘fit_on_texts’ from the Keras’ Tokenizer. What does ‘fit_on_sequences’ do and when is it useful? According to the documentation, it “Updates internal vocabulary based on a list of sequences.”, and it takes as input: ‘A list of …