I have a similarity matrix of words and would like to apply an algorithm that can put the words in clusters. Here’s the example I have so far: Obviously this is a very simple dummy example, but what I would expect the output to be is 2 clusters, one with ‘The Bachelor’, ‘The Bachelorette’, ‘The Bachelor Special’, and the other with ‘SportsCenter’, ‘SportsCenter
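One simple way to turn a similarity matrix into clusters is to link every pair whose similarity reaches a threshold and take the connected components. The sketch below is only an illustration of that idea, not the approach from the original question: the function name `cluster_by_threshold`, the threshold of 0.5, and the matrix values are all made up, and only the four titles that are fully visible in the excerpt are used.

```python
def cluster_by_threshold(labels, sim, threshold=0.5):
    """Group labels linked by pairwise similarity >= threshold (connected components)."""
    n = len(labels)
    parent = list(range(n))  # union-find forest, one node per label

    def find(i):
        # Walk up to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i, label in enumerate(labels):
        groups.setdefault(find(i), []).append(label)
    return list(groups.values())


# Hypothetical input: the similarity values are invented for illustration.
labels = ["The Bachelor", "The Bachelorette", "The Bachelor Special", "SportsCenter"]
sim = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.1],
    [0.8, 0.7, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]

clusters = cluster_by_threshold(labels, sim)
print(clusters)
```

With these made-up values the three Bachelor titles end up in one group and SportsCenter in another. The threshold drives the result, so it would need tuning against a real matrix.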
Tag: nlp
What does this “.children” attribute do?
I’m trying to understand how a Key-Bigram extractor works, and I cannot understand what the following block of code does. Here is the source code. Everything else is working fine and I understand it well; however, I cannot understand what child for child in possible_words.children does. Answer token.children uses the dependency parse to get all tokens that directly depend on
Gensim Word2Vec exhausting iterable
I’m getting the following prompt when calling model.train() from gensim word2vec. The only solutions I found while searching for an answer point to the iterable vs. iterator difference, and at this point I have tried everything I could to solve this on my own. Currently, my code looks like this: The corpus variable is a list containing sentences, and each
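The iterable vs. iterator difference mentioned above can be shown without gensim. Word2Vec makes more than one pass over its input (building the vocabulary, then the training epochs), so a one-shot generator is empty by the second pass. The excerpt is truncated, so whether this is the actual cause in this question is not known. The names `SentenceCorpus` and `sentence_generator` and the sample sentences below are hypothetical.

```python
class SentenceCorpus:
    """Restartable iterable: every call to __iter__ begins a fresh pass."""

    def __init__(self, sentences):
        self.sentences = sentences

    def __iter__(self):
        for sentence in self.sentences:
            yield sentence.split()  # one list of tokens per sentence


def sentence_generator(sentences):
    """One-shot iterator: it can be consumed only once."""
    for sentence in sentences:
        yield sentence.split()


raw = ["the cat sat", "the dog barked"]  # made-up sample data

gen = sentence_generator(raw)
first_pass = list(gen)
second_pass = list(gen)  # the generator is already exhausted here

corpus = SentenceCorpus(raw)
pass_a = list(corpus)
pass_b = list(corpus)  # a new generator is created, so this pass is complete

print(len(first_pass), len(second_pass), len(pass_a), len(pass_b))
```

A plain list of token lists is also restartable, so wrapping the data in a class like this matters mainly when the sentences are streamed from disk.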
page scraping using beautiful soup, without links
I am using the following code to extract text from a web page: The problem is, when I open text, I get all the links from the bottoms that exist at the top of the page, which I don’t want. How can I modify the above code to do so? I also get the footnotes, which I may want, but
print strings of one dataframe contained in another dataframe
I have two dataframes: one dataframe consists of two columns (‘good’ and ‘bad’) and another one contains text data. Now I would like to retrieve exact string matches of words that are in the dictionary and are contained in col1 of df_text, and assign the string match to the second column of df_text. I tried .isin(), however this code
How to resolve TypeError: cannot use a string pattern on a bytes-like object – word_tokenize, Counter and spacy
My dataset is the sales transaction history of an online store. I need to create a category based on the texts in the Description column. I have done some text pre-processing and clustering. This is what the head of the dataframe cat_df looks like: Description Text Cluster9 0 WHITE HANGING HEART T-LIGHT HOLDER white hanging heart t-light holder 1 1 WHITE METAL
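The error in the title is raised by Python’s `re` module whenever a pattern compiled from a `str` is applied to `bytes`. The sketch below reproduces it and shows one possible fix, decoding before tokenizing. It swaps in a plain regex tokenizer in place of word_tokenize and spaCy, and the `tokenize` helper, the `TOKEN` pattern, and the assumption that some values arrive as UTF-8 bytes are all hypothetical, since the excerpt does not show where the bytes come from.

```python
import re
from collections import Counter

# A str pattern: it can only be applied to str input.
TOKEN = re.compile(r"[a-z0-9'-]+")

# Reproduce the error by applying the str pattern to bytes.
try:
    TOKEN.findall(b"WHITE HANGING HEART T-LIGHT HOLDER")
except TypeError as exc:
    error_message = str(exc)


def tokenize(value):
    """Decode bytes (assumed UTF-8) before matching, then lowercase and split."""
    if isinstance(value, bytes):
        value = value.decode("utf-8", errors="replace")
    return TOKEN.findall(str(value).lower())


# One bytes value and one str value, based on the first row in the excerpt.
descriptions = [
    b"WHITE HANGING HEART T-LIGHT HOLDER",
    "white hanging heart t-light holder",
]
counts = Counter(token for d in descriptions for token in tokenize(d))

print(error_message)
print(counts.most_common(3))
```

If the bytes come from reading a file, decoding once at load time is usually cleaner than decoding inside the tokenizer.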
How to create a list of tokenized words from dataframe column using spaCy?
I’m trying to apply spaCy’s tokenizer to a dataframe column to get a new column containing a list of tokens. Assume we have the following dataframe: The code below aims to tokenize the Text column: The result looks like: Now we have a new column, tokens, which holds a doc object for each sentence. How could we change the code to get a python
Regex: searching for words that start with @ or @
I want to create a regex in Python that finds words that start with @ or @. I have created the following regex, but the output contains one extra space in each string, as you can see. However, the output that I want to have is the following. I would be grateful if you could help me! Edit: @The fourth
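The original regex is not visible in the excerpt, so the following is only a guess at the symptom. A pattern that matches the whitespace before the `@` returns that space as part of the match, whereas a lookbehind checks for it without consuming it. The sample text and variable names are made up.

```python
import re

text = "@alice thanks, cc @bob_99 and mail me at me@example.com"  # made-up sample

# Consuming the preceding whitespace puts it in the match, hence the extra space.
with_space = re.findall(r"(?:^|\s)@\w+", text)

# A negative lookbehind requires "no non-space character before @" without
# including anything in the match; it also skips the @ inside the e-mail address.
mentions = re.findall(r"(?<!\S)@\w+", text)

print(with_space)
print(mentions)
```

Calling `.strip()` on each match would also remove the stray space, but the lookbehind keeps the fix inside the pattern.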
How to handle numbers embedded in text during NLP pre-processing?
I am trying to run the LDA algorithm on a data set of news articles. I understand that numbers must be removed during the pre-processing step, and I have written a simple regex code to replace numbers with blanks. However, I would like to retain some numbers since removing them can potentially change the context/topic. For example, [Desired] ‘The fourth
How to calculate ticket classification after putting in a sentence? (Python/NLP)
I trained a model to classify tickets into 2 categories. I’m using GradientBoostClassifier. Now I want to call a function that takes any sentence and has the trained model calculate the probability that it belongs to category 1 or category 2. How do I write code for this? Let’s imagine the sentence that I want to