I have a similarity matrix of words and would like to apply an algorithm that can put the words in clusters. Here’s the example I have so far: Obviously this is a very simple dummy example, but what I would expect the output to be is 2 clusters, one with ‘The Bachelor’,’The Bachelorett…
Tag: nlp
What does this “.children” attribute do?
I’m trying to understand how a Key-Bigram extractor works, and I cannot understand what the following block of code does. Here is the source code. Everything else is working fine and I understood it well; however, I cannot understand what child for child in possible_words.children does. Answer token…
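The extractor's source is not shown in this excerpt, so the following is only a minimal sketch of what an expression like `child for child in possible_words.children` does in general. It assumes `possible_words` is an object whose `.children` attribute is an iterable of direct child nodes (spaCy's `Token.children`, for instance, yields a token's syntactic children). The `Node` class is a hypothetical stand-in, not part of the original code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a parsed token; not taken from the original source."""
    text: str
    children: List["Node"] = field(default_factory=list)


# Build a tiny tree: "extractor" has two direct children.
possible_words = Node("extractor", [Node("key"), Node("bigram")])

# `child for child in possible_words.children` visits each direct child in turn;
# wrapping it in [...] collects the visited items into a list.
child_texts = [child.text for child in possible_words.children]
print(child_texts)  # ['key', 'bigram']
```

The expression only walks one level down: grandchildren are not visited unless the code recurses into each child's own `.children`.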
Gensim Word2Vec exhausting iterable
I’m getting the following prompt when calling model.train() from Gensim Word2Vec. The only solutions I found while searching for an answer point to the iterable vs. iterator difference. At this point I have tried everything I could to solve this on my own. Currently, my code looks like this: The corpus varia…
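The iterable vs. iterator difference mentioned above can be illustrated without Gensim. Word2Vec makes several passes over the corpus (one to build the vocabulary, then one per training epoch), so a one-shot generator is empty after the first pass. The sketch below uses only the standard library; `RestartableCorpus` and the sample sentences are hypothetical and not taken from the asker's code.

```python
class RestartableCorpus:
    """Hypothetical corpus wrapper: each call to __iter__ starts a fresh pass."""

    def __init__(self, sentences):
        self.sentences = sentences

    def __iter__(self):
        for sentence in self.sentences:
            yield sentence.split()


raw = ["the cat sat", "the dog barked"]

# A generator is an iterator: it can be consumed only once.
one_shot = (s.split() for s in raw)
first_pass = list(one_shot)
second_pass = list(one_shot)  # empty, the generator is already exhausted

# An object with __iter__ is an iterable: every loop gets a new iterator.
corpus = RestartableCorpus(raw)
pass_a = list(corpus)
pass_b = list(corpus)

print(len(first_pass), len(second_pass))  # 2 0
print(len(pass_a), len(pass_b))           # 2 2
```

Passing an object like `corpus` (or a plain list of token lists) rather than a generator is the usual way to let a multi-pass consumer restart the stream.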
Page scraping using Beautiful Soup, without links
I am using the following code to extract text from a web page: The problem is that when I open text, I get all the links from the buttons at the top of the page, which I don’t want. How can I modify the above code to avoid this? I also get the footnotes, which I may want, but
Print strings of one dataframe contained in another dataframe
I have two dataframes: one consists of two columns (‘good’ and ‘bad’) and the other contains text data. Now I would like to retrieve exact string matches of words that are in the dictionary and are contained in col1 of df_text and assign the string match to the second column …
How to resolve TypeError: cannot use a string pattern on a bytes-like object – word_tokenize, Counter and spacy
My dataset is the sales transaction history of an online store. I need to create a category based on the text in the Description column. I have done some text pre-processing and clustering. This is what the head of the dataframe cat_df looks like: Description Text Cluster9 0 WHITE HANGING HEART T-LIGHT HOLDER white h…
How to create a list of tokenized words from dataframe column using spaCy?
I’m trying to apply spaCy’s tokenizer to a dataframe column to get a new column containing a list of tokens. Assume we have the following dataframe: The code below aims to tokenize the Text column: The result looks like: Now we have a new column, tokens, which holds a Doc object for each sentence. How could we …
Regex: searching for words that start with @ or @
I want to create a regex in Python that finds words that start with @ or @. I have created the following regex, but the output contains one extra space in each string, as you can see. However, the output that I want is the following. I would be grateful if you could help me! Edit: @The fourth
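The asker's regex and sample input are not shown in this excerpt, so the following is only a sketch of one common way to avoid a leading space in each match. Instead of matching the whitespace before the `@`, it asserts it with a lookbehind. The sample text is hypothetical, and the pattern handles only the plain `@` character.

```python
import re

# Hypothetical sample text; the original input is not shown in the excerpt.
text = "hello @alice and @bob, email me at carol@example.com"

# (?<!\S) requires start of string or whitespace before "@",
# so the match itself never includes the leading space.
pattern = re.compile(r"(?<!\S)@\w+")

mentions = pattern.findall(text)
print(mentions)  # ['@alice', '@bob']
```

Because the lookbehind consumes no characters, the address `carol@example.com` is skipped and the returned strings start directly at `@`.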
How to handle numbers embedded in text during NLP pre-processing?
I am trying to run the LDA algorithm on a data set of news articles. I understand that numbers must be removed during the pre-processing step, and I have written a simple regex code to replace numbers with blanks. However, I would like to retain some numbers since removing them can potentially change the cont…
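One possible way to retain some numbers, sketched under an assumption the excerpt does not confirm: numbers attached to letters or hyphens (such as "G20" or "COVID-19") carry meaning and should stay, while standalone numbers can be dropped. The sample sentence and the rule itself are hypothetical. The right policy depends on the corpus.

```python
import re

# Hypothetical sample; the original data set is not shown.
text = "In 2019 the G20 met 3 times and COVID-19 cases fell by 40 percent"

# \b\d+\b would also hit the "19" in "COVID-19", so use lookarounds that
# require whitespace (or string edges) on both sides of the digits.
standalone_number = re.compile(r"(?<!\S)\d+(?!\S)")

cleaned = standalone_number.sub(" ", text)
cleaned = re.sub(r"\s+", " ", cleaned).strip()  # collapse leftover spaces
print(cleaned)  # In the G20 met times and COVID-19 cases fell by percent
```

This rule also removes years such as "2019". If those matter for the topics, an allow-list of patterns to keep would have to be added before the substitution.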
How to calculate ticket classification after putting in a sentence? (Python/NLP)
I trained a model to classify tickets into 2 categories. I’m using GradientBoostClassifier. Now I want to call a function that takes any sentence and has the trained model calculate the probability that it belongs to category 1 or category 2. How do I write the code for this? Let’s imagin…