I’m trying to understand how a Key-Bigram extractor works, and I cannot understand what the following block of code does. Here is the source code. Everything else is working fine and I understand it well; however, I cannot understand what child for child in possible_words.children does. Answer: token.children uses the dependency parse to get all tokens that directly depend on
Tag: nlp
Gensim Word2Vec exhausting iterable
I’m getting the following prompt when calling model.train() from gensim word2vec. The only solutions I found while searching for an answer point to the iterable vs iterator difference. At this point I have tried everything I could to solve this on my own. Currently, my code looks like this: The corpus variable is a list containing sentences, and each
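For readers unfamiliar with the iterable vs iterator difference mentioned in this excerpt, here is a minimal standard-library sketch. It does not reproduce the asker's code (which is not shown here) and does not call gensim; the names RestartableCorpus and one_shot_corpus are hypothetical, chosen only for illustration. It shows why a generator yields nothing on a second pass, while an object that defines __iter__ can be traversed repeatedly.

```python
class RestartableCorpus:
    """Iterable: every call to __iter__ starts a fresh pass over the data."""

    def __init__(self, sentences):
        self.sentences = sentences

    def __iter__(self):
        # A new generator is created each time iteration begins.
        for sentence in self.sentences:
            yield sentence.split()


def one_shot_corpus(sentences):
    """Generator function: the iterator it returns can be consumed only once."""
    for sentence in sentences:
        yield sentence.split()


sentences = ["the cat sat", "the dog barked"]

generator = one_shot_corpus(sentences)
first_pass = list(generator)    # consumes the generator
second_pass = list(generator)   # empty: the generator is already exhausted

restartable = RestartableCorpus(sentences)
pass_a = list(restartable)      # full pass
pass_b = list(restartable)      # another full pass over the same data
```

Code that needs several passes over a corpus, as multi-epoch training typically does, would see only the first pass if handed the one-shot generator.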
page scraping using beautiful soup, without links
I am using the following code to extract text from a web page: The problem is, when I open text, I get all the links from the buttons that exist at the top of the page, which I don’t want. How can I modify the above code to do so? I also get the footnotes, which I may want, but
print strings of one dataframe contained in another dataframe
I have two dataframes: one dataframe consists of two columns (‘good’ and ‘bad’) and another one contains text data. Now I would like to retrieve exact string matches of words that are in the dictionary and are contained in col1 of df_text, and assign the string match to the second column of df_text. I tried .isin(), however this code
How to resolve TypeError: cannot use a string pattern on a bytes-like object – word_tokenize, Counter and spacy
My dataset is the sales transaction history of an online store. I need to create a category based on the texts in the Description column. I have done some text pre-processing and clustering. This is what the head of the dataframe cat_df looks like: Description Text Cluster9 0 WHITE HANGING HEART T-LIGHT HOLDER white hanging heart t-light holder 1 1 WHITE METAL
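The error named in the title can be reproduced with the standard library alone. The sketch below is not the asker's code (it is not shown in this excerpt); it only assumes that, somewhere in the pipeline, a bytes value reaches a function that applies a str regular expression. The sample text is taken from the dataframe row above; the pattern is a hypothetical stand-in for whatever a tokenizer uses internally.

```python
import re

# Assumption: the text arrives as bytes (for example, read from a file in binary mode).
raw = b"white hanging heart t-light holder"
pattern = r"[a-z\-]+"   # hypothetical str pattern, for illustration only

try:
    re.findall(pattern, raw)        # str pattern applied to bytes
    error_message = None
except TypeError as exc:
    error_message = str(exc)        # the message from the question title

# Decoding to str before matching avoids the error.
text = raw.decode("utf-8")
tokens = re.findall(pattern, text)
```

If the column holds bytes, decoding it to str before tokenizing is one possible fix; whether that is the actual cause in the question cannot be determined from the excerpt.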
How to create a list of tokenized words from dataframe column using spaCy?
I’m trying to apply spaCy’s tokenizer to a dataframe column to get a new column containing a list of tokens. Assume we have the following dataframe: The code below aims to tokenize the Text column: The result looks like: Now we have a new column, tokens, which holds a Doc object for each sentence. How could we change the code to get a python
Regex: searching for words that start with @ or @
I want to create a regex in Python that finds words that start with @ or @. I have created the following regex, but the output contains one extra space in each string, as you can see. However, the output that I want is the following. I would be grateful if you could help me! Edit: @The fourth
How to handle numbers embedded in text during NLP pre-processing?
I am trying to run the LDA algorithm on a data set of news articles. I understand that numbers must be removed during the pre-processing step, and I have written a simple regex code to replace numbers with blanks. However, I would like to retain some numbers since removing them can potentially change the context/topic. For example, [Desired] ‘The fourth
How to calculate ticket classification after putting in a sentence? (Python/NLP)
I trained a model to classify tickets into 2 categories. I’m using GradientBoostingClassifier. Now I want to call a function that takes any sentence and has the trained model calculate the probability of it being category 1 or category 2. How do I write code for this? Let’s imagine the sentence that I want to
KeyError on a certain word
I am trying to use Naive Bayes for spam-ham classification. I am getting a word error repeatedly here: The error message is just this: ‘hafta’ is the first word of the pandas dataframe and the training dataset. I tried the solution on this issue that seemed similar to mine, but it didn’t work out. I will appreciate any hint