How to use pandas apply to replace iterrows?

Question

I am calculating the sentiment value for every row in the dataset based on the news headline. I used iterrows to achieve this: However, the processing is taking too long (more than 30 minutes of runtime and it is still not done). I have 16.6k rows in my dataset. This is a small section of the dataset: I have read that i…

Accepted Answer

The problem with your code is that when you do dfp = dfp.append..., dfp is already defined as a global and you cannot reassign it; use another variable name, i.e. dfp_temp = dfp.append....

However, I think that apply is not what you want. Most ML models take an array-like as input, so you can pass the whole column to the model (or at least a big chunk of it) rather than one row at a time.

Something like this:

    import numpy as np
    import pandas as pd
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    field = 'headline'
    tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
    model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

    # All headlines at once, as a NumPy array
    texts = df[field].values
    # Pad/truncate so the headlines can be stacked into one tensor
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    # Unpack input_ids, attention_mask, etc. as keyword arguments
    output = model(**encoded_input)
    # Softmax over the class dimension of the logits
    probs = torch.nn.functional.softmax(output.logits, dim=-1)
    probs = probs.cpu().detach().numpy()
    dfp = pd.DataFrame({
        'pos': probs[:, 0],
        'neg': probs[:, 1],
        'neu': probs[:, 2]
    })

Edit: The tokenizer does not support an array. You can try vectorizing the tokenizer like this.

NOTE: np.vectorize and apply will not give you any significant boost, since they still iterate over each element. Even so, it is better to keep the use of apply and np.vectorize to the minimum possible extent.

    ...
    tokenizer_func = lambda text: tokenizer(text, return_tensors='pt')
    encoded_input = np.vectorize(tokenizer_func)(texts)
    ...
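The "big chunk" idea from the answer can be sketched as a small batching helper. This is only an illustration, not the answerer's code: iter_batches, score_in_batches and make_finbert_scorer are hypothetical names, the batch size of 32 is an arbitrary assumption, and the sketch assumes the headlines are passed as a plain Python list of strings (for example df['headline'].tolist()), which the Hugging Face tokenizer accepts as a batch.

```python
from typing import Callable, Iterator, List, Sequence

Scores = List[List[float]]  # one row of class probabilities per headline


def iter_batches(texts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


def score_in_batches(
    texts: Sequence[str],
    score_batch: Callable[[List[str]], Scores],
    batch_size: int = 32,  # assumed value; tune to available memory
) -> Scores:
    """Call `score_batch` once per chunk and concatenate the rows in order."""
    results: Scores = []
    for batch in iter_batches(texts, batch_size):
        results.extend(score_batch(batch))
    return results


def make_finbert_scorer(model_name: str = "ProsusAI/finbert") -> Callable[[List[str]], Scores]:
    """Hypothetical helper: build a scorer backed by the model from the answer.

    The heavy imports are local so the batching helpers above can be used
    (and tested) without torch or transformers installed.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    def score_batch(batch: List[str]) -> Scores:
        # Pad and truncate so every headline in the batch has the same length
        enc = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():  # inference only, no gradients needed
            logits = model(**enc).logits
        return torch.nn.functional.softmax(logits, dim=-1).tolist()

    return score_batch
```

Under those assumptions, usage might look like scores = score_in_batches(df['headline'].tolist(), make_finbert_scorer()), followed by dfp = pd.DataFrame(scores, columns=['pos', 'neg', 'neu']), keeping the column order used in the answer. The model then runs once per batch instead of once per row, which is where most of the time is likely to be saved.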


Answer