
How to use pandas apply to replace iterrows?

I am calculating a sentiment value for every row in the dataset, based on the news headline. I used iterrows to achieve this:

field = 'headline'
dfp = pd.DataFrame(columns=('pos', 'neg', 'neu'))

tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")

model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

for index, row in df.iterrows():
    text = row[field]
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    probs = torch.nn.functional.softmax(output[0], dim=-1)
    probs_arr = probs.cpu().detach().numpy()
    dfp = dfp.append({'pos': probs_arr[0][0],
                      'neg': probs_arr[0][1],
                      'neu': probs_arr[0][2]
                     }, ignore_index=True)

However, the processing takes too long (more than 30 minutes of runtime and it is still not done). I have 16.6k rows in my dataset.

This is a small section of the dataset:

    datetime            headline
0   2020-03-17 16:57:07 12 best noise-cancelling headphones: In-ear an...
1   2020-06-08 14:00:55 5G Stocks To Buy And Watch: Pricing of 5G Smar...
2   2020-06-19 10:00:00 10 best wireless printers that will make your ...
3   2020-08-19 00:00:00 Apple Confirms Solid New iOS 14 Security Move ...
4   2020-08-19 00:00:00 Apple Becomes First U.S. Company Worth More Th...

I have read that iterrows is not recommended in most situations unless the dataset is small and optimization is not a concern. The alternative, it seems, is to use apply, since apply goes through each pandas row and is optimized.

Some of the SO topics I read suggested creating a function and running it in apply. This is what I attempted:

def calPred(text):
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    probs = torch.nn.functional.softmax(output[0], dim=-1)
    probs_arr = probs.cpu().detach().numpy()
    dfp = dfp.append({'pos': probs_arr[0][0],
                      'neg': probs_arr[0][1],
                      'neu': probs_arr[0][2]
                     }, ignore_index=True)

df['headline'].apply(lambda x: calPred(x))

It returned an error UnboundLocalError: local variable 'dfp' referenced before assignment.

I would appreciate it if someone could guide me on how to optimize this and use apply correctly. Thanks in advance.


Answer

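The problem with your code is that dfp = dfp.append(...) assigns to dfp inside calPred. Because of that assignment, Python treats dfp as a local variable for the whole function, so the right-hand side reads it before it has a value and raises UnboundLocalError. You can either declare global dfp at the top of the function or, better, have the function return the three probabilities and build the DataFrame from the returned values.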
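The scoping rule can be reproduced without pandas or a model. The following is a minimal sketch with hypothetical names (totals, broken, returns_value are not from the question); it only illustrates why the assignment triggers the error and what the return-a-value alternative looks like.

```python
totals = []  # module-level name, playing the role of `dfp` in the question


def broken(value):
    # The assignment below makes `totals` local to this function, so the
    # right-hand side reads a local name that has no value yet.
    totals = totals + [value]


def returns_value(value):
    # Alternative: compute and return, and let the caller collect results.
    return value * 2


try:
    broken(1)
    error_name = None
except UnboundLocalError as exc:
    error_name = type(exc).__name__

# The caller gathers the returned values instead of mutating a global.
results = [returns_value(v) for v in (1, 2, 3)]
```

With pandas, the same pattern means letting the function passed to apply return its result, instead of appending to an outer DataFrame from inside the function.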

However, I think that apply is not what you want. Most ML models accept a batch of inputs, so you can pass the whole column to the model (or at least a big chunk of it) rather than one row at a time.

Something like this:

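import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

field = 'headline'

tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")

model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

# The tokenizer expects a list of strings, not a NumPy array
texts = df[field].tolist()
# Pad and truncate so all headlines fit in one rectangular tensor
encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():  # inference only, no gradients needed
    output = model(**encoded_input)
probs = torch.nn.functional.softmax(output.logits, dim=-1)
probs = probs.cpu().numpy()

dfp = pd.DataFrame({
    'pos': probs[:, 0],
    'neg': probs[:, 1],
    'neu': probs[:, 2]
})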
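Passing all 16.6k headlines in a single call may not fit in memory, which is why "a big chunk" is often the practical choice. Below is a rough sketch of chunked inference. The helper names chunks and predict_probs and the default batch_size=32 are assumptions for illustration, not part of any library, and the sketch assumes a non-empty list of texts and a Hugging Face tokenizer/model pair like the one loaded above.

```python
from typing import Iterator, List, Sequence


def chunks(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def predict_probs(texts, tokenizer, model, batch_size=32):
    """Return an (n_texts, n_labels) array of softmax probabilities.

    Hypothetical helper: assumes `texts` is a non-empty list of strings and
    that `tokenizer` / `model` are a matching Hugging Face pair.
    """
    # Imported here so the chunking helper above has no heavy dependencies.
    import numpy as np
    import torch

    model.eval()
    parts = []
    with torch.no_grad():  # inference only
        for batch in chunks(texts, batch_size):
            enc = tokenizer(batch, padding=True, truncation=True,
                            return_tensors="pt")
            logits = model(**enc).logits
            parts.append(torch.softmax(logits, dim=-1).cpu().numpy())
    return np.concatenate(parts, axis=0)
```

A call such as predict_probs(df['headline'].tolist(), tokenizer, model) would return an array whose columns can be placed into dfp as 'pos', 'neg' and 'neu'. A suitable batch size depends on the available memory and is worth tuning.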

Edit: The tokenizer does not accept a NumPy array; it expects a string or a list of strings, so converting the column with .tolist() is the simplest fix.

Alternatively, you can try vectorizing the tokenizer like this:

NOTE: np.vectorize and apply will not give you any significant speedup, since they still iterate over each element, so rely on them as little as possible.

...
tokenizer_func = lambda text: tokenizer(text, return_tensors='pt')
encoded_input = np.vectorize(tokenizer_func)(texts)
...
User contributions licensed under: CC BY-SA