
Select a variable number of tokens from a pandas column based on tuples in another column

I have a data frame with two columns: sentence containing text and selector containing arrays of tuples of varying lengths.

Consider the following data frame as an example:


I now want to select the words from sentence at the positions given by the second element of each tuple (ignoring the 10 in each tuple). E.g. for the first row, there is only one tuple, (10, 0, 1), whose second element is 0, so I want the token at position 0 of sentence: KEEP. (For clarity, I have spelled all words to be selected in ALL CAPS.)
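To make the selection rule concrete, here is a minimal sketch on a hypothetical data frame. Only the first row's selector, (10, 0, 1), and its KEEP token are taken from the description above; the other rows, the remaining words, and the third tuple elements are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data; only the first selector and the KEEP token
# come from the question, everything else is made up for illustration.
df = pd.DataFrame({
    "sentence": [
        "KEEP this sentence short",
        "ignore ALPHA and BETA here",
        "nothing to select here",
    ],
    "selector": [
        [(10, 0, 1)],
        [(10, 1, 2), (10, 3, 4)],
        [],  # rows without any tuple are possible as well
    ],
})

# Apply the rule by hand to the first row.
first_sentence = df.loc[0, "sentence"]
first_selector = df.loc[0, "selector"]

tokens = first_sentence.split()                     # whitespace tokenisation
positions = [tup[1] for tup in first_selector]      # second element of each tuple
selected = [tokens[pos] for pos in positions]

print(selected)  # ['KEEP']
```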

I would like to get a dataframe looking like this:


Accessing the first token works well: df['tok0_pos'] = df['selector'].str[0].str[1] gives the positions and df['words0'] = [txt.split()[loc] for txt, loc in zip(df['sentence'], df['tok0_pos'])] gives the tokens. However, because the number of tuples varies (the real data set contains 0-25 tuples per row in the column selector), repeating this for every tuple index is tedious and breaks on rows that have fewer tuples.
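For reference, the first-token approach can be run as follows. The two statements are the ones quoted above; the data frame is a hypothetical stand-in in which every row has at least one tuple, since that is the case the approach is described as handling.

```python
import pandas as pd

# Hypothetical data; every row has at least one tuple in selector.
df = pd.DataFrame({
    "sentence": ["KEEP this sentence short", "ignore ALPHA and BETA here"],
    "selector": [[(10, 0, 1)], [(10, 1, 2), (10, 3, 4)]],
})

# Position stored in the first tuple of each row (second tuple element).
df["tok0_pos"] = df["selector"].str[0].str[1]

# Token at that position in the whitespace-split sentence.
df["words0"] = [
    txt.split()[loc] for txt, loc in zip(df["sentence"], df["tok0_pos"])
]

print(df["words0"].tolist())  # ['KEEP', 'ALPHA']
```

Note that the second row yields only ALPHA: BETA sits in the second tuple, which would need its own pair of columns with this approach.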

Can someone point out how best to obtain the column selected_tokens in the sample dataset?


Answer

One solution:

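A minimal sketch of one way to do this with plain Python, not necessarily the exact code of this answer: split each sentence once and pick the token at the second element of every tuple. The data frame and the helper name select_tokens are hypothetical.

```python
import pandas as pd

# Hypothetical sample data, including a row with an empty selector.
df = pd.DataFrame({
    "sentence": [
        "KEEP this sentence short",
        "ignore ALPHA and BETA here",
        "nothing to select here",
    ],
    "selector": [
        [(10, 0, 1)],
        [(10, 1, 2), (10, 3, 4)],
        [],
    ],
})


def select_tokens(sentence, selector):
    """Return the tokens of `sentence` at the second element of each tuple."""
    tokens = sentence.split()
    return [tokens[tup[1]] for tup in selector]


# Row-wise list comprehension; works for any number of tuples, including zero.
df["selected_tokens"] = [
    select_tokens(sentence, selector)
    for sentence, selector in zip(df["sentence"], df["selector"])
]

print(df["selected_tokens"].tolist())
# [['KEEP'], ['ALPHA', 'BETA'], []]
```

Because the comprehension iterates over whatever tuples a row contains, no per-position helper columns are needed.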

Output


An alternative solution is to use NumPy and take advantage of its advanced indexing features:

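As a rough sketch of the NumPy variant (again an illustration under assumed sample data, not necessarily the answer's original code): turn the tokens into an array and index it with an integer array of positions, which returns all requested tokens in one step. The helper name select_tokens_np is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, including a row with an empty selector.
df = pd.DataFrame({
    "sentence": [
        "KEEP this sentence short",
        "ignore ALPHA and BETA here",
        "nothing to select here",
    ],
    "selector": [
        [(10, 0, 1)],
        [(10, 1, 2), (10, 3, 4)],
        [],
    ],
})


def select_tokens_np(sentence, selector):
    """Select tokens via NumPy integer-array (advanced) indexing."""
    tokens = np.array(sentence.split())
    # dtype=int keeps an empty selector usable as an index array.
    positions = np.array([tup[1] for tup in selector], dtype=int)
    return tokens[positions].tolist()


df["selected_tokens"] = [
    select_tokens_np(sentence, selector)
    for sentence, selector in zip(df["sentence"], df["selector"])
]

print(df["selected_tokens"].tolist())
# [['KEEP'], ['ALPHA', 'BETA'], []]
```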
User contributions licensed under: CC BY-SA