
select variable number of tokens from pandas column based on tuples in another column

I have a data frame with two columns: sentence, containing text, and selector, containing lists of tuples of varying lengths.

Consider the following data frame as an example:

import pandas as pd

df = pd.DataFrame({'sentence': ['KEEP some of the words from this sentence.',
                                'Keep SOME of THE words from this sentence.',
                                'KEEP some OF the WORDS from this sentence.',
                                'Keep SOME of THE words FROM this SENTENCE.'],
                   'selector': [[(10, 0, 1)],
                                [(10, 1, 2), (10, 3, 4)],
                                [(10, 0, 1), (10, 2, 3), (10, 4, 5)],
                                [(10, 1, 2), (10, 3, 4), (10, 5, 6), (10, 7, 8)]]})

I now want to select the words from sentence at the position indicated by the second element in each tuple (ignoring the 10 in each tuple). E.g. for the first row, I want the token in column sentence at the second position of all tuples (of which there is only one: (10, 0, 1)), i.e. the token at position 0: KEEP. (For clarity, I have spelled all words to be selected in ALL CAPS).

I would like to get a dataframe looking like this:

sentence                                    selector                                          selected_tokens
KEEP some of the words from this sentence.  [(10, 0, 1)]                                      ['KEEP']
Keep SOME of THE words from this sentence.  [(10, 1, 2), (10, 3, 4)]                          ['SOME', 'THE']
KEEP some OF the WORDS from this sentence.  [(10, 0, 1), (10, 2, 3), (10, 4, 5)]              ['KEEP', 'OF', 'WORDS']
Keep SOME of THE words FROM this SENTENCE.  [(10, 1, 2), (10, 3, 4), (10, 5, 6), (10, 7, 8)]  ['SOME', 'THE', 'FROM', 'SENTENCE']

Accessing the first token works well using df['tok0_pos'] = df['selector'].str[0].str[1] for the positions and df['words0'] = [txt.split()[loc] for txt, loc in zip(df['sentence'], df['tok0_pos'])] for the tokens. However, because the number of tuples varies per row (the real data set contains 0-25 tuples in the column selector), repeating this for every position is tedious and fails for rows that have fewer tuples.
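To illustrate the problem, here is a minimal sketch on a two-row sample (not the full data set) that applies the same pattern to the second tuple instead of the first. The name tok1_pos is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "sentence": ["KEEP some of the words from this sentence.",
                 "Keep SOME of THE words from this sentence."],
    "selector": [[(10, 0, 1)], [(10, 1, 2), (10, 3, 4)]],
})

# Position stored in the *second* tuple of each row.
# The first row has only one tuple, so there is nothing to pick there.
tok1_pos = df["selector"].str[1].str[1]
print(tok1_pos)
```

The first row ends up with a missing value rather than an integer, and a missing value cannot be used as a list index in the follow-up list comprehension.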

Can someone point out how to best attain the column selected_tokens in the sample dataset?

Answer

One solution:

# s is the token position (second element of each tuple)
df["selected_tokens"] = [[sent[s] for _, s, _ in select]
                         for sent, select in zip(df["sentence"].str.split(), df["selector"])]
print(df["selected_tokens"])

Output

0                          [KEEP]
1                     [SOME, THE]
2               [KEEP, OF, WORDS]
3    [SOME, THE, FROM, SENTENCE.]
Name: selected_tokens, dtype: object
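
Note that the last token is SENTENCE. with the period attached, because str.split() splits on whitespace only, while the expected output in the question shows SENTENCE. If the punctuation should go, one possible sketch (assuming that stripping leading and trailing ASCII punctuation from each selected token is acceptable) is:

```python
import string

import pandas as pd

df = pd.DataFrame({
    "sentence": ["KEEP some of the words from this sentence.",
                 "Keep SOME of THE words FROM this SENTENCE."],
    "selector": [[(10, 0, 1)],
                 [(10, 1, 2), (10, 3, 4), (10, 5, 6), (10, 7, 8)]],
})

# Split on whitespace, pick the token at the second tuple element,
# then strip leading/trailing punctuation from each picked token.
df["selected_tokens"] = [
    [tokens[pos].strip(string.punctuation) for _, pos, _ in select]
    for tokens, select in zip(df["sentence"].str.split(), df["selector"])
]
print(df["selected_tokens"])
```

This leaves the token positions untouched, since the punctuation is removed only after the tokens have been selected.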

An alternative solution is to use NumPy and take advantage of its advanced indexing:

import numpy as np

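# Convert each token list to a NumPy array so it can be indexed with a list of positions
sentences = df["sentence"].str.split().apply(np.array)
# Collect the token positions (second element of each tuple) for every row
indices = [[s[1] for s in select] for select in df["selector"]]
# Each result is a NumPy array of the selected tokens
df["selected_tokens"] = [sentence[i] for sentence, i in zip(sentences, indices)]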
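
With the NumPy approach each cell holds an array rather than a list. Since the question mentions rows with 0 tuples, here is a hedged variant of the same idea, sketched on a two-row sample that includes an empty selector. It assumes plain Python lists are wanted in the result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sentence": ["Keep SOME of THE words from this sentence.",
                 "No tokens are selected here."],
    "selector": [[(10, 1, 2), (10, 3, 4)], []],
})

sentences = df["sentence"].str.split().apply(np.array)
# An explicit integer dtype keeps an empty selector a valid index array.
indices = [np.asarray([s[1] for s in select], dtype=int)
           for select in df["selector"]]
# .tolist() turns each NumPy result back into a plain list of strings.
df["selected_tokens"] = [sentence[i].tolist()
                         for sentence, i in zip(sentences, indices)]
print(df["selected_tokens"])
```

Rows without any tuples then simply yield an empty list.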
User contributions licensed under: CC BY-SA