
Remove items from a list of two-element tuples across rows

I have more than 1,000 rows of POS-tagged sentences. I want to remove the words tagged "RB", "IN", "PRP", "CC", "PR", or "DT".

Here is my data. The "pos_tag" column shows the data as it is now; the "pos_tag_clean" column shows what I would like to get after removing those words.

| pos_tag | pos_tag_clean |
|:---|:---|
| [(semoga, SC), (saja, RB), (di, IN), (sini, PR), (bisa, MD), (cepat, JJ), (cair, NN), (semoga, NN), (saja, RB), (ini, PR), (beneran, NN), (ada, VB), (nya, NN), (bantuan, NN), (buat, JJ), (butuh, VB), (banget, NN)] | [(semoga, SC), (bisa, MD), (cepat, JJ), (cair, NN), (semoga, NN), (beneran, NN), (ada, VB), (nya, NN), (bantuan, NN), (buat, JJ), (butuh, VB), (banget, NN)] |
| [(kak, VB), (kenapa, WH), (perbaikan, NN), (sistem, NN), (nya, PRP), (tidak, NEG), (selesai, VB)] | [(kak, VB), (kenapa, WH), (perbaikan, NN), (sistem, NN), (tidak, NEG), (selesai, VB)] |
| [(sangat, RB), (baik, JJ)] | [(baik, JJ)] |

I tried the code below, but it does not work when looping across rows.

df['pos_tag'].pop(df['pos_tag'].index(('The', 'DT')))

invalid_tuples = []
for i, t in df['pos_tag']:
    if t[1] in ("RB", "IN", "PRP", "CC", "PR", "DT", "CC"):
        invalid_tuples.append(i)
for i in invalid_tuples:
    del df['pos_tag'][i]

Answer

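Instead of deleting items in place, build a new list for each row with apply and a list comprehension, keeping only the tuples whose tag is not in the set of unwanted tags: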

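# Tags to drop; a set makes the membership test fast.
forbidden = {"RB", "IN", "PRP", "CC", "PR", "DT"}

# Build a new list per row, keeping only tuples whose tag is allowed.
df["pos_tag_clean"] = df["pos_tag"].apply(
    lambda x: [(v, tag) for v, tag in x if tag not in forbidden]
)
print(df.to_markdown(index=False))  # to_markdown needs the "tabulate" package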

Prints:

| pos_tag | pos_tag_clean |
|:---|:---|
| [('semoga', 'SC'), ('saja', 'RB'), ('di', 'IN'), ('sini', 'PR'), ('bisa', 'MD'), ('cepat', 'JJ'), ('cair', 'NN'), ('semoga', 'NN'), ('saja', 'RB'), ('ini', 'PR'), ('beneran', 'NN'), ('ada', 'VB'), ('nya', 'NN'), ('bantuan', 'NN'), ('buat', 'JJ'), ('butuh', 'VB'), ('banget', 'NN')] | [('semoga', 'SC'), ('bisa', 'MD'), ('cepat', 'JJ'), ('cair', 'NN'), ('semoga', 'NN'), ('beneran', 'NN'), ('ada', 'VB'), ('nya', 'NN'), ('bantuan', 'NN'), ('buat', 'JJ'), ('butuh', 'VB'), ('banget', 'NN')] |
| [('kak', 'VB'), ('kenapa', 'WH'), ('perbaikan', 'NN'), ('sistem', 'NN'), ('nya', 'PRP'), ('tidak', 'NEG'), ('selesai', 'VB')] | [('kak', 'VB'), ('kenapa', 'WH'), ('perbaikan', 'NN'), ('sistem', 'NN'), ('tidak', 'NEG'), ('selesai', 'VB')] |
| [('sangat', 'RB'), ('baik', 'JJ')] | [('baik', 'JJ')] |
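
To reproduce this without the original data, here is a minimal self-contained sketch. It assumes each cell of "pos_tag" already holds a Python list of (word, tag) tuples, not a string. The two-row DataFrame is copied from the sample rows in the question, and the names FORBIDDEN and drop_tags are illustrative only, not part of pandas.

```python
import pandas as pd

# Tags to drop (hypothetical constant name); a set gives fast membership tests.
FORBIDDEN = {"RB", "IN", "PRP", "CC", "PR", "DT"}


def drop_tags(tagged, forbidden=FORBIDDEN):
    """Return a new list without the (word, tag) pairs whose tag is forbidden."""
    return [(word, tag) for word, tag in tagged if tag not in forbidden]


# Two sample rows taken from the question.
df = pd.DataFrame(
    {
        "pos_tag": [
            [("kak", "VB"), ("kenapa", "WH"), ("perbaikan", "NN"),
             ("sistem", "NN"), ("nya", "PRP"), ("tidak", "NEG"),
             ("selesai", "VB")],
            [("sangat", "RB"), ("baik", "JJ")],
        ]
    }
)

# apply calls drop_tags once per row; the original column is left unchanged.
df["pos_tag_clean"] = df["pos_tag"].apply(drop_tags)
print(df["pos_tag_clean"].tolist())
```

A named function behaves the same as the lambda above, but it is easier to test on its own and to reuse with a different set of tags, for example df["pos_tag"].apply(drop_tags, forbidden={"RB"}).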
User contributions licensed under: CC BY-SA