I have more than 1000 rows of POS-tagged sentences. I want to remove the words that are tagged with "RB", "IN", "PRP", "CC", "PR", or "DT".
Here is my data. The "pos_tag" column shows the data as it is now; the "pos_tag_clean" column is what I would like to see after removing those words.
pos_tag | pos_tag_clean |
---|---|
[(semoga, SC), (saja, RB), (di, IN), (sini, PR), (bisa, MD), (cepat, JJ), (cair, NN), (semoga, NN), (saja, RB), (ini, PR), (beneran, NN), (ada, VB), (nya, NN), (bantuan, NN), (buat, JJ), (butuh, VB), (banget, NN)] | [(semoga, SC), (bisa, MD), (cepat, JJ), (cair, NN), (semoga, NN), (beneran, NN), (ada, VB), (nya, NN), (bantuan, NN), (buat, JJ), (butuh, VB), (banget, NN)] |
[(kak, VB), (kenapa, WH), (perbaikan, NN), (sistem, NN), (nya, PRP), (tidak, NEG), (selesai, VB)] | [(kak, VB), (kenapa, WH), (perbaikan, NN), (sistem, NN), (tidak, NEG), (selesai, VB)] |
[(sangat, RB), (baik, JJ)] | [(baik, JJ)] |
I tried the following code, but it does not work for looping across rows:
    df['pos_tag'].pop(df['pos_tag'].index(('The', 'DT')))

    invalid_tuples = []
    for i, t in df['pos_tag']:
        if t[1] in ("RB", "IN", "PRP", "CC", "PR", "DT", "CC"):
            invalid_tuples.append(i)

    for i in invalid_tuples:
        del df['pos_tag'][i]
Answer
Try:
    forbidden = {"RB", "IN", "PRP", "CC", "PR", "DT", "CC"}

    df["pos_tag_clean"] = df["pos_tag"].apply(
        lambda x: [(v, tag) for v, tag in x if tag not in forbidden]
    )

    print(df.to_markdown(index=False))
Prints:
pos_tag | pos_tag_clean |
---|---|
[('semoga', 'SC'), ('saja', 'RB'), ('di', 'IN'), ('sini', 'PR'), ('bisa', 'MD'), ('cepat', 'JJ'), ('cair', 'NN'), ('semoga', 'NN'), ('saja', 'RB'), ('ini', 'PR'), ('beneran', 'NN'), ('ada', 'VB'), ('nya', 'NN'), ('bantuan', 'NN'), ('buat', 'JJ'), ('butuh', 'VB'), ('banget', 'NN')] | [('semoga', 'SC'), ('bisa', 'MD'), ('cepat', 'JJ'), ('cair', 'NN'), ('semoga', 'NN'), ('beneran', 'NN'), ('ada', 'VB'), ('nya', 'NN'), ('bantuan', 'NN'), ('buat', 'JJ'), ('butuh', 'VB'), ('banget', 'NN')] |
[('kak', 'VB'), ('kenapa', 'WH'), ('perbaikan', 'NN'), ('sistem', 'NN'), ('nya', 'PRP'), ('tidak', 'NEG'), ('selesai', 'VB')] | [('kak', 'VB'), ('kenapa', 'WH'), ('perbaikan', 'NN'), ('sistem', 'NN'), ('tidak', 'NEG'), ('selesai', 'VB')] |
[('sangat', 'RB'), ('baik', 'JJ')] | [('baik', 'JJ')] |
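For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch of the same idea. The helper name `drop_tags` and the sample DataFrame are made up for illustration. The `ast.literal_eval` step rests on an assumption that is not confirmed by the question: if the column was loaded from a CSV file, each cell may be the string form of a list rather than an actual list of tuples, and it would need parsing first.

```python
import ast

import pandas as pd

# Tags to drop, taken from the question (the duplicate "CC" is omitted).
FORBIDDEN = {"RB", "IN", "PRP", "CC", "PR", "DT"}


def drop_tags(tagged, forbidden=FORBIDDEN):
    """Return a new list without the (word, tag) pairs whose tag is forbidden."""
    # Assumption: a cell read back from CSV may be a string such as
    # "[('sangat', 'RB'), ('baik', 'JJ')]"; parse it into a real list first.
    if isinstance(tagged, str):
        tagged = ast.literal_eval(tagged)
    return [(word, tag) for word, tag in tagged if tag not in forbidden]


# Hypothetical sample data: two rows from the question plus one string-form row.
df = pd.DataFrame({
    "pos_tag": [
        [("kak", "VB"), ("kenapa", "WH"), ("perbaikan", "NN"), ("sistem", "NN"),
         ("nya", "PRP"), ("tidak", "NEG"), ("selesai", "VB")],
        [("sangat", "RB"), ("baik", "JJ")],
        "[('sangat', 'RB'), ('baik', 'JJ')]",
    ]
})

# apply() calls drop_tags once per row, so no explicit loop is needed.
df["pos_tag_clean"] = df["pos_tag"].apply(drop_tags)
print(df["pos_tag_clean"].tolist())
```

Building a new list with a comprehension sidesteps the problem in the original attempt, where items are deleted from a list by index while those indexes are still being used.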