I have this vocabulary file of strings: https://drive.google.com/file/d/1mL461QGC5KcA3M1r8AESaPjZ3D_ufgPA/view?usp=sharing
I have this sentences file, built entirely from the vocabulary above: https://drive.google.com/file/d/1w5ma4ROjyp6xmZfvnIQjsdH2I_K7lHoo/view?usp=sharing
I want to map every token of every sentence to its corresponding integer in the vocab file.
What I have tried so far: first, I read all the sentences into a list and built this DataFrame from it:
    import pandas as pd

    f = open(f'./drive/MyDrive/[kepsdataset/train_preprocess.txt', "r")

    output = []
    dicts = {}
    tokens = []
    tags = []

    for line in f:
        if len(line.strip()) != 0:
            # Each non-empty line holds one token and its tag, tab-separated
            fields = line.split('\t')
            text = fields[0].lower()
            tag = fields[1].strip()
            tokens.append(text)
            tags.append(tag)
        else:
            # A blank line marks the end of the current sentence
            dicts['token'] = tokens  # this is the sentences I want to map into integer
            dicts['tag'] = tags
            output.append(dicts)
            dicts = {}
            tokens = []
            tags = []

    df = pd.DataFrame(output)
    df.head(10)
I have converted the vocabulary list (from the vocab file) into a list of integers:
    import numpy as np

    my_file = open("vocab_uncased.txt", "r")
    data = my_file.read()
    data_into_list = data.split("\n")
    print(data_into_list)

    # Each entry's id is its position among the unique vocabulary entries
    encoded_string = [
        np.where(np.array(list(dict.fromkeys(data_into_list))) == e)[0][0]
        for e in data_into_list
    ]
    print(encoded_string)
What I want to do is to put the encoded string into the DataFrame above. How can I do it? Example:
sentence (in the token field of the DataFrame): ['Setelah', 'melalui', 'proses', 'telepon', 'yang', 'panjang', 'tutup', 'sudah', 'kartu', 'kredit', 'bca', 'ribet']
encoded sentence (using the vocab file): [2024, 1317, 1806, 2182, 2400, 1624, 2333, 2107, 1013, 1155, 317, 1853] --> to be put into a new DataFrame column
Answer
IIUC:
    df = pd.DataFrame(output)
    vocab = pd.Series(encoded_string, index=data_into_list)
    df['encoded'] = (
        df.explode(df.columns.tolist())['token']
          .map(vocab)
          .groupby(level=0)
          .agg(list)
    )
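To see what each step of that chain does, here is a minimal sketch on a toy frame. The tokens and ids in `toy_df` and `toy_vocab` are made up for illustration and do not come from the real vocab file. It also explodes only the `token` column, which is enough for the lookup and avoids relying on multi-column `explode` (available from pandas 1.3).

```python
import pandas as pd

# Toy stand-ins for the real data (hypothetical tokens and ids).
toy_df = pd.DataFrame({
    "token": [["saya", "pakai", "bca"], ["bca", "ribet"]],
    "tag": [["O", "B", "I"], ["B", "O"]],
})
toy_vocab = pd.Series([0, 1, 2, 3], index=["bca", "pakai", "ribet", "saya"])

# 1) explode: one row per token; the original row label is repeated.
exploded = toy_df["token"].explode()

# 2) map: look each token up in the index of toy_vocab.
ids = exploded.map(toy_vocab)

# 3) groupby(level=0): regroup by the original row label and collect lists.
toy_df["encoded"] = ids.groupby(level=0).agg(list)

print(toy_df["encoded"].tolist())
```

Because the regrouped Series carries the original row labels, assigning it back to the frame lines each encoded list up with its sentence.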
Output:
>>> df
     token                                               tag                                                 encoded
0    [setelah, melalui, proses, telepon, yang, panj...  [O, B, B, I, O, O, B, O, B, I, I, B]                [2024, 1317, 1806, 2182, 2400, 1624, 2333, 210...
1    [@halobca, saya, mencoba, mengakses, menu, m-b...  [B, O, O, B, B, I, O, O, O, B, I, O, O, O, O, ...   [130, 1917, 1374, 1403, 1470, 1240, 1917, 1545...
2    [hanya, saya, atau, @halobca, klikbca, bisnis,...  [O, O, O, B, B, I, O, B]                            [857, 1917, 249, 130, 1130, 439, 1332, 767]
3    [teller, bank, bca, ini, menanyakan, kabar, sa...  [O, O, O, O, O, O, O, B, O]                         [2190, 288, 317, 918, 1365, 983, 1917, 2081, 1...
4    [bca, senantiasa, menjaga, rahasia, data, cust...  [B, O, B, B, B, I]                                  [317, 1983, 1458, 1824, 575, 551]
..   ...                                                ...                                                 ...
794  [hi, cs, kenapa, pelayanan, di, bca, kodya, te...  [O, B, O, B, O, B, I, I, I, I, I, I, O, B, O, ...   [873, 540, 1077, 1657, 598, 317, 1136, 2175, 2...
795  [walau, sudah, prioritas, tetap, saja, antreny...  [O, O, B, O, O, B, B, O, O, B, O, O, O]             [2374, 2107, 1791, 2281, 1885, 231, 1183, 282,...
796  [selama, menggunakan, layanan, e-channel, bca,...  [O, B, O, B, I, O, O, O, B, I, B, I, B, B, B, ...   [1966, 1427, 1198, 746, 317, 1520, 2288, 1341,...
797  [mau, menabung, mau, simpan, uang, atau, pun, ...  [O, B, O, B, B, O, O, B, B, I, O, O, O, B]          [1306, 1361, 1306, 2055, 2335, 249, 1817, 1491...
798  [toko, daring, juga, kebanyakan, pakai, bca, m...  [B, I, O, O, B, I, I, O, O, O, B, B, I, I, B, ...   [2297, 569, 976, 1037, 1609, 317, 1238, 258, 1...

[799 rows x 3 columns]
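If some tokens might be missing from the vocabulary, or the vocab file contains repeated lines, a plain dictionary lookup may be easier to reason about than a Series. The following is only a sketch under assumptions: `vocab_lines` stands in for the contents of `vocab_uncased.txt`, and `UNK_ID = -1` is a hypothetical placeholder for unknown tokens that the question does not define.

```python
import pandas as pd

# Hypothetical vocabulary lines; note the repeated "bca" entry.
vocab_lines = ["bca", "pakai", "ribet", "saya", "bca"]

# First occurrence wins, so ids match the position among unique entries,
# which is what the np.where(...)[0][0] expression in the question computes.
token_to_id = {}
for token in vocab_lines:
    token_to_id.setdefault(token, len(token_to_id))

UNK_ID = -1  # assumed placeholder for out-of-vocabulary tokens


def encode(tokens):
    """Map a list of tokens to ids, falling back to UNK_ID."""
    return [token_to_id.get(t, UNK_ID) for t in tokens]


demo = pd.DataFrame({"token": [["saya", "pakai", "bca"], ["bca", "mahal"]]})
demo["encoded"] = demo["token"].apply(encode)
print(demo["encoded"].tolist())
```

With the real data, `demo` would be the `df` built in the question. The trade-off is a Python-level loop per sentence instead of vectorised `map`, which is usually acceptable for a few hundred rows.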