I have this vocab file of strings: https://drive.google.com/file/d/1mL461QGC5KcA3M1r8AESaPjZ3D_ufgPA/view?usp=sharing.
I have this sentences file, built from the vocab file above: https://drive.google.com/file/d/1w5ma4ROjyp6xmZfvnIQjsdH2I_K7lHoo/view?usp=sharing.
I want to map every token of each sentence to its corresponding integer in the vocab file.
What I have tried so far: first, I read all the sentences into a list and built this DataFrame:
    import pandas as pd

    output = []
    dicts = {}
    tokens = []
    tags = []

    with open('./drive/MyDrive/[kepsdataset/train_preprocess.txt', "r") as f:
        for line in f:
            if len(line.strip()) != 0:
                # each non-empty line holds one token and its tag, separated by a tab
                fields = line.split('\t')
                text = fields[0].lower()
                tag = fields[1].strip()
                tokens.append(text)
                tags.append(tag)
            else:
                # a blank line marks the end of the current sentence
                dicts['token'] = tokens  # this is the sentence I want to map into integers
                dicts['tag'] = tags
                output.append(dicts)
                dicts = {}
                tokens = []
                tags = []

    df = pd.DataFrame(output)

    df.head(10)
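For reference, here is a minimal sketch of the same parsing logic as a standalone function. It assumes the file is tab-separated with one token and one tag per line, and a blank line between sentences. The name `read_sentences` and the inline sample are hypothetical, not part of my real data. Unlike the loop above, it also keeps the last sentence when the file does not end with a blank line, and it skips repeated blank lines.

```python
import io


def read_sentences(lines):
    """Group tab-separated token/tag lines into one dict per sentence."""
    sentences, tokens, tags = [], [], []
    for line in lines:
        if line.strip():
            # non-empty line: "<token>\t<tag>"
            token, tag = line.rstrip("\n").split("\t")[:2]
            tokens.append(token.lower())
            tags.append(tag.strip())
        elif tokens:
            # blank line: close the current sentence, if there is one
            sentences.append({"token": tokens, "tag": tags})
            tokens, tags = [], []
    if tokens:
        # flush the last sentence when there is no trailing blank line
        sentences.append({"token": tokens, "tag": tags})
    return sentences


# Hypothetical two-sentence sample standing in for train_preprocess.txt.
sample = io.StringIO("Kartu\tB\nkredit\tI\n\nbca\tB\nribet\tO\n")
parsed = read_sentences(sample)
print(parsed)
```

The returned list can be passed to `pd.DataFrame(...)` in the same way as `output`.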
I have converted the vocabulary list (from the vocab file) into a list of integers:
    import numpy as np

    with open("vocab_uncased.txt", "r") as my_file:
        data = my_file.read()

    # one vocabulary entry per line
    data_into_list = data.split("\n")
    print(data_into_list)

    # each entry is encoded as its position among the distinct entries
    unique_entries = np.array(list(dict.fromkeys(data_into_list)))
    encoded_string = [np.where(unique_entries == e)[0][0] for e in data_into_list]
    print(encoded_string)
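As a side note, the same encoding can be sketched with a plain dictionary, which avoids scanning the whole array for every entry. This is only an illustration: `vocab_lines` is a hypothetical mini vocabulary standing in for the lines of `vocab_uncased.txt`, and `token_to_id` is a name I made up for the lookup table.

```python
# Hypothetical mini vocabulary standing in for the lines of vocab_uncased.txt.
vocab_lines = ["bca", "kartu", "kredit", "ribet", "kartu"]

# dict.fromkeys keeps the first occurrence of each entry, in order,
# so every distinct token gets the position of its first appearance.
token_to_id = {tok: i for i, tok in enumerate(dict.fromkeys(vocab_lines))}

# Encode every entry by dictionary lookup.
encoded = [token_to_id[tok] for tok in vocab_lines]
print(encoded)  # [0, 1, 2, 3, 1]
```

With this layout, a repeated entry such as the second "kartu" gets the same integer as its first occurrence, which matches what the `np.where(...)[0][0]` expression does.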
What I want to do is put the encoded sentences into the DataFrame above. How can I do that? Example:
    sentence (in the token field of the DataFrame): ['Setelah', 'melalui', 'proses', 'telepon', 'yang', 'panjang', 'tutup', 'sudah', 'kartu', 'kredit', 'bca', 'ribet']
    encoded sentence (using the vocab file): [2024, 1317, 1806, 2182, 2400, 1624, 2333, 2107, 1013, 1155, 317, 1853] --> to be put into a new DataFrame column
Answer
IIUC:
    df = pd.DataFrame(output)
    # lookup table: vocabulary string -> integer code
    vocab = pd.Series(encoded_string, index=data_into_list)

    # one row per token, map each token to its code, then regroup per sentence
    df['encoded'] = (df.explode(df.columns.tolist())['token']
                       .map(vocab).groupby(level=0).agg(list))
Output:
    >>> df
         token                                           tag                                            encoded
    0    [setelah, melalui, proses, telepon, yang, panj  [O, B, B, I, O, O, B, O, B, I, I, B]           [2024, 1317, 1806, 2182, 2400, 1624, 2333, 210...
    1    [@halobca, saya, mencoba, mengakses, menu, m-b  [B, O, O, B, B, I, O, O, O, B, I, O, O, O, O,  [130, 1917, 1374, 1403, 1470, 1240, 1917, 1545...
    2    [hanya, saya, atau, @halobca, klikbca, bisnis,  [O, O, O, B, B, I, O, B]                       [857, 1917, 249, 130, 1130, 439, 1332, 767]
    3    [teller, bank, bca, ini, menanyakan, kabar, sa  [O, O, O, O, O, O, O, B, O]                    [2190, 288, 317, 918, 1365, 983, 1917, 2081, 1...
    4    [bca, senantiasa, menjaga, rahasia, data, cust  [B, O, B, B, B, I]                             [317, 1983, 1458, 1824, 575, 551]
    ..
    794  [hi, cs, kenapa, pelayanan, di, bca, kodya, te  [O, B, O, B, O, B, I, I, I, I, I, I, O, B, O,  [873, 540, 1077, 1657, 598, 317, 1136, 2175, 2...
    795  [walau, sudah, prioritas, tetap, saja, antreny  [O, O, B, O, O, B, B, O, O, B, O, O, O]        [2374, 2107, 1791, 2281, 1885, 231, 1183, 282,
    796  [selama, menggunakan, layanan, e-channel, bca,  [O, B, O, B, I, O, O, O, B, I, B, I, B, B, B,  [1966, 1427, 1198, 746, 317, 1520, 2288, 1341,
    797  [mau, menabung, mau, simpan, uang, atau, pun,   [O, B, O, B, B, O, O, B, B, I, O, O, O, B]     [1306, 1361, 1306, 2055, 2335, 249, 1817, 1491...
    798  [toko, daring, juga, kebanyakan, pakai, bca, m  [B, I, O, O, B, I, I, O, O, O, B, B, I, I, B,  [2297, 569, 976, 1037, 1609, 317, 1238, 258, 1...

    [799 rows x 3 columns]
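One caveat with the `explode`/`map` approach above: a token that is missing from the vocabulary maps to NaN, which turns the integers in that sentence into floats. If that can happen in your data, a dictionary lookup with a fallback value may be easier to control. The following is only a sketch under assumptions: `token_to_id`, the tiny DataFrame, and the `UNKNOWN_ID` placeholder are hypothetical stand-ins, not values from the question's files.

```python
import pandas as pd

# Hypothetical stand-ins for the vocabulary and the 'token' column.
token_to_id = {"bca": 0, "kartu": 1, "kredit": 2, "ribet": 3}
df = pd.DataFrame({"token": [["kartu", "kredit", "bca"], ["bca", "mahal"]]})

UNKNOWN_ID = -1  # assumed placeholder for tokens missing from the vocabulary

# Encode each sentence with a per-token dictionary lookup.
df["encoded"] = df["token"].apply(
    lambda tokens: [token_to_id.get(tok, UNKNOWN_ID) for tok in tokens]
)
print(df["encoded"].tolist())  # [[1, 2, 0], [0, -1]]
```

With the question's data, `token_to_id` could be built as `dict(zip(data_into_list, encoded_string))`.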