I’m currently working on a project and need to add specific rows whenever a tagged sentence ends. Whenever the ‘N’ column equals 1, a new sentence has started. I want to add two rows for each sentence: a row with ‘Pos’ = START at the beginning of the sentence, and a row with ‘Pos’ = END at the end of each sentence. This is what the DataFrame looks like:
import pandas as pd

POSTAG = {
    'N': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,10,11,1,2,3,4,5,6,7,8,9],
    'Name': ['ἐρᾷ','μὲν','ἁγνὸς','οὐρανὸς','τρῶσαι','χθόνα',',','ἔρως','δὲ','γαῖαν','λαμβάνει','γάμου','τυχεῖν','.','ὄμβρος','δ̓','ἀπ̓','εὐνάοντος','οὐρανοῦ','πεσὼν','ἔκυσε','γαῖαν','.','ἡ','δὲ','τίκτεται','βροτοῖς','μήλων','τε','βοσκὰς','καὶ','βίον','Δημήτριον','.','δενδρῶτις','ὥρα','δ̓','ἐκ','νοτίζοντος','γάμου','τέλειος','ἐστί','.'],
    'Pos': ['VERB','ADV','ADJ','NOUN','VERB','NOUN','PUNCT','NOUN','CCONJ','NOUN','VERB','NOUN','VERB','PUNCT','NOUN','ADV','ADP','ADJ','NOUN','VERB','VERB','NOUN','PUNCT','DET','ADV','VERB','NOUN','NOUN','ADV','NOUN','CCONJ','NOUN','ADJ','PUNCT','NOUN','NOUN','ADV','ADP','VERB','NOUN','ADJ','VERB','PUNCT'],
}
df = pd.DataFrame(POSTAG, columns=['N', 'Name', 'Pos'])
print(df)
In this case I need a [NaN, NaN, START] row at indexes 0 and 15, and a [NaN, NaN, END] row at index 14. I need to do this for my whole df. How could I do this?
Answer
Looking at your dataframe, I assume you want to insert START before each value 1 in column N and insert END after the last value of each consecutive run in column N. If so, you could do the following.
First, create two dummy dataframes, start_df and end_df:
import numpy as np

start_df = pd.DataFrame({'N': [np.nan], 'Name': [np.nan], 'Pos': ['->START']})
end_df = pd.DataFrame({'N': [np.nan], 'Name': [np.nan], 'Pos': ['END<-']})
Then split the dataframe into groups wherever the run of consecutive values in column N breaks:
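# True wherever N does not increase by exactly 1, i.e. at the first row of each sentence
mask = ~df['N'].diff().fillna(0).eq(1)
# The cumulative sum of the mask gives each sentence its own group label
gb = df.groupby(mask.cumsum())
groups = [gb.get_group(x) for x in gb.groups]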
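To see why this mask splits the rows into sentences, here is a minimal sketch on a toy N column. The values are made up for illustration and are not taken from the question's data.

```python
import pandas as pd

# Toy 'N' column (hypothetical values): two sentences of lengths 3 and 2.
n = pd.Series([1, 2, 3, 1, 2])

# diff() is NaN on the first row and 1 inside a run;
# any other value means the counter restarted, so a new sentence begins.
mask = ~n.diff().fillna(0).eq(1)

# Summing the boolean mask cumulatively yields one integer label per sentence.
sentence_id = mask.cumsum()

print(mask.tolist())         # [True, False, False, True, False]
print(sentence_id.tolist())  # [1, 1, 1, 2, 2]
```

Since the question states that N equals 1 at the start of every sentence, `n.eq(1).cumsum()` should give the same labels for data of that shape.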
Next, insert the dummy dataframes before and after each group:
res = []
for group in groups:
    res.append(start_df)
    res.append(group)
    res.append(end_df)
Finally, create the dataframe by concatenating the dataframes in the list:
df_ = pd.concat(res).reset_index(drop=True)
# print(df_)
       N  Name        Pos
0    NaN  NaN         ->START
1    1.0  ἐρᾷ         VERB
2    2.0  μὲν         ADV
3    3.0  ἁγνὸς       ADJ
4    4.0  οὐρανὸς     NOUN
5    5.0  τρῶσαι      VERB
6    6.0  χθόνα       NOUN
7    7.0  ,           PUNCT
8    8.0  ἔρως        NOUN
9    9.0  δὲ          CCONJ
10  10.0  γαῖαν       NOUN
11  11.0  λαμβάνει    VERB
12  12.0  γάμου       NOUN
13  13.0  τυχεῖν      VERB
14  14.0  .           PUNCT
15   NaN  NaN         END<-
16   NaN  NaN         ->START
17   1.0  ὄμβρος      NOUN
18   2.0  δ̓           ADV
19   3.0  ἀπ̓          ADP
20   4.0  εὐνάοντος   ADJ
21   5.0  οὐρανοῦ     NOUN
22   6.0  πεσὼν       VERB
23   7.0  ἔκυσε       VERB
24   8.0  γαῖαν       NOUN
25   9.0  .           PUNCT
26   NaN  NaN         END<-
27   NaN  NaN         ->START
28   1.0  ἡ           DET
29   2.0  δὲ          ADV
30   3.0  τίκτεται    VERB
31   4.0  βροτοῖς     NOUN
32   5.0  μήλων       NOUN
33   6.0  τε          ADV
34   7.0  βοσκὰς      NOUN
35   8.0  καὶ         CCONJ
36   9.0  βίον        NOUN
37  10.0  Δημήτριον   ADJ
38  11.0  .           PUNCT
39   NaN  NaN         END<-
40   NaN  NaN         ->START
41   1.0  δενδρῶτις   NOUN
42   2.0  ὥρα         NOUN
43   3.0  δ̓           ADV
44   4.0  ἐκ          ADP
45   5.0  νοτίζοντος  VERB
46   6.0  γάμου       NOUN
47   7.0  τέλειος     ADJ
48   8.0  ἐστί        VERB
49   9.0  .           PUNCT
50   NaN  NaN         END<-
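For reuse, the steps above could be wrapped in one helper. This is only a sketch: the function name `add_sentence_markers`, its `start` and `end` parameters, and the small sample frame are hypothetical and not part of the original answer. It assumes the dataframe has the columns N, Name and Pos and at least one row.

```python
import numpy as np
import pandas as pd


def add_sentence_markers(df, start="->START", end="END<-"):
    """Wrap every run of consecutive 'N' values in a START row and an END row."""
    start_df = pd.DataFrame({"N": [np.nan], "Name": [np.nan], "Pos": [start]})
    end_df = pd.DataFrame({"N": [np.nan], "Name": [np.nan], "Pos": [end]})

    # A new sentence begins wherever 'N' does not increase by exactly 1.
    mask = ~df["N"].diff().fillna(0).eq(1)

    pieces = []
    for _, group in df.groupby(mask.cumsum()):
        pieces.extend([start_df, group, end_df])

    # Stack all pieces and renumber the index from 0.
    return pd.concat(pieces).reset_index(drop=True)


# Small made-up sample (not the original data): sentences of lengths 2 and 3.
sample = pd.DataFrame({
    "N": [1, 2, 1, 2, 3],
    "Name": ["a", "b", "c", "d", "e"],
    "Pos": ["NOUN", "VERB", "DET", "NOUN", "PUNCT"],
})
out = add_sentence_markers(sample)
print(out)
```

On the question's data, `add_sentence_markers(df)` should reproduce the `df_` shown above. Note that the N column becomes float because of the NaN marker rows.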