I have a DataFrame with two columns: Content, which contains the text, and Coords, which is a list of tuples. Each tuple contains the meta information for one word of the text.
0,Category Procedure Date Results,"[('159', '384', 'bold', '40', '169', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1'), ('476', '382', 'bold', '32', '188', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1'), ('1091', '384', 'bold', '30', '84', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1'), ('1406', '382', 'bold', '32', '129', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1')]"
1,Echo/MUGA 2-D Echocardiogram 6/13/2018 Done at Hospital,"[('161', '437', None, '37', '222', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1'), ('474', '439', None, '30', '64', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1'), ('556', '437', None, '42', '289', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1'), ('1088', '439', None, '35', '186', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1'), ('1405', '439', None, '30', '93', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1'), ('1513', '441', None, '28', '31', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1'), ('1562', '437', None, '42', '142', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1')]"
I want to split each row so that every resulting row holds one word, its corresponding tuple, and the line number as a reference. Example:
LineNo  Content    Coords
1       Category   ('159', '384', 'bold', '40', '169', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1')
1       Procedure  ('476', '382', 'bold', '32', '188', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1')
....
2       at         ('1513', '441', None, '28', '31', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1')
2       Hospital   ('1562', '437', None, '42', '142', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1')
Answer
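from ast import literal_eval
import pandas as pd

# The sample file has no header row, so name the columns explicitly.
df = pd.read_csv('<your csv file>', names=['LineNo', 'Content', 'Coords'])
# Coords is read as a string; parse it into a real list of tuples.
df['Coords'] = df['Coords'].apply(literal_eval)
# One row per tuple; Content is repeated on each of those rows.
df = df.explode('Coords')
# Within each line, replace the repeated text with its words, one per row.
df['Content'] = df.groupby('LineNo')['Content'].transform(lambda x: x.iloc[0].split())
print(df)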
Prints:
   LineNo         Content  Coords
0       0        Category  (159, 384, bold, 40, 169, StyleId-E6BF91A3-3D6...
0       0       Procedure  (476, 382, bold, 32, 188, StyleId-E6BF91A3-3D6...
0       0            Date  (1091, 384, bold, 30, 84, StyleId-E6BF91A3-3D6...
0       0         Results  (1406, 382, bold, 32, 129, StyleId-E6BF91A3-3D...
1       1       Echo/MUGA  (161, 437, None, 37, 222, StyleId-E6BF91A3-3D6...
1       1             2-D  (474, 439, None, 30, 64, StyleId-E6BF91A3-3D6A...
1       1  Echocardiogram  (556, 437, None, 42, 289, StyleId-E6BF91A3-3D6...
1       1       6/13/2018  (1088, 439, None, 35, 186, StyleId-E6BF91A3-3D...
1       1            Done  (1405, 439, None, 30, 93, StyleId-E6BF91A3-3D6...
1       1              at  (1513, 441, None, 28, 31, StyleId-E6BF91A3-3D6...
1       1        Hospital  (1562, 437, None, 42, 142, StyleId-E6BF91A3-3D...
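
This approach pairs words and tuples purely by position, so it relies on every Content string having exactly as many whitespace-separated words as Coords has tuples. A minimal sketch of that pairing in plain Python, with an explicit length check, might look like the following. The function name `split_rows` and the shortened sample tuples (only the first three fields of each tuple) are illustrative assumptions, not part of the original data or answer.

```python
from ast import literal_eval


def split_rows(records):
    """Pair each word with its tuple by position.

    `records` is an iterable of (line_no, content, coords_text) triples, where
    coords_text is the string form of a list of tuples, as stored in the CSV.
    (Hypothetical helper, written only to illustrate the pairing.)
    """
    rows = []
    for line_no, content, coords_text in records:
        words = content.split()
        coords = literal_eval(coords_text)  # safely parse the list of tuples
        # Positional pairing only makes sense if the counts agree.
        if len(words) != len(coords):
            raise ValueError(
                f"line {line_no}: {len(words)} words but {len(coords)} tuples"
            )
        for word, coord in zip(words, coords):
            rows.append({"LineNo": line_no, "Content": word, "Coords": coord})
    return rows


# Shortened sample: tuples are truncated to three fields for readability.
sample = [
    (0, "Category Procedure",
     "[('159', '384', 'bold'), ('476', '382', 'bold')]"),
    (1, "Done at Hospital",
     "[('1405', '439', None), ('1513', '441', None), ('1562', '437', None)]"),
]

rows = split_rows(sample)
for row in rows:
    print(row["LineNo"], row["Content"], row["Coords"])
```

Passing `rows` to `pd.DataFrame` should give a frame with the same three columns. The length check is the main design point: a line whose text contains fewer or more words than it has tuples is reported by line number instead of being silently misaligned.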