How to reindex a dataframe after splitting a row with respect to a column?

I have a dataframe with two columns: Content, which contains the text, and Coords, which is a list of tuples. Each tuple contains the meta info for one word of the text.

0,Category Procedure Date Results,"[('159', '384', 'bold', '40', '169', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1'), ('476', '382', 'bold', '32', '188', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1'), ('1091', '384', 'bold', '30', '84', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1'), ('1406', '382', 'bold', '32', '129', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1')]"
1,Echo/MUGA 2-D Echocardiogram 6/13/2018 Done at Hospital,"[('161', '437', None, '37', '222', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1'), ('474', '439', None, '30', '64', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1'), ('556', '437', None, '42', '289', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1'), ('1088', '439', None, '35', '186', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1'), ('1405', '439', None, '30', '93', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1'), ('1513', '441', None, '28', '31', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1'), ('1562', '437', None, '42', '142', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1')]"

I want to split each row so that every resulting row holds one word, its corresponding tuple, and the line number to act as a reference. Example:

LineNo  Content    Coords
1       Category   ('159', '384', 'bold', '40', '169', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1')
1       Procedure  ('476', '382', 'bold', '32', '188', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1')
....
2       at         ('1513', '441', None, '28', '31', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1')
2       Hospital   ('1562', '437', None, '42', '142', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1')

Answer

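from ast import literal_eval

import pandas as pd

df = pd.read_csv('<your csv file>', names=['LineNo', 'Content', 'Coords'])

# Coords is read as a string; parse it back into a list of tuples
df['Coords'] = df['Coords'].apply(literal_eval)
# one row per tuple; LineNo and Content are repeated for each
df = df.explode('Coords')
# give each row of a line its own word (assumes as many words as tuples per line)
df['Content'] = df.groupby('LineNo')['Content'].transform(lambda x: x.iloc[0].split())

print(df)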

Prints:

   LineNo         Content                                             Coords
0       0        Category  (159, 384, bold, 40, 169, StyleId-E6BF91A3-3D6...
0       0       Procedure  (476, 382, bold, 32, 188, StyleId-E6BF91A3-3D6...
0       0            Date  (1091, 384, bold, 30, 84, StyleId-E6BF91A3-3D6...
0       0         Results  (1406, 382, bold, 32, 129, StyleId-E6BF91A3-3D...
1       1       Echo/MUGA  (161, 437, None, 37, 222, StyleId-E6BF91A3-3D6...
1       1             2-D  (474, 439, None, 30, 64, StyleId-E6BF91A3-3D6A...
1       1  Echocardiogram  (556, 437, None, 42, 289, StyleId-E6BF91A3-3D6...
1       1       6/13/2018  (1088, 439, None, 35, 186, StyleId-E6BF91A3-3D...
1       1            Done  (1405, 439, None, 30, 93, StyleId-E6BF91A3-3D6...
1       1              at  (1513, 441, None, 28, 31, StyleId-E6BF91A3-3D6...
1       1        Hospital  (1562, 437, None, 42, 142, StyleId-E6BF91A3-3D...
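
Note that after explode the index labels repeat (0, 0, 0, 0, 1, ...), which is what the title's "reindex" part is about. One possible variant, sketched here under the assumption of pandas 1.3 or later (where explode accepts a list of columns), splits the words first, explodes both columns together, and then rebuilds a clean index. The inline data is a shortened stand-in for the CSV, and the names `mismatched` and `out` are only illustrative:

```python
import pandas as pd

# Shortened stand-in for the question's data (assumption: Coords is already
# parsed into lists of tuples, e.g. with ast.literal_eval as above).
df = pd.DataFrame({
    "LineNo": [0, 1],
    "Content": ["Category Procedure", "Done at Hospital"],
    "Coords": [
        [("159", "384", "bold"), ("476", "382", "bold")],
        [("1405", "439", None), ("1513", "441", None), ("1562", "437", None)],
    ],
})

# Split the text into words so both columns hold one list per row.
df["Content"] = df["Content"].str.split()

# Guard: exploding several columns at once needs equal list lengths per row.
mismatched = df["Content"].map(len) != df["Coords"].map(len)
if mismatched.any():
    raise ValueError(f"word/tuple count differs in rows: {list(df.index[mismatched])}")

# Explode both columns together, then replace the repeated labels with 0..n-1.
out = df.explode(["Content", "Coords"]).reset_index(drop=True)

# Optional: 1-based line numbers, as in the question's expected output.
out["LineNo"] = out["LineNo"] + 1

print(out)
```

The length check is there because a line whose word count differs from its tuple count (for example, a word containing a space in the source document) would otherwise fail inside explode with a less specific error.
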
User contributions licensed under: CC BY-SA