I have a dataframe with two columns: Content, which contains the text, and Coords, which is a list of tuples. Each tuple contains the meta info for one word of the text.
0,Category Procedure Date Results,"[('159', '384', 'bold', '40', '169', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1'), ('476', '382', 'bold', '32', '188', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1'), ('1091', '384', 'bold', '30', '84', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1'), ('1406', '382', 'bold', '32', '129', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1')]"
1,Echo/MUGA 2-D Echocardiogram 6/13/2018 Done at Hospital,"[('161', '437', None, '37', '222', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1'), ('474', '439', None, '30', '64', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1'), ('556', '437', None, '42', '289', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1'), ('1088', '439', None, '35', '186', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1'), ('1405', '439', None, '30', '93', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1'), ('1513', '441', None, '28', '31', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1'), ('1562', '437', None, '42', '142', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1')]"
I want to split each row so that every resulting row holds one word, its corresponding tuple, and the line number as a reference. Example:
LineNo Content Coords
1 Category ('159', '384', 'bold', '40', '169', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1')
1 Procedure ('476', '382', 'bold', '32', '188', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1')
.
2 at ('1513', '441', None, '28', '31', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1')
2 Hospital ('1562', '437', None, '42', '142', 'StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1')
Answer
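import pandas as pd
from ast import literal_eval

df = pd.read_csv('<your csv file>', names=['LineNo', 'Content', 'Coords'])

# Parse each Coords string into a real list of tuples
df['Coords'] = df['Coords'].apply(literal_eval)
# One row per tuple; LineNo and Content are repeated on each new row
df = df.explode('Coords')
# Within each line, replace the repeated sentence with its individual words
df['Content'] = df.groupby('LineNo')['Content'].transform(lambda x: x.iloc[0].split())

print(df)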
Prints:
LineNo Content Coords
0 0 Category (159, 384, bold, 40, 169, StyleId-E6BF91A3-3D6
0 0 Procedure (476, 382, bold, 32, 188, StyleId-E6BF91A3-3D6
0 0 Date (1091, 384, bold, 30, 84, StyleId-E6BF91A3-3D6
0 0 Results (1406, 382, bold, 32, 129, StyleId-E6BF91A3-3D
1 1 Echo/MUGA (161, 437, None, 37, 222, StyleId-E6BF91A3-3D6
1 1 2-D (474, 439, None, 30, 64, StyleId-E6BF91A3-3D6A
1 1 Echocardiogram (556, 437, None, 42, 289, StyleId-E6BF91A3-3D6
1 1 6/13/2018 (1088, 439, None, 35, 186, StyleId-E6BF91A3-3D
1 1 Done (1405, 439, None, 30, 93, StyleId-E6BF91A3-3D6
1 1 at (1513, 441, None, 28, 31, StyleId-E6BF91A3-3D6
1 1 Hospital (1562, 437, None, 42, 142, StyleId-E6BF91A3-3D
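A possible variation, offered only as a minimal sketch: split Content into a list of words first, then explode both columns together so each word stays paired with its tuple. This assumes pandas 1.3 or later (where DataFrame.explode accepts a list of columns) and that every row has exactly as many words as tuples. The small frame below is a shortened, hypothetical stand-in for the CSV, with 'font1' replacing the full style id.

```python
import pandas as pd
from ast import literal_eval

# Hypothetical, shortened stand-in for the CSV; Coords is still a string,
# as it would be straight after pd.read_csv.
df = pd.DataFrame({
    'LineNo': [0, 1],
    'Content': ['Category Procedure', 'Done at Hospital'],
    'Coords': [
        "[('159', '384', 'bold', '40', '169', 'font1'), "
        "('476', '382', 'bold', '32', '188', 'font1')]",
        "[('1405', '439', None, '30', '93', 'font1'), "
        "('1513', '441', None, '28', '31', 'font1'), "
        "('1562', '437', None, '42', '142', 'font1')]",
    ],
})

# Turn each Coords string into a list of tuples, and each sentence into a list of words
df['Coords'] = df['Coords'].apply(literal_eval)
df['Content'] = df['Content'].str.split()

# Explode both list columns in lockstep; pandas raises ValueError
# if a row's word count and tuple count differ.
out = df.explode(['Content', 'Coords'], ignore_index=True)
print(out)
```

Because both columns are exploded in one call, a row whose word count does not match its tuple count fails loudly instead of being silently misaligned. Whether that is preferable depends on how clean the data is.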