I have a txt file for a transcript. Example content:
Travis de Ronde: What I guess was the largest challenge where should we start
Travis de Ronde: Here on this piece.
Tamil Ramasamy: I think we can talk about
Tamil Ramasamy: Investing can cover that later maybe
Ashwin Dora: This isn't other
Tamil Ramasamy: Other software. I mean, other. So this is a big problem.
Travis de Ronde: Okay, so what, what was the issue is dynamo dB.
Ashwin Dora: So dynamo DB. So we decided dynamo dB to be our database.
Ashwin Dora: Design for entire call center posit
Ashwin Dora: Because dynamo dB is really cool when it comes to accessing the data and like
I would like to write some Python code that gives the following output:
Travis de Ronde: What I guess was the largest challenge where should we start Here on this piece.
Tamil Ramasamy: I think we can talk about Investing can cover that later maybe.
Ashwin Dora: This isn't other
Tamil Ramasamy: Other software. I mean, other. So this is a big problem.
Travis de Ronde: Okay, so what, what was the issue is dynamo dB.
Ashwin Dora: So dynamo DB. So we decided dynamo dB to be our database. Design for entire call center posit Because dynamo dB is really cool when it comes to accessing the data and like
So if Travis de Ronde is talking, for example, I want all of his dialogue to be on one “line” under his name until he is finished speaking or another speaker begins talking.
Answer
This is a very good job for itertools.groupby, not regular expressions:
data = """
Travis de Ronde: What I guess was the largest challenge where should we start
Travis de Ronde: Here on this piece.
Tamil Ramasamy: I think we can talk about
Tamil Ramasamy: Investing can cover that later maybe
Ashwin Dora: This isn't other
Tamil Ramasamy: Other software. I mean, other. So this is a big problem.
Travis de Ronde: Okay, so what, what was the issue is dynamo dB.
Ashwin Dora: So dynamo DB. So we decided dynamo dB to be our database.
Ashwin Dora: Design for entire call center posit
Ashwin Dora: Because dynamo dB is really cool when it comes to accessing the data and like
"""

from itertools import groupby

# Skip the empty strings produced by the leading and trailing newlines.
gen = (line for line in data.split("\n") if line)
# groupby collects consecutive lines that share the same speaker prefix.
for speaker, text in groupby(gen, lambda line: line.split(": ")[0]):
    # Drop the "Speaker: " prefix from each line, then join what remains.
    text = " ".join([x[len(speaker)+2:] for x in text])
    output = "{}: {}".format(speaker, text)
    print(output)
This yields:
Travis de Ronde: What I guess was the largest challenge where should we start Here on this piece.
Tamil Ramasamy: I think we can talk about Investing can cover that later maybe
Ashwin Dora: This isn't other
Tamil Ramasamy: Other software. I mean, other. So this is a big problem.
Travis de Ronde: Okay, so what, what was the issue is dynamo dB.
Ashwin Dora: So dynamo DB. So we decided dynamo dB to be our database. Design for entire call center posit Because dynamo dB is really cool when it comes to accessing the data and like
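Since the question mentions a txt file rather than a string, the same idea can be wrapped in a function that accepts any iterable of lines. This is only a sketch: the name merge_speaker_turns is made up for illustration, and it assumes every non-blank line has the form "Speaker: text". It uses str.partition so the line is split only at the first ": ".

```python
from itertools import groupby


def merge_speaker_turns(lines):
    """Join consecutive lines by the same speaker into one 'Speaker: text' line.

    Hypothetical helper; assumes each non-blank line looks like 'Speaker: text'.
    """
    # Split each non-blank line once, at the first ": ", into
    # (speaker, separator, text).
    turns = (line.strip().partition(": ") for line in lines if line.strip())
    merged = []
    # groupby only groups adjacent items with equal keys, so a speaker who
    # comes back later in the transcript starts a new output line.
    for speaker, group in groupby(turns, key=lambda turn: turn[0]):
        text = " ".join(turn[2] for turn in group)
        merged.append(f"{speaker}: {text}")
    return merged


sample = [
    "Travis de Ronde: What I guess was the largest challenge where should we start",
    "Travis de Ronde: Here on this piece.",
    "Tamil Ramasamy: I think we can talk about",
    "Ashwin Dora: This isn't other",
    "Tamil Ramasamy: Other software. I mean, other. So this is a big problem.",
]
for line in merge_speaker_turns(sample):
    print(line)
```

Because an open file object iterates over its lines, passing one to the function (for example, one opened from a hypothetical transcript.txt) should work the same way as passing the list above.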