Grouping speaker dialogue in a written transcript

I have a .txt file containing a transcript. Example content:

Travis de Ronde: What I guess was the largest challenge where should we start
Travis de Ronde: Here on this piece.
Tamil Ramasamy: I think we can talk about
Tamil Ramasamy: Investing can cover that later maybe
Ashwin Dora: This isn't other
Tamil Ramasamy: Other software. I mean, other. So this is a big problem.
Travis de Ronde: Okay, so what, what was the issue is dynamo dB.
Ashwin Dora: So dynamo DB. So we decided dynamo dB to be our database.
Ashwin Dora: Design for entire call center posit
Ashwin Dora: Because dynamo dB is really cool when it comes to accessing the data and like

I would like to write some Python code that produces the following output:

Travis de Ronde: What I guess was the largest challenge where should we start Here on this piece.
Tamil Ramasamy: I think we can talk about Investing can cover that later maybe
Ashwin Dora: This isn't other
Tamil Ramasamy: Other software. I mean, other. So this is a big problem.
Travis de Ronde: Okay, so what, what was the issue is dynamo dB.
Ashwin Dora: So dynamo DB. So we decided dynamo dB to be our database. Design for entire call center posit Because dynamo dB is really cool when it comes to accessing the data and like

So if Travis de Ronde is talking, for example, I want all of his dialogue to be on one “line” under his name until he is finished speaking or another speaker begins talking.

Answer

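This is a good job for itertools.groupby rather than regular expressions. groupby collects consecutive items that share a key, so keying each line on the speaker's name merges every uninterrupted run of lines by the same speaker: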

data = """
Travis de Ronde: What I guess was the largest challenge where should we start
Travis de Ronde: Here on this piece.
Tamil Ramasamy: I think we can talk about
Tamil Ramasamy: Investing can cover that later maybe
Ashwin Dora: This isn't other
Tamil Ramasamy: Other software. I mean, other. So this is a big problem.
Travis de Ronde: Okay, so what, what was the issue is dynamo dB.
Ashwin Dora: So dynamo DB. So we decided dynamo dB to be our database.
Ashwin Dora: Design for entire call center posit
Ashwin Dora: Because dynamo dB is really cool when it comes to accessing the data and like
"""

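from itertools import groupby

# Drop blank lines, such as those at the start and end of the triple-quoted string.
lines = (line for line in data.split("\n") if line)

# groupby merges only consecutive lines that share a key, here the speaker name.
for speaker, group in groupby(lines, lambda line: line.split(": ")[0]):
    # Strip the "Speaker: " prefix from each line, then join what remains.
    text = " ".join(x[len(speaker) + 2:] for x in group)
    print("{}: {}".format(speaker, text))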

This yields

Travis de Ronde: What I guess was the largest challenge where should we start Here on this piece.
Tamil Ramasamy: I think we can talk about Investing can cover that later maybe
Ashwin Dora: This isn't other
Tamil Ramasamy: Other software. I mean, other. So this is a big problem.
Travis de Ronde: Okay, so what, what was the issue is dynamo dB.
Ashwin Dora: So dynamo DB. So we decided dynamo dB to be our database. Design for entire call center posit Because dynamo dB is really cool when it comes to accessing the data and like
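Since the question starts from a text file rather than a string, here is a minimal sketch of the same groupby idea wrapped in a function that accepts any iterable of lines, such as an open file object. The function name merge_speaker_lines and the sample list are made up for illustration, and the sketch assumes every dialogue line has the form "Speaker: text"; lines without ": " are simply skipped.

```python
from itertools import groupby


def merge_speaker_lines(lines):
    """Yield one 'Speaker: text' string per run of consecutive lines by the same speaker."""
    # Remove surrounding whitespace, including the trailing newline of file lines.
    stripped = (line.strip() for line in lines)
    # Split each line once into (speaker, ": ", text); skip lines that have no ": ".
    parsed = (line.partition(": ") for line in stripped if ": " in line)
    # groupby only merges adjacent items, so a speaker who returns later starts a new line.
    for speaker, group in groupby(parsed, key=lambda parts: parts[0]):
        yield "{}: {}".format(speaker, " ".join(text for _, _, text in group))


# Hypothetical sample standing in for the lines of a transcript file.
sample = [
    "Travis de Ronde: What I guess was the largest challenge where should we start\n",
    "Travis de Ronde: Here on this piece.\n",
    "\n",
    "Tamil Ramasamy: I think we can talk about\n",
    "Travis de Ronde: Okay, so what, what was the issue is dynamo dB.\n",
]
merged = list(merge_speaker_lines(sample))
for line in merged:
    print(line)
```

To use it on a real file, open the file and pass the file object in place of sample, since iterating over a file object yields its lines one at a time. Using partition instead of slicing by len(speaker) + 2 keeps the speaker and the text together from a single split, which is a design preference here rather than a requirement.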


Source: stackoverflow