I have a txt file for a transcript. Example content:
Travis de Ronde: What I guess was the largest challenge where should we start Travis de Ronde: Here on this piece. Tamil Ramasamy: I think we can talk about Tamil Ramasamy: Investing can cover that later maybe Ashwin Dora: This isn't other Tamil Ramasamy: Other software. I mean, other. So this is a big problem. Travis de Ronde: Okay, so what, what was the issue is dynamo dB. Ashwin Dora: So dynamo DB. So we decided dynamo dB to be our database. Ashwin Dora: Design for entire call center posit Ashwin Dora: Because dynamo dB is really cool when it comes to accessing the data and like
I would like to write some python code that will give the following output:
Travis de Ronde: What I guess was the largest challenge where should we start Here on this piece. Tamil Ramasamy: I think we can talk about Investing can cover that later maybe. Ashwin Dora: This isn't other Tamil Ramasamy: Other software. I mean, other. So this is a big problem. Travis de Ronde: Okay, so what, what was the issue is dynamo dB. Ashwin Dora: So dynamo DB. So we decided dynamo dB to be our database. Design for entire call center posit Because dynamo dB is really cool when it comes to accessing the data and like
So if Travis de Ronde is talking, for example, I want all of his dialogue to be on one “line” under his name until he is finished speaking or another speaker begins talking.
Advertisement
Answer
This is a very good job for itertools.groupby
, not regular expressions:
data = """ Travis de Ronde: What I guess was the largest challenge where should we start Travis de Ronde: Here on this piece. Tamil Ramasamy: I think we can talk about Tamil Ramasamy: Investing can cover that later maybe Ashwin Dora: This isn't other Tamil Ramasamy: Other software. I mean, other. So this is a big problem. Travis de Ronde: Okay, so what, what was the issue is dynamo dB. Ashwin Dora: So dynamo DB. So we decided dynamo dB to be our database. Ashwin Dora: Design for entire call center posit Ashwin Dora: Because dynamo dB is really cool when it comes to accessing the data and like """ from itertools import groupby gen = (line for line in data.split("n") if line) for speaker, text in groupby(gen, lambda line: line.split(": ")[0]): text = " ".join([x[len(speaker)+2:] for x in text]) output = "{}: {}".format(speaker, text) print(output)
This yields
Travis de Ronde: What I guess was the largest challenge where should we start Here on this piece. Tamil Ramasamy: I think we can talk about Investing can cover that later maybe Ashwin Dora: This isn't other Tamil Ramasamy: Other software. I mean, other. So this is a big problem. Travis de Ronde: Okay, so what, what was the issue is dynamo dB. Ashwin Dora: So dynamo DB. So we decided dynamo dB to be our database. Design for entire call center posit Because dynamo dB is really cool when it comes to accessing the data and like