I have a VTT file that looks like this:
WEBVTT

1
00:00:05.210 --> 00:00:07.710
In this lecture, we're going to talk about

2
00:00:07.710 --> 00:00:10.815
pattern matching in strings using regular expressions.

3
00:00:10.815 --> 00:00:13.139
Regular expressions or regexes

4
00:00:13.139 --> 00:00:15.825
are written in a condensed formatting language.
I want to extract the fragments from the file and merge them into sentences. The output should look something like this:
["In this lecture, we're going to talk about pattern matching in strings using regular expressions.", 'Regular expressions or regexes are written in a condensed formatting language.']
I am able to extract the fragments using this:
pattern = r"[A-z0-9 ,.*?='\";\n\-/%$#@!()]+"
content = [i for i in re.findall(pattern, text) if (re.search('[a-zA-Z]', i))]
I am not sure how to extract entire sentences instead of fragments.
Also note that this is just a sample of the VTT file. The entire file contains around 630 fragments, and some of the fragments also contain integers and other special characters.
Any help is appreciated.
Answer
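Using re.sub, we can first remove the unwanted repetitive text, that is, the cue numbers and timestamp lines. A second replacement then turns the remaining newlines into single spaces, and re.findall splits the result into sentences at each period: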
import re

inp = """1
00:00:05.210 --> 00:00:07.710
In this lecture, we're going to talk about

2
00:00:07.710 --> 00:00:10.815
pattern matching in strings using regular expressions.

3
00:00:10.815 --> 00:00:13.139
Regular expressions or regexes

4
00:00:13.139 --> 00:00:15.825
are written in a condensed formatting language."""

# Remove each cue number line together with its timestamp line.
output = re.sub(r'(?:^|\r?\n)\d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}\r?\n', '', inp)
# Replace the remaining newlines with single spaces.
output = re.sub(r'\r?\n', ' ', output)
# Lazily match up to each period to split the text into sentences.
sentences = re.findall(r'(.*?\.)\s*', output)
print(sentences)
This prints:
["In this lecture, we're going to talk about pattern matching in strings using regular expressions.", 'Regular expressions or regexes are written in a condensed formatting language.']
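Since the full file has around 630 fragments, some containing integers and special characters, splitting on every period may cut a sentence in the middle of a decimal such as 3.14. The following is only a sketch of a line-based alternative, not a tested solution for the real file. The helper name vtt_to_sentences and the TIMING pattern are made up for illustration. It assumes timestamps always use the hh:mm:ss.mmm form shown in the sample, and it treats any line consisting only of digits as a cue number, so a caption line that is just a number would be dropped. Abbreviations such as "e.g." followed by a space would still be split.

```python
import re

# Matches a cue timing line such as "00:00:05.210 --> 00:00:07.710".
# Assumption: the hours field is always present, as in the sample.
TIMING = re.compile(r'\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}')


def vtt_to_sentences(text):
    """Hypothetical helper: merge VTT caption fragments into sentences."""
    fragments = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, the WEBVTT header, cue numbers, and timing lines.
        if not line or line == 'WEBVTT' or line.isdigit() or TIMING.match(line):
            continue
        fragments.append(line)
    merged = ' '.join(fragments)
    # Split after ., ! or ? only when whitespace follows,
    # so a decimal like "3.14" stays in one piece.
    return re.split(r'(?<=[.!?])\s+', merged)
```

Calling vtt_to_sentences(text) on the contents of the file returns a list of sentence strings. Unlike the re.findall approach, any trailing text that does not end in a period is kept as the last item rather than discarded.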