I have a VTT file like the following:
WEBVTT

1
00:00:05.210 --> 00:00:07.710
In this lecture, we're
going to talk about

2
00:00:07.710 --> 00:00:10.815
pattern matching in strings
using regular expressions.

3
00:00:10.815 --> 00:00:13.139
Regular expressions or regexes

4
00:00:13.139 --> 00:00:15.825
are written in a condensed
formatting language.
I want to extract the text fragments from the file and merge them into sentences. The output should look something like this:
["In this lecture, we're going to talk about pattern matching in strings using regular expressions.", 'Regular expressions or regexes are written in a condensed formatting language.']
I am able to extract the fragments using this:
pattern = r"[A-z0-9 ,.*?='\";\n\-\/%$#@!()]+"

content = [i for i in re.findall(pattern, text) if (re.search('[a-zA-Z]', i))]
I am not sure how to extract entire sentences instead of fragments.
Note that this is only a sample of the VTT file. The full file contains around 630 fragments, and some of them also contain integers and other special characters.
Any help is appreciated.
Answer
Using re.sub, we can first remove the unwanted cue numbers and timestamp lines, then do a second replacement to turn the remaining newlines into single spaces. A final re.findall splits the result into sentences:
import re

inp = """1
00:00:05.210 --> 00:00:07.710
In this lecture, we're
going to talk about

2
00:00:07.710 --> 00:00:10.815
pattern matching in strings
using regular expressions.

3
00:00:10.815 --> 00:00:13.139
Regular expressions or regexes

4
00:00:13.139 --> 00:00:15.825
are written in a condensed
formatting language."""

# Remove each cue number together with the timestamp line that follows it.
output = re.sub(r'(?:^|\r?\n)\d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}\r?\n', '', inp)
# Join the remaining text lines with single spaces.
output = re.sub(r'\r?\n', ' ', output)
# Split on full stops: each match runs up to and including the next period.
sentences = re.findall(r'(.*?\.)\s*', output)
print(sentences)
This prints:
["In this lecture, we're going to talk about pattern matching in strings using regular expressions.",
 'Regular expressions or regexes are written in a condensed formatting language.']
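The input string in the answer starts at the first cue, whereas the file in the question begins with a WEBVTT header and, per the question, some fragments contain integers. For that case a line-based variant may be easier to adapt. The following is only a minimal sketch, not the approach from the answer itself: the helper name vtt_to_sentences is hypothetical, and it assumes timestamps use the hh:mm:ss.ttt form shown in the sample, that cue identifiers are numeric lines sitting directly above a timing line, and that sentences end in ., ? or !.

```python
import re

# Timing line in the hh:mm:ss.ttt --> hh:mm:ss.ttt form used in the sample
# (assumption: other WebVTT timestamp forms do not occur in the file).
TIMING = re.compile(r"\d{2}:\d{2}:\d{2}\.\d{3}\s+-->\s+\d{2}:\d{2}:\d{2}\.\d{3}")


def vtt_to_sentences(text):
    """Merge WebVTT cue text into a list of sentences (hypothetical helper)."""
    lines = text.splitlines()
    fragments = []
    for i, line in enumerate(lines):
        stripped = line.strip()
        if not stripped or TIMING.match(stripped):
            continue  # blank line or timing line
        if i == 0 and stripped.startswith("WEBVTT"):
            continue  # file header on the first line
        next_line = lines[i + 1].strip() if i + 1 < len(lines) else ""
        if stripped.isdigit() and TIMING.match(next_line):
            continue  # numeric cue identifier directly above a timing line
        fragments.append(stripped)
    joined = " ".join(fragments)
    # Split after ., ? or ! when followed by whitespace.
    return [s for s in re.split(r"(?<=[.?!])\s+", joined) if s]


sample = """WEBVTT

1
00:00:05.210 --> 00:00:07.710
In this lecture, we're
going to talk about

2
00:00:07.710 --> 00:00:10.815
pattern matching in strings
using regular expressions.
"""
print(vtt_to_sentences(sample))
```

Because a numeric line is dropped only when a timing line follows it, a caption line that happens to be a bare number is kept. Splitting on punctuation is a heuristic, so abbreviations such as "e.g." followed by a space would still be treated as sentence ends.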