How to use regex to scrape entire sentences from fragments in Python

I have a VTT file like the following:

WEBVTT

1
00:00:05.210 --> 00:00:07.710
In this lecture, we're
going to talk about

2
00:00:07.710 --> 00:00:10.815
pattern matching in strings
using regular expressions.

3
00:00:10.815 --> 00:00:13.139
Regular expressions or regexes

4
00:00:13.139 --> 00:00:15.825
are written in a condensed
formatting language.

I want to extract the fragments from the file and merge them into sentences. The output should look something like this:

["In this lecture, we're going to talk about pattern matching in strings using regular expressions.", 'Regular expressions or regexes are written in a condensed formatting language.']

I am able to extract the fragments using this:

pattern = r"[A-z0-9 ,.*?='\";\n-/%$#@!()]+"

content = [i for i in re.findall(pattern, text) if (re.search('[a-zA-Z]', i))]

I am not sure how to extract entire sentences instead of fragments.

Also note that this is just a sample of the VTT file. The entire file contains around 630 fragments, and some of the fragments also contain integers and other special characters.

Any help is appreciated.

Answer

Using re.sub, we can first remove the repetitive cue numbers and timestamp lines, then do a second replacement that turns the remaining newlines into single spaces. Finally, re.findall splits the joined text into sentences at each period:

import re

inp = """1
00:00:05.210 --> 00:00:07.710
In this lecture, we're
going to talk about

2
00:00:07.710 --> 00:00:10.815
pattern matching in strings
using regular expressions.

3
00:00:10.815 --> 00:00:13.139
Regular expressions or regexes

4
00:00:13.139 --> 00:00:15.825
are written in a condensed
formatting language."""

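# Remove each cue number together with its timestamp line
output = re.sub(r'(?:^|\r?\n)\d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}\r?\n', '', inp)
# Join the remaining caption lines with single spaces
output = re.sub(r'\r?\n', ' ', output)
# Capture everything up to and including each period
sentences = re.findall(r'(.*?\.)\s*', output)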
print(sentences)

This prints:

["In this lecture, we're going to talk about pattern matching in strings using regular expressions.",
 'Regular expressions or regexes are written in a condensed formatting language.']
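
Since the full file starts with a WEBVTT header and some fragments contain numbers, the same steps could be wrapped in a small function along these lines. This is only a sketch under a few assumptions: the names vtt_to_sentences and CUE_HEADER are made up for illustration, timestamps are assumed to follow the hh:mm:ss.mmm format shown in the question with nothing after them on the line, and a sentence is assumed to end at ".", "!" or "?" followed by whitespace (so a decimal such as 3.14 is not split, but an abbreviation such as "e.g." still would be).

```python
import re

# Hypothetical names, not part of the original answer.
# Matches a cue number line followed by its timestamp line.
CUE_HEADER = re.compile(
    r'(?:^|\r?\n)\d+\r?\n'
    r'\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}\r?\n'
)

def vtt_to_sentences(text):
    """Merge VTT caption fragments into a list of sentences."""
    # Drop the "WEBVTT" header line if the text starts with one.
    text = re.sub(r'\AWEBVTT[^\n]*\r?\n', '', text)
    # Replace each cue number + timestamp pair with a single space.
    text = CUE_HEADER.sub(' ', text)
    # Collapse newlines and repeated spaces, then trim the ends.
    text = re.sub(r'\s+', ' ', text).strip()
    # Split after ., ! or ? only when whitespace follows,
    # which leaves decimals like 3.14 intact.
    return re.split(r'(?<=[.!?])\s+', text)
```

Calling vtt_to_sentences on the contents of the file (for example, the result of open(path).read(), where path is whatever your file is called) should return one string per sentence. Splitting with a lookbehind instead of matching up to the next period is a design choice that keeps the terminating punctuation attached to each sentence without cutting numbers apart.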