
How to use regex to build entire sentences from fragments in Python

I have a VTT file like the following:

JavaScript

I want to extract the fragments from the file and merge them into sentences. The output should look something like this:

JavaScript

I am able to extract the fragments using this:

JavaScript

I am not sure how to extract entire sentences instead of fragments.

Also note that this is just a sample of the VTT file. The entire file contains around 630 fragments, and some of the fragments also contain integers and other special characters.

Any help is appreciated.


Answer

Using re.sub, we can first remove the unwanted repetitive text, such as the timestamp lines. Then a second replacement turns the remaining newlines into single spaces:

JavaScript

This prints:

JavaScript
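
For reference, the two-step re.sub approach described in this answer can be sketched as below. This is not the answerer's exact code: the sample VTT text, the function name vtt_to_sentences, and the assumption that each cue has an optional numeric identifier followed by a standard "start --> end" timestamp line are all hypothetical, since the original file is not shown here. The final split on sentence-ending punctuation is an extra step added to get one sentence per list item.

```python
import re

# Hypothetical WebVTT sample; the real file from the question is not shown.
SAMPLE_VTT = """WEBVTT

1
00:00:00.000 --> 00:00:02.500
Hello everyone, welcome

2
00:00:02.500 --> 00:00:05.000
to the show. Today we

3
00:00:05.000 --> 00:00:07.000
have 3 topics to cover.
"""

# Timestamp such as 00:00:02.500 (the hours part is optional in WebVTT).
TIMESTAMP = r"(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}"

# An optional numeric cue identifier line followed by the timing line.
CUE_HEADER = re.compile(
    rf"^(?:\d+\n)?{TIMESTAMP} --> {TIMESTAMP}[^\n]*\n",
    flags=re.MULTILINE,
)


def vtt_to_sentences(vtt_text):
    """Merge VTT caption fragments into a list of sentences (a sketch)."""
    # Step 1: remove the repetitive, non-caption text.
    text = re.sub(r"^WEBVTT[^\n]*\n", "", vtt_text)  # file header
    text = CUE_HEADER.sub("", text)                  # cue ids and timestamps
    # Step 2: replace the remaining newlines (and any other whitespace runs)
    # with single spaces so the fragments join into continuous text.
    text = re.sub(r"\s+", " ", text).strip()
    # Extra step: split after ., ! or ? when followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text)


for sentence in vtt_to_sentences(SAMPLE_VTT):
    print(sentence)
```

Digits inside a caption (such as the "3" above) survive because only digit-only lines directly in front of a timestamp line are removed. The simple split rule will also break after abbreviations like "Dr.", so it may need tuning for the full file.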