I have an input file like the one below, and my program should remove everything except the Hindi text.
1
00:00:10,240 --> 00:00:13,824
विकास नाम का एक गरीब मजदूर था

2
00:00:14,592 --> 00:00:15,360
जो सेठ

3
00:00:15,616 --> 00:00:16,896
भीमसेन के यहां
Here is my program
#!/usr/bin/python
# -*- coding:utf-8 -*-

import sys
import re
import string
import codecs

def del_brackets(s):
    a = re.compile(r'<.*?>')
    result = a.sub('', s)
    return result.strip('\n').strip()

def del_brackets2(s):
    a = re.compile(r'\[.*?\]')
    result = a.sub('', s)
    return result.strip('\n').strip()

def del_brackets3(s):
    a = re.compile(r'\{.*?\}')
    result = a.sub('', s)
    return result.strip('\n').strip()

def del_brackets4(s):
    a = re.compile(r'\(.*?\)')
    result = a.sub('', s)
    return result.strip('\n').strip()

with open(sys.argv[1], 'r') as f:
    lines = f.readlines()

outfile = open(sys.argv[1].replace('.srt', '.txt'), 'w')

exclude = set('♪"#$%&()*+-/:<=>@[\\]^_`{|}')
for line in lines:
    # print(repr(line))
    line = line.strip()
    #line = unicode(line.strip('\n'), 'utf-8')
    if len(line.strip()) != 0 and line != 1 and line != "1":
        if (not line.isdigit()) and ('-->' not in line):
            line = del_brackets(line)
            line = del_brackets2(line)
            line = del_brackets3(line)
            line = del_brackets4(line)
            line = ' '.join(''.join(' ' if ch in exclude else ch for ch in line).split())
            line = re.sub(r'\.\.\.', ' ', line)
            outfile.write(line.lstrip() + "\n")

outfile.close()
The expected output is below:
विकास नाम का एक गरीब मजदूर था
जो सेठ
भीमसेन के यहां
However, my program doesn't recognize the digit on the first line, and instead it returns:
1
विकास नाम का एक गरीब मजदूर था
जो सेठ
भीमसेन के यहां
Why doesn't this program recognize the digit when I specifically check for 1 and "1"?
Answer
Using regex we can create a simple expression that covers the three cases that you want to ignore:
- timestamp line
- number line
- empty line
From there we can use Python's built-in filter function to drop all of the undesired lines, and write whatever remains to the output file.
import sys, re

def pruneSRTtoTXT(fn):
    fn2 = fn.replace('.srt', '.txt')
    stamp = '[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}'
    # matches a timestamp line, a number line, or an empty line
    ignore = re.compile(rf'^({stamp}\s-->\s{stamp}|[0-9]+|[\r\n]+)$', re.M)

    with open(fn, 'r') as f, open(fn2, 'w') as f2:
        # keep only the lines that the pattern does not match
        f2.writelines(filter(lambda l: not ignore.search(l), f.readlines()))

pruneSRTtoTXT(sys.argv[1])
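Neither the question nor the answer says why the first "1" slips through, so what follows is an assumption rather than a confirmed diagnosis. A common cause is a UTF-8 byte order mark (BOM) at the start of the file: the first line is then read as '\ufeff1', which is not equal to "1", fails isdigit(), and is not removed by strip(). The sketch below uses a hypothetical helper, srt_text_lines, that filters the same three cases (timestamp line, number line, empty line) but opens the file with the 'utf-8-sig' codec, which drops a leading BOM if one is present.

```python
import re
import tempfile
from pathlib import Path

STAMP = r'[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}'
# A whole line that is either a timestamp range or a bare number.
IGNORE = re.compile(rf'{STAMP}\s-->\s{STAMP}|[0-9]+')

def srt_text_lines(path, encoding='utf-8-sig'):
    """Return only the subtitle text lines of an .srt file.

    Hypothetical helper for illustration. The default 'utf-8-sig'
    codec decodes UTF-8 and discards a leading BOM when there is one.
    """
    with open(path, 'r', encoding=encoding) as f:
        stripped = (line.strip() for line in f)
        # Drop empty lines, then drop timestamp and number lines.
        return [line for line in stripped
                if line and not IGNORE.fullmatch(line)]

# Demonstration on a temporary file that is written WITH a BOM.
sample = (
    '1\n00:00:10,240 --> 00:00:13,824\nविकास नाम का एक गरीब मजदूर था\n\n'
    '2\n00:00:14,592 --> 00:00:15,360\nजो सेठ\n'
)
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'sample.srt'
    path.write_text(sample, encoding='utf-8-sig')

    # Read as plain UTF-8: the BOM stays glued to the first "1".
    first = path.read_text(encoding='utf-8').splitlines()[0]
    print(repr(first), first == '1', first.isdigit())

    # Read as utf-8-sig: only the text lines remain.
    print(srt_text_lines(path))
```

If the file has no BOM, 'utf-8-sig' behaves like plain 'utf-8', so it is a safe default here. Uncommenting the print(repr(line)) call in the original program is a quick way to check whether the first line really carries a hidden character.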