I have a series of .txt files and I want to remove the prefix and suffix from their names to make them easier to read (and to do further analysis).
A dummy name would be something like “Test_abcdef_000001.txt”, “Test_abcdef_000002.txt” or “Test_abcdeft_000001.txt”.
To remove the “Test_” and the “_000001.txt” parts, I use rstrip() and lstrip() as follows:
for file in os.listdir(directory):
    if file.endswith(".txt"):
        if file.startswith("Test"):
            print('old name is: ' + file + '\n')
            file = file.lstrip('Test_')
            for i in range(20):
                if file.endswith(str(i).zfill(6) + '.txt'):
                    file_1 = file.rstrip('_' + str(i).zfill(6) + '.txt')
                    print('New name is: ' + file_1 + '\n')
The first for loop scans all the files in the directory. The second for loop, over i, handles the _000001 or _000002 suffix of the test name.
So, for example, with the following 4 test names, I’m expecting 4 “new” tests names:
Test_abcdtt_000001.txt –> abcdtt
Test_abct_000001.txt –> abct
Test_defg_000001.txt –> defg
Test_tcty_000001.txt –> tcty
However, in actual testing, I get the following results:
Test_abcdtt_000001.txt –> abcd
Test_abct_000001.txt –> abc
Test_defg_000001.txt –> defg
Test_tcty_000001.txt –> cty
In other words, all “t” characters next to the “_” are lost, which is sub-optimal. Is there any advice or suggestion for this problem?
Thank you for your time and support.
For reference: I’m using Python 3.7 on my company computer, so please assume that I can NOT upgrade to 3.9 and/or import any fancy library. In addition, some of my files may have _ inside their names, for example Test_ab_ty_ui_000001.txt, and in that case the end result should be ab_ty_ui.
Answer
Maybe try using re to match your desired pattern.
import os
import re

prefix = "Test"
# regex to capture everything between 'Test_' and '_{digits}.txt'
regex = rf"^{prefix}_(.*)_(\d+)\.txt$"
# os.listdir could also be replaced with glob.glob(f"{directory}/{prefix}*") to be more efficient
# (note that glob returns paths that include the directory)
for file_name in os.listdir(directory):
    match = re.match(regex, file_name)
    if match:
        print(match.groups()[0])
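As for why the original attempt loses the “t” characters: str.lstrip() and str.rstrip() treat their argument as a set of characters to remove, not as a literal prefix or suffix, so any leading or trailing character found in that set gets dropped. If you would rather avoid regular expressions, a rough sketch using only slicing and str.rpartition() (both available in Python 3.7) could look like the following. The helper name strip_affixes and its default arguments are made up for illustration and assume the names always follow the Test_<name>_<digits>.txt pattern.

```python
def strip_affixes(file_name, prefix="Test_", ext=".txt"):
    """Return the middle part of 'Test_<name>_<digits>.txt', or None if the name does not fit.

    Hypothetical helper: the name and default arguments are assumptions for illustration.
    """
    if not (file_name.startswith(prefix) and file_name.endswith(ext)):
        return None
    # Slice off the literal prefix and extension instead of stripping character sets
    core = file_name[len(prefix):-len(ext)]
    # Split on the LAST underscore so names like 'ab_ty_ui' keep their inner underscores
    name, sep, counter = core.rpartition("_")
    if not sep or not counter.isdigit():
        return None
    return name


# lstrip() removes every leading character that appears in the given set
print("Test_tcty_000001.txt".lstrip("Test_"))  # cty_000001.txt

for sample in ["Test_abcdtt_000001.txt", "Test_tcty_000001.txt", "Test_ab_ty_ui_000001.txt"]:
    print(sample, "->", strip_affixes(sample))
```

Inside the directory loop you could call strip_affixes(file_name) and skip any result that is None.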