
How do I iterate through an entire directory and select only one class from a multi-class file in Python?

I could use some help iterating through a directory with multi-class files. Each sample contains two classes (for example, the first sample in my database is 1001, and this file includes 1001.dat and 1001.hea), and I want to iterate through my directory and access all .dat files separately from .hea files. Right now, simply iterating through the directory produces a File-Not-Found error.

I’ll supply additional source code to give this some context, but first let me show you where I’m stuck.

Using a PhysioNet ECG database, the goal right now is to analyze every .dat file (my example below runs the Dickey-Fuller test, using adfuller from statsmodels.tsa.stattools). I have uploaded my data to Google Colab using the following:

from google.colab import files
uploaded = files.upload()

I am able to access a specific sample from my database easily. For example, if I want to read a sample using WFDB, I can do this without a problem:

wfdb.rdsamp('1001') #1001 is the name of the first sample in my directory

But when I try to iterate through all of these samples, I run into an issue. Here is what I have so far:

for dat in uploaded:
     file = wfdb.rdsamp(dat) #this is where I get the error (below)

At the commented line, I get the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/content/1001.dat.hea'

I believe this is because each file contains two classes, as you can see when I print the type of my file…

Sourcecode:

print(type(uploaded)) #print directory 'uploaded' type (declared in first code block)
for dat in uploaded:  #iterate through directory 'uploaded'
  print(type(dat))    #print file type

Result:

<class 'dict'>
<class 'str'>
<class 'str'>

So, what I want to do is specify the first class ‘str’ (which is .dat). I only need to use the data contained in 1001.dat, etc. I just don’t know how to specify this in Python.

Now, as promised, some more code for more context.

All this stuff works:

import wfdb
from statsmodels.tsa.stattools import adfuller

#get records
sample = '1001' #first sample in database
record = wfdb.rdrecord(sample)                  #read record
FHR = (wfdb.rdsamp(sample))[0][:,0]             #FHR with 0's; FHR = fetal heart rate
newFHR = [i for i in FHR if i > 0]              #FHR without values <= 0

#plot sample record
wfdb.plot_wfdb(record = record, title = sample)

DF_test_result = adfuller(FHR)     #Dickey-Fuller Test

#print results
print ("Results with values <= 0")
print ( "ADF:")
ADF = DF_test_result[0]
print(ADF, "\n")

DF_test_result = adfuller(newFHR)  #Dickey-Fuller Test

#print results
print ("Results with values > 0")
print ( "ADF:")
ADF = DF_test_result[0]
print(ADF, "\n")

This is what I’m working on now. My syntax might not be entirely correct for the body of my for loop (again, I’m a Python newbie) but I can figure out the rest if I can access the correct samples for each iteration:

#declare arrays for adf & pvals
adf = []
pvals = []

#get records
for dat in uploaded:
  file = wfdb.rdsamp(dat) #ERROR IS HERE
  FHR = file[0][:,0]                              #FHR with 0's
  newFHR = [i for i in FHR if i > 0]              #FHR without 0's

  DF_test_result = adfuller(newFHR)               #Dickey-Fuller Test
  adf.append(DF_test_result[0])                   #add adf
  pvals.append(DF_test_result[1])                 #add pvals

Thank you, and absolutely let me know how I could have formatted this post better. I am still learning how to post useful questions on this platform. This is my 3rd question ever on StackOverflow.


Answer

I found the answer after exploring the rdsamp() function a little more.

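rdsamp() takes a record name without a file extension and locates the record's files itself. This is why rdsamp('1001') works, while rdsamp('1001.dat') ends up looking for '1001.dat.hea', as the error message shows.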
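
To see where the odd path in the error comes from, here is a minimal sketch that mimics the situation without wfdb. It assumes a plain dict with placeholder contents standing in for `uploaded`, and `header_path` is a hypothetical helper that only imitates what the error message suggests happens to the name; it is not wfdb's actual code.

```python
# Stand-in for the dict returned by files.upload(): keys are filenames.
# The empty bytes values are placeholders, not real record data.
uploaded = {"1001.dat": b"", "1001.hea": b""}


def header_path(record_name, base_dir="/content"):
    """Hypothetical helper: append '.hea' to the name, as the error implies."""
    return f"{base_dir}/{record_name}.hea"


# Iterating over a dict yields its keys, so each item is a filename string.
# The two <class 'str'> lines in the question are these two filenames.
names = list(uploaded)
print(names)

# Passing a filename with its extension produces the path from the error.
print(header_path("1001.dat"))

# Passing the bare record name produces the header file that really exists.
print(header_path("1001"))
```

So the two 'str' results are not two classes inside one file. They are the two filenames that belong to the same record.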

The solution, then, is to strip the last 4 characters (the extension, such as '.dat') from each filename:

for file in uploaded:
    print(file[:-4])
    file = wfdb.rdsamp(file[:-4])
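
One thing to watch with the loop above: `uploaded` holds both the .dat and the .hea file of each sample, so every record would be read twice. A possible refinement, sketched here under the assumption that the keys of `uploaded` are filenames like '1001.dat', is to keep only the .dat entries and strip the extension with os.path.splitext. The helper name `record_names` is made up for this example, and the dict below is a stand-in with placeholder contents.

```python
import os


def record_names(filenames):
    """Return sorted, unique record names for every .dat file in `filenames`."""
    names = set()
    for filename in filenames:
        stem, ext = os.path.splitext(filename)
        # Keep only the signal files; the matching .hea entries are skipped.
        if ext.lower() == ".dat":
            names.add(stem)
    return sorted(names)


# Stand-in for the dict returned by files.upload(); values are placeholders.
uploaded = {"1001.dat": b"", "1001.hea": b"", "1002.dat": b"", "1002.hea": b""}

records = record_names(uploaded)
print(records)
```

Each name in `records` could then be passed to wfdb.rdsamp(name) inside the loop, so every sample is read once rather than once per file.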

User contributions licensed under: CC BY-SA