Skip to content
Advertisement

Use list items as column seperators pd.read_fwf

I have text files containing tables which I want to put into a dataframe. Per file the column headers are the same, but the width is different depending on the content (because they contain names of different lengths for example).

So far I managed to get the index of the first character of the header, so I know where the column separation should be. As all files (about 100) have different widths I want to be able to insert this list in order to create a dataframe with the correct column width.

This is what the first two rows look like:

!Column1        !Column2     !Column3       !Column4     !Column5                  
Company with a  $1,000,000   Yes            Jack, Hank   X
name            
Company with.   $2,000       No             Rita, Hannah X
another name

What I tried so far:

(1)

pandas pd.read_fwf(‘file.txt’, colspec(()) – This does work, but with colspec I have to put in the (from, start) indexes for each column. Not only would this be burdensome manually, but some files have 12 columns while others have 10.

(2)

pandas pd.read_fwf(‘file.txt’, widhts(list)) – Here I can easily insert the list with locations of the column separations, but it does not seem to create a separation at those indexes. I do not exactly understand what is does.

Question:

I currently have a list of indexes of all the exclamation marks:

list = [0, 17, 30, 45, 58]

How can I use this list and separate the columns to convert the .txt file into a DataFrame?

Any other way to solve this issue is more than welcome!

Advertisement

Answer

So what you can do is standardize the spacing with regex.

import re
string = "something    something  something  more"
results = re.sub("(W+)", "|", string)
results

That returns

'something|something|something|more'

If you have standardized the delimiters, you can load it with fwf or just read_csv.

EDIT

In order to derive the span of the header that is delimited with a exclamation mark ! you can use the re library too. The logic of the pattern is that the sequence has to start with ! and then is followed up by many non-!. The next group would inherently start with a !. In code it would look something like this:

example_txt = """!Column1        !Column2     !Column3       !Column4     !Column5                  
Company with a  $1,000,000   Yes            Jack, Hank   X
name            
Company with.   $2,000       No             Rita, Hannah X
another name"""

first_line = example_txt.split("n")[0]

import re 

indexes = []
p = re.compile("![^!]*")
for m in p.finditer(first_line):
    indexes.append(m.span())

print(indexes)

Which returns

[(0, 16), (16, 29), (29, 44), (44, 57), (57, 83)]

This should bring you close to what you need for fwf method of pandas. Not that indexing in python starts at 0 and that if the end-index doesn’t count. So if you index from [0:16] then you would get the 0th to 15th element (not including the 16th element), returning 16 elements in total. The index can therefore be directly applied.

User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement