I have text files containing tables which I want to put into a dataframe. Per file the column headers are the same, but the width is different depending on the content (because they contain names of different lengths for example).
So far I managed to get the index of the first character of the header, so I know where the column separation should be. As all files (about 100) have different widths I want to be able to insert this list in order to create a dataframe with the correct column width.
This is what the first two rows look like:
!Column1 !Column2 !Column3 !Column4 !Column5 Company with a $1,000,000 Yes Jack, Hank X name Company with. $2,000 No Rita, Hannah X another name
What I tried so far:
(1)
pandas pd.read_fwf(‘file.txt’, colspec(()) – This does work, but with colspec I have to put in the (from, start) indexes for each column. Not only would this be burdensome manually, but some files have 12 columns while others have 10.
(2)
pandas pd.read_fwf(‘file.txt’, widhts(list)) – Here I can easily insert the list with locations of the column separations, but it does not seem to create a separation at those indexes. I do not exactly understand what is does.
Question:
I currently have a list of indexes of all the exclamation marks:
list = [0, 17, 30, 45, 58]
How can I use this list and separate the columns to convert the .txt file into a DataFrame?
Any other way to solve this issue is more than welcome!
Advertisement
Answer
So what you can do is standardize the spacing with regex.
import re string = "something something something more" results = re.sub("(W+)", "|", string) results
That returns
'something|something|something|more'
If you have standardized the delimiters, you can load it with fwf or just read_csv.
EDIT
In order to derive the span of the header that is delimited with a exclamation mark !
you can use the re
library too. The logic of the pattern is that the sequence has to start with !
and then is followed up by many non-!
. The next group would inherently start with a !
. In code it would look something like this:
example_txt = """!Column1 !Column2 !Column3 !Column4 !Column5 Company with a $1,000,000 Yes Jack, Hank X name Company with. $2,000 No Rita, Hannah X another name""" first_line = example_txt.split("n")[0] import re indexes = [] p = re.compile("![^!]*") for m in p.finditer(first_line): indexes.append(m.span()) print(indexes)
Which returns
[(0, 16), (16, 29), (29, 44), (44, 57), (57, 83)]
This should bring you close to what you need for fwf
method of pandas. Not that indexing in python starts at 0
and that if the end-index doesn’t count. So if you index from [0:16]
then you would get the 0th to 15th element (not including the 16th element), returning 16 elements in total. The index can therefore be directly applied.