For the life of me I can’t figure out how to remove the duplicate index items created when writing a multiindex dataframe to CSV. While there is this answer out there, it doesn’t apply to me per se because my second level has all different values.
This is a chunk of the dataframe I have, it just goes on for each month
Month Measure JAN KWH Max KW Day/Hour Peak End Use Peak Pct FEB KWH Max KW Day/Hour Peak End Use Peak Pct
But in CSV it looks something like
JAN KWH JAN Max KW JAN Day/Hour JAN Peak End Use JAN Peak Pct FEB KWH FEB Max KW FEB Day/Hour FEB Peak End Use FEB Peak Pct
I hope for it to look exactly just like the dataframe so that means in CSV it would look like
Jan,KWH ,Max KW ,Day/Hour and so on
I have tried df.index.to_series().duplicates()
but it doesn’t work because each of my index pairs are not duplicates. I’ve also tried using index.tolist()
then looping through to change with no luck.
What’s the right approach here?
Thanks in advance!
Advertisement
Answer
It seems pandas does not provide directly such functionality, perhaps to ensure that the generated csv files can be read back, as mentioned in the comments above. I faced with the same problem when generating a table for report in rst format. Finally, I decided to process the generated csv by an external script. The script replaces the repeating values in columns with spaces, thus achieving desired result. Note: this script assumes commas as csv separators.
# Clean csv with multiindices from sys import argv # Positional command line arguments: # 1: original csv file # 2: number of columns to check (optional) with open(argv[1], 'r') as f: if len(argv) > 2: n = int(argv[2]) i0 = ('', ) * n else: n = None i0 = None for l in f: if n is None: # define n from the 1-st line i1 = l.split(',') n = len(i1) i0 = ('', ) * n i1 = l.split(',', n) nl = () for s0, s1 in zip(i0, i1): if s0 == s1: e = ' ' * len(s0) else: e = s1 nl += (e, ) nl += tuple(i1[n:]) print ','.join(nl), i0 = i1[:n]