Skip to content
Advertisement

Pandas read multiindexed csv with blanks

I’m struggling with properly loading a csv that has a multi lines header with blanks. The CSV looks like this:

,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8

CSV Header

What I would like to get is:

Desired Pandas Header

When I try to load with pd.read_csv(file, header=[0,1], sep=','), I end up with the following:

Incorrect result

Is there a way to get the desired result?


Note: alternatively, I would accept this as a result:

Alternative result


Versions used:

  • Python: 2.7.8
  • Pandas 0.16.0

Advertisement

Answer

Here is an automated way to fix the column index. First, pull the column level values into a DataFrame:

columns = pd.DataFrame(df.columns.tolist())

then rename the Unnamed: columns to NaN:

columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan

and then forward-fill the NaNs:

columns[0] = columns[0].fillna(method='ffill')

so that columns now looks like

In [314]: columns
Out[314]: 
     0  1
0  NaN  A
1  NaN  B
2    C  X
3    C  Y
4    C  Z
5    D  X
6    D  Y
7    D  Z

Now we can find the remaining NaNs and fill them with empty strings:

mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')

To make the first two columns, A and B, indexable as df['A'] and df['B'] — as though they were single-leveled — you could swap the values in the first and second columns:

columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values

Now you can build a new MultiIndex and assign it to df.columns:

df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())

Putting it all together, if data is

,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
3,4,5,6,7,8,9,0

then

import numpy as np
import pandas as pd
df = pd.read_csv('data', header=[0,1], sep=',')
columns = pd.DataFrame(df.columns.tolist())
columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan
columns[0] = columns[0].fillna(method='ffill')
mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')
columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())
    print(df)

yields

   A  B  C        D      
         X  Y  Z  X  Y  Z
0  1  2  3  4  5  6  7  8
1  3  4  5  6  7  8  9  0
User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement