Pandas read multiindexed csv with blanks

Question

I'm struggling to properly load a CSV that has a multi-line header with blanks. The CSV looks like this: What I would like to get is: When I try to load it with pd.read_csv(file, header=[0,1], sep=','), I end up with the following: Is there a way to get the desired result? Note: alternatively, I would accept this as a result:
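For reference, here is a minimal way to reproduce the problem. It is only a sketch: it assumes the sample data quoted in the accepted answer (two header lines, with blank cells in the first one) and reads it from an in-memory string rather than a file. The names CSV_TEXT and raw are illustrative only.

```python
import io

import pandas as pd

# Assumed sample data, taken from the accepted answer.
CSV_TEXT = """\
,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
3,4,5,6,7,8,9,0
"""

# Read both header lines as a two-level column index.
raw = pd.read_csv(io.StringIO(CSV_TEXT), header=[0, 1], sep=",")

# Print each (top, bottom) label pair to see how pandas
# labelled the blank cells of the first header line.
for top, bottom in raw.columns:
    print(repr(top), repr(bottom))
```

Blank cells in the first header line show up as placeholder labels beginning with "Unnamed:", which is what the fix has to clean up.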

Accepted Answer

Here is an automated way to fix the column index. First, pull the column level values into a DataFrame:

    columns = pd.DataFrame(df.columns.tolist())

then replace the Unnamed: labels in the first level with NaN:

    columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan

and then forward-fill the NaNs:

    columns[0] = columns[0].ffill()

so that columns now looks like

    In [314]: columns
    Out[314]:
         0  1
    0  NaN  A
    1  NaN  B
    2    C  X
    3    C  Y
    4    C  Z
    5    D  X
    6    D  Y
    7    D  Z

Now we can find the remaining NaNs and fill them with empty strings:

    mask = pd.isnull(columns[0])
    columns[0] = columns[0].fillna('')

To make the first two columns, A and B, indexable as df['A'] and df['B'] (as though they were single-leveled), you could swap the values in the first and second columns:

    columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values

Now you can build a new MultiIndex and assign it to df.columns:

    df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())

Putting it all together, if data is

    ,,C,,,D,,
    A,B,X,Y,Z,X,Y,Z
    1,2,3,4,5,6,7,8
    3,4,5,6,7,8,9,0

then

    import numpy as np
    import pandas as pd

    df = pd.read_csv('data', header=[0,1], sep=',')
    columns = pd.DataFrame(df.columns.tolist())
    columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan
    columns[0] = columns[0].ffill()
    mask = pd.isnull(columns[0])
    columns[0] = columns[0].fillna('')
    columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
    df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())
    print(df)

yields

       A  B  C        D
             X  Y  Z  X  Y  Z
    0  1  2  3  4  5  6  7  8
    1  3  4  5  6  7  8  9  0
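The same idea can also be packaged as a small reusable helper. The version below is a sketch of a variation, not the exact method above: it walks the column tuples in plain Python instead of going through an intermediate DataFrame, and it assumes a two-level header. The name fix_header is hypothetical (it is not a pandas function), and the data is read from a string so the example is self-contained.

```python
import io

import pandas as pd

# Sample data copied from the answer above.
CSV_TEXT = """\
,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
3,4,5,6,7,8,9,0
"""


def fix_header(columns):
    """Return a two-level MultiIndex with 'Unnamed:' placeholders repaired.

    fix_header is a hypothetical helper name, not part of pandas.
    """
    fixed = []
    last_top = None  # most recent real top-level label seen so far
    for top, bottom in columns.tolist():
        if str(top).startswith("Unnamed:"):
            if last_top is None:
                # Nothing to forward-fill from: promote the lower label
                # to the top level and leave the lower level empty.
                fixed.append((bottom, ""))
            else:
                # Blank cell under a spanning header: reuse that header.
                fixed.append((last_top, bottom))
        else:
            last_top = top
            fixed.append((top, bottom))
    return pd.MultiIndex.from_tuples(fixed)


df = pd.read_csv(io.StringIO(CSV_TEXT), header=[0, 1], sep=",")
df.columns = fix_header(df.columns)
print(df)
```

With the columns rebuilt this way, df[('C', 'X')] selects a single column by its full label, and A and B sit at the top level with an empty string as their second-level label.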

Answer