I’m struggling with properly loading a csv that has a multi lines header with blanks. The CSV looks like this:
,,C,,,D,, A,B,X,Y,Z,X,Y,Z 1,2,3,4,5,6,7,8
What I would like to get is:
When I try to load with pd.read_csv(file, header=[0,1], sep=',')
, I end up with the following:
Is there a way to get the desired result?
Note: alternatively, I would accept this as a result:
Versions used:
- Python: 2.7.8
- Pandas 0.16.0
Advertisement
Answer
Here is an automated way to fix the column index. First, pull the column level values into a DataFrame:
columns = pd.DataFrame(df.columns.tolist())
then rename the Unnamed:
columns to NaN
:
columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan
and then forward-fill the NaNs:
columns[0] = columns[0].fillna(method='ffill')
so that columns
now looks like
In [314]: columns Out[314]: 0 1 0 NaN A 1 NaN B 2 C X 3 C Y 4 C Z 5 D X 6 D Y 7 D Z
Now we can find the remaining NaNs and fill them with empty strings:
mask = pd.isnull(columns[0]) columns[0] = columns[0].fillna('')
To make the first two columns, A
and B
, indexable as df['A']
and df['B']
— as though they were single-leveled — you could swap the values in the first and second columns:
columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
Now you can build a new MultiIndex and assign it to df.columns
:
df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())
Putting it all together, if data
is
,,C,,,D,, A,B,X,Y,Z,X,Y,Z 1,2,3,4,5,6,7,8 3,4,5,6,7,8,9,0
then
import numpy as np import pandas as pd df = pd.read_csv('data', header=[0,1], sep=',') columns = pd.DataFrame(df.columns.tolist()) columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan columns[0] = columns[0].fillna(method='ffill') mask = pd.isnull(columns[0]) columns[0] = columns[0].fillna('') columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist()) print(df)
yields
A B C D X Y Z X Y Z 0 1 2 3 4 5 6 7 8 1 3 4 5 6 7 8 9 0