Skip to content
Advertisement

Can’t figure out why pandas.concat is creating extra column when concatenating two frames

I’ve been trying to concatenate two sheets while preserving the original indices of both dataframes. However, upon concatenation I can’t seem to get the result to output the way I expect or want.

If I use ignore_index = True The old indices are replaced by an index that encompasses both sheets total rows. If I use ignore_index = False The indices are preserved, but there is a new empty column preceding it shifting the rest of columns over by one.

How can I concatenate my sheets without this excess column?

import pandas as pd
import easygui

path = easygui.fileopenbox("Select a file")
xls = pd.ExcelFile(path)

potential_names = [sheet for sheet in xls.sheet_names if sheet.startswith('Numbers_')]

df = pd.concat(pd.read_excel(xls, sheet_name = potential_names), ignore_index = True)

command_list = ["AA", "dd", "DD"]
warning_list = ["Dd", "dD"]
ingenium_list = ["CC", "BB"]

col_name_type = 'TYPE'
col_name_other = 'OTHER'
col_name_source = 'SOURCE'
df_filtered_command = df[df[col_name_type].isin(command_list)]
df_filtered_warnings = df[df[col_name_type].isin(warning_list)]
df_filtered_other = df[df[col_name_other].isin(ingenium_list)]

with pd.ExcelWriter(('(Processed) ' + os.path.basename(path))) as writer:
    #df_filtered_command.to_excel(writer, sheet_name="Command", index=True)
    df_final_command.to_excel(writer, sheet_name=("Command"), index=True)
    df_filtered_warnings.to_excel(writer, sheet_name="Warnings", index=True)
    df_filtered_other.to_excel(writer, sheet_name="Issues", index=True)

I suspect how the concat function is working, but I’ve not been able to figure out how to fix it.

Any help or direction would be amazing.

Edit: adding an example of my df after running.

Example output

I was mistaken before, it seems the first column is empty aside from the sheet names, but I’m still not able to find a way to prevent pandas from making that first column if I don’t remake the index.

Advertisement

Answer

Since you passed in a dictionary (not a list) of data frames from using a list in sheet_names of pandas.read_excel, pandas.concat will preserve the dict keys, as specified for the first argument:

objs: a sequence or mapping of Series or DataFrame objects

If a mapping is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below).

Consequently, the sheet names (i.e., dict keys) and the index values of each data frame migrate as a MultIindex. Consider using the names argument to name the indexes:

names: list, default None

Names for the levels in the resulting hierarchical index.

sheets_df = pd.concat(
    pd.read_excel(xls, sheet_name = potential_names),
    names = ["sheet_name", "index"]
)

If you want to convert the dual index as new columns, simple run reset_index afterwards:

sheets_df = (
    pd.concat(
        pd.read_excel(xls, sheet_name = potential_names),
        names = ["sheet_name", "index"]
    ).reset_index()
)
User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement