How to increase process speed using read_excel in pandas?

Question

I need use pd.read_excel to process every sheet in one excel file. But in most cases,I did not know the sheet name. So I use this to judge how many sheet in excel: During the process,I found that the process is quite slow, So,can read_excel only read limited rows to improve the speed? I tried nrows but did not work..still

Accepted Answer

Read all worksheets without guessingUse sheet_name = None argument to pd.read_excel. This will read all worksheets into a dictionary of dataframes. For example:dfs = pd.read_excel('file.xlsx', sheet_name=None)# access 'Sheet1' worksheetres = dfs['Sheet1']Limit number of rows or columnsYou can use parse_cols and skip_footer arguments to limit the number of columns and/or rows. This will reduce read time, and also works with sheet_name = None.For example, the following will read the first 3 columns and, if your worksheet has 100 rows, it will read only the first 20.df = pd.read_excel('file.xlsx', sheet_name=None, parse_cols='A:C', skip_footer=80)If you wish to apply worksheet-specific logic, you can do so by extracting sheet_names:sheet_names = pd.ExcelFile('file.xlsx', on_demand=True).sheet_namesdfs = {}for sheet in sheet_names:    dfs[sheet] = pd.read_excel('file.xlsx', sheet)Improving performanceReading Excel files into Pandas is naturally slower than other options (CSV, Pickle, HDF5). If you wish to improve performance, I strongly suggest you consider these other formats.One option, for example, is to use a VBA script to convert your Excel worksheets to CSV files; then use pd.read_csv.Edit 02 Nov: correct sheetname to sheet_name

Advertisement

Answer

Read all worksheets without guessing

Limit number of rows or columns

Improving performance