I have a 14MB Excel file with five worksheets that I’m reading into a Pandas dataframe, and although the code below works, it takes 9 minutes!
Does anyone have suggestions for speeding it up?
JavaScript
x
18
18
1
import pandas as pd
2
3
def OTT_read(xl,site_name):
4
df = pd.read_excel(xl.io,site_name,skiprows=2,parse_dates=0,index_col=0,
5
usecols=[0,1,2],header=None,
6
names=['date_time','%s_depth'%site_name,'%s_temp'%site_name])
7
return df
8
9
def make_OTT_df(FILEDIR,OTT_FILE):
10
xl = pd.ExcelFile(FILEDIR + OTT_FILE)
11
site_names = xl.sheet_names
12
df_list = [OTT_read(xl,site_name) for site_name in site_names]
13
return site_names,df_list
14
15
FILEDIR='c:/downloads/'
16
OTT_FILE='OTT_Data_All_stations.xlsx'
17
site_names_OTT,df_list_OTT = make_OTT_df(FILEDIR,OTT_FILE)
18
Advertisement
Answer
As others have suggested, csv reading is faster. So if you are on windows and have Excel, you could call a vbscript to convert the Excel to csv and then read the csv. I tried the script below and it took about 30 seconds.
JavaScript
1
12
12
1
# create a list with sheet numbers you want to process
2
sheets = map(str,range(1,6))
3
4
# convert each sheet to csv and then read it using read_csv
5
df={}
6
from subprocess import call
7
excel='C:\Users\rsignell\OTT_Data_All_stations.xlsx'
8
for sheet in sheets:
9
csv = 'C:\Users\rsignell\test' + sheet + '.csv'
10
call(['cscript.exe', 'C:\Users\rsignell\ExcelToCsv.vbs', excel, csv, sheet])
11
df[sheet]=pd.read_csv(csv)
12
Here’s a little snippet of python to create the ExcelToCsv.vbs script:
JavaScript
1
31
31
1
#write vbscript to file
2
vbscript="""if WScript.Arguments.Count < 3 Then
3
WScript.Echo "Please specify the source and the destination files. Usage: ExcelToCsv <xls/xlsx source file> <csv destination file> <worksheet number (starts at 1)>"
4
Wscript.Quit
5
End If
6
7
csv_format = 6
8
9
Set objFSO = CreateObject("Scripting.FileSystemObject")
10
11
src_file = objFSO.GetAbsolutePathName(Wscript.Arguments.Item(0))
12
dest_file = objFSO.GetAbsolutePathName(WScript.Arguments.Item(1))
13
worksheet_number = CInt(WScript.Arguments.Item(2))
14
15
Dim oExcel
16
Set oExcel = CreateObject("Excel.Application")
17
18
Dim oBook
19
Set oBook = oExcel.Workbooks.Open(src_file)
20
oBook.Worksheets(worksheet_number).Activate
21
22
oBook.SaveAs dest_file, csv_format
23
24
oBook.Close False
25
oExcel.Quit
26
""";
27
28
f = open('ExcelToCsv.vbs','w')
29
f.write(vbscript.encode('utf-8'))
30
f.close()
31
This answer benefited from Convert XLS to CSV on command line and csv & xlsx files import to pandas data frame: speed issue