pandas to_excel converts _x10e6_ to ღ. How do I avoid this?

I have been trying to create an Excel file with several sheets from Delta tables, but some of my column names include _x10e6_, which is apparently translated to ღ. I have tried encoding='unicode_escape' and encoding='utf-8' without luck. I cannot use xlsxwriter because I am appending to an existing file.

Does anybody know how I can keep _x10e6_ in the column names?

A part of the code I am using:

goldDF = (spark.read.format('csv')
    .option('header', True)
    .option('delimiter', ',')
    .load(file_path))

pandaDF = goldDF.toPandas()


with pd.ExcelWriter('/tmp/output.xlsx', mode='a', engine='openpyxl') as writer:
    pandaDF.to_excel(writer, sheet_name=name, index=False)

dbutils.fs.cp("file:/tmp/output.xlsx", "/mnt/delta_lake_path")    


Answer

The issue is that _x10e6_ is an XML escape code for the code point U+10E6, which is the Unicode character "ღ" (Georgian letter ghan). When openpyxl reads the name back, it unescapes the sequence into that character.
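A quick sanity check in plain Python (nothing here is specific to openpyxl) shows the mapping between the escape sequence and the character:

```python
# In Office Open XML, _xHHHH_ encodes the character with code point 0xHHHH.
# 0x10E6 is the Georgian letter ghan, so "_x10e6_" round-trips to that character.
ch = chr(0x10E6)
print(ch)  # prints ღ
```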

In a similar case Excel itself would escape the name as _x005F_x10e6_ (_x005F_ is the escape for the underscore, which neutralizes the _x10e6_ sequence). openpyxl doesn't do this for you, so you could apply the escape to the column names manually before writing, or (for anyone reading this who isn't constrained by the append-mode requirement) switch the engine to xlsxwriter, which escapes it automatically.
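One way to apply that manual escape is a sketch like the following; the helper name escape_excel_name is my own, and it assumes any _xHHHH_ pattern in a column name should be kept literal:

```python
import re

# Matches an Office Open XML character escape such as "_x10e6_".
_XML_ESCAPE = re.compile(r'_x([0-9A-Fa-f]{4})_')

def escape_excel_name(name: str) -> str:
    """Prefix the leading underscore of each _xHHHH_ sequence with _x005F_
    (the escape for "_" itself), so Excel/openpyxl keep it as literal text."""
    return _XML_ESCAPE.sub(r'_x005F_x\1_', name)
```

Applied to the code in the question, this would run just before the write, e.g. pandaDF.columns = [escape_excel_name(c) for c in pandaDF.columns].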

User contributions licensed under: CC BY-SA