Skip to content
Advertisement

How to extract Excel PivotCache into Pandas Data Frame?

First time posting here, I apologize if this question has been asked before – I can’t find anything that applies.

Is there a way to read the underlying data from an Excel PivotTable into a Pandas Data Frame? For several years I’ve had an Excel Auto_Open macro that downloads several Excel files and double clicks on the “Grand Total” row in order to extract all of the data, which ultimate gets imported into a database. This is done because the owners of the source data refuse to grant access to the database itself.

This macro has never been the ideal scenario and we need to move it to a better method soon. I have extensive SQL knowledge but have only recently begun to learn Python.

I have been able to read worksheets using OpenPyXl, but these files do not contain the source data on a separate worksheet by default – the pivotcache must be extracted to a new sheet first. What I would like to do, if possible, is read from the Excel PivotCache into a Pandas Data Frame and either save that output as a CSV or load it directly into our database. It seems that this is not capable with OpenPyXl and that I’ll probably need to use win32com.client.

Does anybody have any experience with this, and know if it’s even possible? Any pointers for where I might get started? I’ve tried several items from the Excel Object model (PivotCache, GetData, etc etc) but either I don’t know how to use them or they don’t return what I need.

Any help would be much appreciated. Thanks!

Advertisement

Answer

This answer is very late, but I came up with it while struggling with the same issue, and some of the comments above helped me nail it.

In essence, the steps one can take to solve this with openpyxl are:

  1. Use openpyxl to get the openpyxl.pivot.table.TableDefinition object from the desired pivot table (let’s call it my_pivot_table)
  2. Get cached fields and their values from my_pivot_table.cache.cacheFields
  3. Get rows data as a dict in two sub-steps:
    • 3.1) Get all cached rows and their values from my_pivot_table.cache.records.r. Cache fields in these records are stored as indexes from my_pivot_table.cache.cacheFields
    • 3.2) Replace cache fields from each record by their actual values, by “joining” cache.records.r and cache.cacheFields
  4. Convert dict with rows into a pandas DataFrame

Below you will find a copy of the code that implements such solution. Since the structure of these Excel objects are somewhat complex, the code will probably look very cryptic (sorry about that). To address this, I’m adding further below minimal examples of the main objects being manipulated, so people can get a better sense of what is going on, what are the objects being returned, etc.

This was the simplest approach I could find to achieve this. I hope it is still useful for someone, albeit some tweaking may be needed for individual cases.

“Bare” code

JavaScript

Results:

JavaScript

Some of the objects details

Object pivot_table

This is an object of type openpyxl.pivot.table.TableDefinition. It is quite complex. A small glimpse of it:

JavaScript

Object fields_map (from cache.cacheFields)

This is a dict with column name and their available values:

JavaScript

Object row_dict (before mapping)

Each row is a dict with column names and their values. Raw values for cache fields are not stored here. Here they are represented by their indexes in cache.cacheFields (see above)

JavaScript

Object row_dict (after mapping)

After extracting raw values for cache fields from their indexes, we have a dict that represent all values of a row:

JavaScript
User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement