Skip to content
Advertisement

How to extract Excel PivotCache into Pandas Data Frame?

First time posting here, I apologize if this question has been asked before – I can’t find anything that applies.

Is there a way to read the underlying data from an Excel PivotTable into a Pandas Data Frame? For several years I’ve had an Excel Auto_Open macro that downloads several Excel files and double clicks on the “Grand Total” row in order to extract all of the data, which ultimate gets imported into a database. This is done because the owners of the source data refuse to grant access to the database itself.

This macro has never been the ideal scenario and we need to move it to a better method soon. I have extensive SQL knowledge but have only recently begun to learn Python.

I have been able to read worksheets using OpenPyXl, but these files do not contain the source data on a separate worksheet by default – the pivotcache must be extracted to a new sheet first. What I would like to do, if possible, is read from the Excel PivotCache into a Pandas Data Frame and either save that output as a CSV or load it directly into our database. It seems that this is not capable with OpenPyXl and that I’ll probably need to use win32com.client.

Does anybody have any experience with this, and know if it’s even possible? Any pointers for where I might get started? I’ve tried several items from the Excel Object model (PivotCache, GetData, etc etc) but either I don’t know how to use them or they don’t return what I need.

Any help would be much appreciated. Thanks!

Advertisement

Answer

This answer is very late, but I came up with it while struggling with the same issue, and some of the comments above helped me nail it.

In essence, the steps one can take to solve this with openpyxl are:

  1. Use openpyxl to get the openpyxl.pivot.table.TableDefinition object from the desired pivot table (let’s call it my_pivot_table)
  2. Get cached fields and their values from my_pivot_table.cache.cacheFields
  3. Get rows data as a dict in two sub-steps:
    • 3.1) Get all cached rows and their values from my_pivot_table.cache.records.r. Cache fields in these records are stored as indexes from my_pivot_table.cache.cacheFields
    • 3.2) Replace cache fields from each record by their actual values, by “joining” cache.records.r and cache.cacheFields
  4. Convert dict with rows into a pandas DataFrame

Below you will find a copy of the code that implements such solution. Since the structure of these Excel objects are somewhat complex, the code will probably look very cryptic (sorry about that). To address this, I’m adding further below minimal examples of the main objects being manipulated, so people can get a better sense of what is going on, what are the objects being returned, etc.

This was the simplest approach I could find to achieve this. I hope it is still useful for someone, albeit some tweaking may be needed for individual cases.

“Bare” code

import numpy as np
import pandas as pd
from openpyxl import load_workbook
from openpyxl.pivot.fields import Missing

file_path = 'path/to/your/file.xlsx'

workbook = load_workbook(file_path)
worksheet = workbook['Plan1']

# Name of desired pivot table (the same name that appears within Excel)
pivot_name = 'Tabela dinĂ¢mica1'

# Extract the pivot table object from the worksheet
pivot_table = [p for p in worksheet._pivots if p.name == pivot_name][0]

# Extract a dict of all cache fields and their respective values
fields_map = {}
for field in pivot_table.cache.cacheFields:
    if field.sharedItems.count > 0:
        fields_map[field.name] = [f.v for f in field.sharedItems._fields]

# Extract all rows from cache records. Each row is initially parsed as a dict
column_names = [field.name for field in pivot_table.cache.cacheFields]
rows = []
for record in pivot_table.cache.records.r:
    # If some field in the record in missing, we replace it by NaN
    record_values = [
        field.v if not isinstance(field, Missing) else np.nan for field in record._fields
    ]

    row_dict = {k: v for k, v in zip(column_names, record_values)}

    # Shared fields are mapped as an Index, so we replace the field index by its value
    for key in fields_map:
        row_dict[key] = fields_map[key][row_dict[key]]

    rows.append(row_dict)

df = pd.DataFrame.from_dict(rows)

Results:

>>> df.head(2)

            FUEL    YEAR   REGION     STATE  UNIT       Jan       Feb  (...)
0  GASOLINE (m3)  2000.0        S     TEXAS    m3  9563.263  9563.263  (...)
1  GASOLINE (m3)  2000.0       NE  NEW YORK    m3  3065.758  9563.263  (...)

Some of the objects details

Object pivot_table

This is an object of type openpyxl.pivot.table.TableDefinition. It is quite complex. A small glimpse of it:

<openpyxl.pivot.table.TableDefinition object>
Parameters:
name='Tabela dinĂ¢mica1', cacheId=36, dataOnRows=True, dataPosition=None, (A LOT OF OMITTED STUFF...)
Parameters:
ref='B52:W66', firstHeaderRow=1, firstDataRow=2, firstDataCol=1, rowPageCount=2, colPageCount=1, pivotFields=[<openpyxl.pivot.table.PivotField object>
Parameters: (A LOT OF OMITTED STUFF...)

Object fields_map (from cache.cacheFields)

This is a dict with column name and their available values:

{'YEAR': [2000.0, 2001.0, 2002.0, 2003.0, 2004.0, 2005.0, 2006.0, 2007.0, 2008.0,
          2009.0, 2010.0, 2011.0, 2012.0, 2013.0, 2014.0, 2015.0, 2016.0, 2017.0,
          2018.0, 2019.0, 2020.0],
 'FUEL': ['GASOLINE (m3)', 'AVIATION GASOLINE (m3)', 'KEROSENE (m3)'],
 'STATE': ['TEXAS', 'NEW YORK', 'MAINE', (...)],
 'REGION': ['S', 'NE', 'N', (...)]}

Object row_dict (before mapping)

Each row is a dict with column names and their values. Raw values for cache fields are not stored here. Here they are represented by their indexes in cache.cacheFields (see above)

{'YEAR': 0,  # <<<--- 0 stands for index in fields_map
 'Jan': 10719.983,
 'Feb': 12482.281,
 'FUEL': 0,  # <<<--- index in fields_map
 'Dec': 10818.094,
 'STATE': 0,  # <<<--- index in fields_map
 (...)
 'UNIT': 'm3'}

Object row_dict (after mapping)

After extracting raw values for cache fields from their indexes, we have a dict that represent all values of a row:

{'YEAR': 2000.0,  # extracted column value from index in fields_map
 'Jan': 10719.983,
 'Feb': 12482.281,
 'FUEL': 'GASOLINE (m3)',  # extracted from fields_map
 'Dec': 10818.094,
 'STATE': 'TEXAS',  # extracted from fields_map
 (...)
 'UNIT': 'm3'}
User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement